anuragpradhan committed · Commit 6a7c0b1 · verified · 1 Parent(s): 6f3a64d

Add SmolVLM2-500M sidecar pipeline (DepthBridge + ObjectAnchorProjector)

Files changed (2)
  1. README.md +209 -17
  2. model.safetensors +1 -1
README.md CHANGED
@@ -1,41 +1,233 @@
1
  ---
2
  license: apache-2.0
3
  base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
4
  tags:
5
  - smolvlm
6
  - depth-estimation
7
  - object-detection
8
  - multimodal
9
- - sidecar-pipeline
10
  ---
11
 
12
- # SmolVLM2-500M-Video-Instruct + Sidecar Pipeline
13
 
14
- This model extends [SmolVLM2-500M-Video-Instruct](HuggingFaceTB/SmolVLM2-500M-Video-Instruct) with two
15
- lightweight sidecar modules for grounded spatial reasoning:
16
 
17
- | Module | Params | Purpose |
18
  |---|---|---|
19
- | **DepthBridge** | ~760 K | Fuses Depth-Anything-V2-Metric depth maps into SigLIP patch tokens via a gated residual |
20
- | **ObjectAnchorProjector** | ~1.3 K | Projects YOLOv8-World CLIP embeddings into LM anchor tokens |
21
 
22
- ## Sidecar config flags
23
 
24
- ```python
25
- config.depth_integration = True # enables DepthBridge
26
- config.object_integration = True # enables ObjectAnchorProjector (train first)
27
  ```
28
 
29
- ## Loading
30
 
31
  ```python
32
  from transformers import AutoProcessor, AutoModelForImageTextToText
33
 
34
- model = AutoModelForImageTextToText.from_pretrained("anuragpradhan/SmolVLM2-500M-DepthAwareVLM")
35
- processor = AutoProcessor.from_pretrained("anuragpradhan/SmolVLM2-500M-DepthAwareVLM")
36
  ```
37
 
38
- ## Fine-tuning note
39
 
40
- DepthBridge gate α is initialised at 0.0 (depth is inactive until trained).
41
- Run `model.freeze_base_models()` to train only the sidecar modules.
 
 
1
  ---
2
  license: apache-2.0
3
  base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
4
+ language:
5
+ - en
6
  tags:
7
  - smolvlm
8
+ - vision-language-model
9
  - depth-estimation
10
  - object-detection
11
+ - spatial-reasoning
12
  - multimodal
13
+ - depth-aware
14
+ - metric-depth
15
+ pipeline_tag: image-text-to-text
16
  ---
17
 
18
+ # SmolVLM2-500M-DepthAwareVLM
19
 
20
+ **SmolVLM2-500M-DepthAwareVLM** extends [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
21
+ with a lightweight sidecar pipeline that fuses **metric depth maps** (from Depth-Anything-V2) and
22
+ **object detection anchors** (from YOLOv8-World) directly into the vision-language forward pass,
23
+ targeting grounded spatial questions such as *"How far is the car?"*. Before fine-tuning, depth reaches
24
+ the model only as a text hint in the prompt; the fused depth path activates once the sidecars are trained.
25
 
26
+ ---
27
+
28
+ ## Architecture
29
+
30
+ ```
+                  Image (RGB)
+                       |
+            +----------+----------+
+            |                     |
+    SigLIP ViT-B/16       Depth-Anything-V2
+    (Vision Encoder)      Metric-Outdoor-Small
+    86.4M params          (external, not saved)
+            |                     |
+    Patch embeddings      Depth map (H x W, metres)
+            |                     |
+            +----> DepthBridge <--+        <- NEW (262 K params)
+               Gated residual fusion
+               gate alpha = 0.0 at init, learns during fine-tuning
+                       |
+         Connector (pixel-shuffle + MLP)
+                  11.8M params
+                       |
+               LM token sequence
+                       |
+       [Optional] ObjectAnchorProjector     <- NEW (498 K params)
+       YOLOv8-World detections -> K anchor tokens appended
+                       |
+       SmolLM2 Language Model (Llama backbone)
+                 361.9M params
+                       |
+                    Answer
57
+ ```
58
+
59
+ ---
60
+
61
+ ## Parameter Breakdown
62
+
63
+ | Component | Parameters | % of Total |
64
  |---|---|---|
65
+ | Vision encoder (SigLIP) | 86,433,024 | 17.006% |
66
+ | Connector (pixel-shuffle MLP) | 11,796,480 | 2.321% |
67
+ | Language model (SmolLM2) | 361,944,000 | 71.215% |
68
+ | **DepthBridge** (sidecar) | 262,913 | 0.052% |
69
+ | **ObjectAnchorProjector** (sidecar) | 498,240 | 0.098% |
70
+ | **Sidecar total** | **761,153** | **0.150%** |
71
+ | **GRAND TOTAL** | **508,243,457** | 100% |
72
 
73
+ The two sidecar modules make up only **0.15%** of the 508M total; everything else belongs to the frozen base model.
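As a quick arithmetic check on the table above, the sketch below sums the listed rows and recomputes each share of the grand total. The variable names are illustrative only. One thing the check surfaces: the listed rows add up to about 47.3M fewer parameters than the grand total, so the percentage column sums to roughly 90.7% rather than 100%. The table does not say where the remaining parameters sit.

```python
# Parameter counts copied from the table above.
rows = {
    "vision_encoder": 86_433_024,
    "connector": 11_796_480,
    "language_model": 361_944_000,
    "depth_bridge": 262_913,
    "object_anchor_projector": 498_240,
}
grand_total = 508_243_457

listed = sum(rows.values())
unlisted = grand_total - listed  # parameters not attributed to any row
sidecar = rows["depth_bridge"] + rows["object_anchor_projector"]

# Share of the grand total for each row, rounded like the table.
shares = {name: round(100 * n / grand_total, 3) for name, n in rows.items()}
sidecar_share = round(100 * sidecar / grand_total, 3)

print(f"listed rows: {listed:,}")
print(f"unlisted:    {unlisted:,}")
print(f"sidecar:     {sidecar:,} ({sidecar_share}%)")
```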
74
+
75
+ ---
76
+
77
+ ## Sidecar Modules
78
+
79
+ ### 1. DepthBridge
80
+ - **Input:** Metric depth map `(B, 1, H, W)` from Depth-Anything-V2-Metric-Outdoor-Small
81
+ - **Architecture:** `Conv2d(1->256, k=16, s=16)` -> `LayerNorm(256)` -> `Linear(256->768)`
82
+ - **Fusion:** Gated residual: `patch_emb = patch_emb + gate * depth_features`
83
+ - **Gate alpha:** Initialised at **0.0**, so depth contributes nothing at init; the gate value is learned during fine-tuning
84
+ - **Effect:** Vision patches receive metric depth context at the embedding level, before the connector
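The gated residual in the DepthBridge bullets can be illustrated with plain Python lists. This is a minimal sketch of the formula `patch_emb + gate * depth_features` only, not the module's actual tensor code; the function name and the toy values are made up for the example. It shows why a gate of 0.0 leaves the patch embeddings untouched.

```python
def gated_residual(patch_emb, depth_features, gate):
    """Return patch_emb + gate * depth_features, element by element."""
    return [
        [p + gate * d for p, d in zip(p_row, d_row)]
        for p_row, d_row in zip(patch_emb, depth_features)
    ]

# Two toy "patches" with an embedding width of 2 (real widths are far larger).
patches = [[0.5, -1.0], [2.0, 0.25]]
depth = [[4.0, 4.0], [8.0, 8.0]]

unchanged = gated_residual(patches, depth, gate=0.0)  # gate at its initial value
mixed = gated_residual(patches, depth, gate=0.5)      # gate after some training

print(unchanged)
print(mixed)
```

With the gate at zero the fused output equals the input, so loading the sidecar does not perturb the base model's vision features until the gate has been trained away from zero.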
85
+
86
+ ### 2. ObjectAnchorProjector
87
+ - **Input:** YOLOv8-World detections — bounding boxes `(K, 4)` + CLIP class embeddings `(K, 512)` + depth `(K, 1)`
88
+ - **Architecture:** `Linear(517->960)` -> `LayerNorm(960)`
89
+ - **Fusion:** K anchor tokens appended to the LM input sequence after image-text merging
90
+ - **Note:** Enable after fine-tuning. Random weights before training add noise; disable with `config.object_integration = False`
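The layer shapes listed for both modules can be checked against the published parameter counts. The README does not say whether the layers carry bias terms. One combination that reproduces the totals exactly is a Conv2d with bias, bias-free Linear layers, affine LayerNorms, and a single scalar gate. Treat that as an inference from the arithmetic, not a confirmed implementation detail; the helper names are illustrative.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Weights of a square k x k convolution, plus optional bias."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def linear_params(d_in, d_out, bias=True):
    """Weights of a dense layer, plus optional bias."""
    return d_in * d_out + (d_out if bias else 0)

def layernorm_params(dim):
    """Affine LayerNorm: one scale and one shift per feature."""
    return 2 * dim

# DepthBridge: Conv2d(1->256, k=16, s=16) -> LayerNorm(256) -> Linear(256->768) + gate.
depth_bridge = (
    conv2d_params(1, 256, 16)
    + layernorm_params(256)
    + linear_params(256, 768, bias=False)  # assumption: no bias
    + 1                                    # assumption: scalar gate
)

# ObjectAnchorProjector input: box (4) + CLIP embedding (512) + depth (1).
anchor_in = 4 + 512 + 1
object_anchor = (
    linear_params(anchor_in, 960, bias=False)  # assumption: no bias
    + layernorm_params(960)
)

print(f"DepthBridge:           {depth_bridge:,}")
print(f"ObjectAnchorProjector: {object_anchor:,}")
print(f"Sidecar total:         {depth_bridge + object_anchor:,}")
```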
91
+
92
+ ---
93
+
94
+ ## Inference Pipeline
95
 
96
  ```
97
+ Input image
98
+ |--- Depth-Anything-V2-Metric-Outdoor-Small ---> depth_map (H x W, metres)
99
+ |--- YOLOv8-World (open-vocab) ----------------> boxes, class_emb, depth_vals
100
+ |
101
+ +-> SmolVLM2-500M-DepthAwareVLM
102
+ (depth_map fused via DepthBridge)
103
+ (detections passed as text hint pre-fine-tuning)
104
+ |
105
+ Answer: "The car is 10.81 metres away."
106
+ ```
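The pipeline above pairs each detection with a depth value, but the README does not say how `depth_vals` is derived from the depth map. One plausible approach, shown here purely as an assumption, is to take the median depth inside each bounding box. The helper `box_depth` and the toy depth map are hypothetical.

```python
from statistics import median

def box_depth(depth_map, box):
    """Median depth in metres inside an (x1, y1, x2, y2) pixel box.

    x2 and y2 are exclusive; the box must cover at least one pixel.
    """
    x1, y1, x2, y2 = box
    values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return median(values)

# Toy 3 x 4 depth map in metres: a near object on the left, far background on the right.
depth_map = [
    [10.0, 10.5, 30.0, 30.0],
    [10.8, 11.0, 30.0, 30.0],
    [11.2, 10.9, 30.0, 30.0],
]
boxes = [(0, 0, 1, 3), (2, 0, 4, 3)]

depth_vals = [box_depth(depth_map, b) for b in boxes]
print(depth_vals)
```

A median is less sensitive than a mean to background pixels that fall inside the box, which is the reason for picking it in this sketch.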
107
+
108
+ ---
109
+
110
+ ## Usage
111
 
112
+ ### Basic inference (PyTorch)
113
 
114
  ```python
+ import torch
+ from PIL import Image
  from transformers import AutoProcessor, AutoModelForImageTextToText
+
+ MODEL_ID = "anuragpradhan/SmolVLM2-500M-DepthAwareVLM"
+
+ model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype=torch.float32)
+ processor = AutoProcessor.from_pretrained(MODEL_ID)
 
+ # depth_integration=True is already in the saved config
+ # DepthBridge is reconstructed automatically by SmolVLMModel.__init__
+
+ image = Image.open("your_image.jpg").convert("RGB")
+
+ messages = [
+     {"role": "user", "content": [
+         {"type": "image"},
+         {"type": "text", "text": "What is happening in this scene?"},
+     ]}
+ ]
+ prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(images=image, text=prompt, return_tensors="pt")
+
+ # Optional: a Depth-Anything-V2 depth map (normalised to [0,1]).
+ # depth_map stays None if the processor returned no "depth_pixel_values".
+ depth_map = inputs.pop("depth_pixel_values", None)
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, depth_maps=depth_map, max_new_tokens=200)
+
+ # Drop the prompt tokens so only the generated answer is decoded
+ n = inputs["input_ids"].shape[1]
+ answer = processor.batch_decode(output_ids[:, n:], skip_special_tokens=True)[0].strip()
+ print(answer)
149
  ```
150
 
151
+ ### Full sidecar demo
152
+
153
+ ```bash
154
+ # Clone repo and install editable transformers
155
+ git clone https://github.com/huggingface/transformers
156
+ cd transformers && pip install -e ".[dev]"
157
+ pip install ultralytics num2words
158
+
159
+ # Run the sidecar demo
160
+ cd examples
161
+ python sidecar_depth_demo.py your_image.jpg "What is the depth of the car?"
162
+ ```
163
+
164
+ ### Fine-tuning (sidecar modules only)
165
+
166
+ ```python
167
+ from transformers import AutoModelForImageTextToText
168
+
169
+ model = AutoModelForImageTextToText.from_pretrained("anuragpradhan/SmolVLM2-500M-DepthAwareVLM")
170
+
171
+ # Freeze the 508M base model, train only the 761K sidecar params
172
+ model.freeze_base_models()
173
+
174
+ trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
175
+ print(f"Trainable params: {trainable:,}") # ~761,153
176
+ ```
177
+
178
+ ---
179
+
180
+ ## External Models Required
181
+
182
+ | Model | Purpose | HF ID |
183
+ |---|---|---|
184
+ | Depth-Anything-V2-Metric-Outdoor-Small | Metric depth map generation | `depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf` |
185
+ | YOLOv8-World | Open-vocabulary object detection | `yolov8s-world.pt` (ultralytics) |
186
+
187
+ ---
188
+
189
+ ## Config Flags
190
+
191
+ | Flag | Default | Effect |
192
+ |---|---|---|
193
+ | `depth_integration` | `True` | Instantiates DepthBridge; passes depth maps through gated residual |
194
+ | `object_integration` | `True` | Instantiates ObjectAnchorProjector; appends anchor tokens to sequence |
195
+ | `depth_hidden_dim` | `256` | Intermediate channels in DepthBridge Conv2d |
196
+ | `object_feature_dim` | `512` | CLIP embedding dimension from YOLOv8-World |
197
+ | `max_objects` | `20` | Max YOLO detections per image |
198
+ | `depth_gate_init` | `0.0` | Initial value of DepthBridge gate (0 = depth inactive at init) |
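To make the defaults in the table concrete, the sketch below mirrors them in a small dataclass and shows one way a `max_objects` cap could be applied. `SidecarFlags` and `cap_detections` are hypothetical stand-ins, not the model's real config class, and keeping the highest-confidence detections is an assumption; the README only states the limit.

```python
from dataclasses import dataclass

@dataclass
class SidecarFlags:
    """Hypothetical mirror of the flags and defaults in the table above."""
    depth_integration: bool = True
    object_integration: bool = True
    depth_hidden_dim: int = 256
    object_feature_dim: int = 512
    max_objects: int = 20
    depth_gate_init: float = 0.0

def cap_detections(detections, flags):
    """Keep at most flags.max_objects detections, highest confidence first."""
    ranked = sorted(detections, key=lambda d: d["conf"], reverse=True)
    return ranked[: flags.max_objects]

flags = SidecarFlags()
detections = [{"label": f"obj{i}", "conf": i / 100} for i in range(25)]
kept = cap_detections(detections, flags)

# Width of one anchor input: box (4) + class embedding + depth (1).
anchor_width = 4 + flags.object_feature_dim + 1
print(len(kept), anchor_width)
```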
199
+
200
+ ---
201
+
202
+ ## Limitations
203
+
204
+ - **Not fine-tuned for depth tasks.** DepthBridge gate alpha = 0.0 at initialisation; depth fusion is inactive
205
+ until fine-tuned on metric-depth QA data.
206
+ - **ObjectAnchorProjector is random-initialised.** Enabling it before fine-tuning adds noise; keep it
207
+ disabled (`config.object_integration = False`) for inference until it has been trained.
208
+ - **Text hint dependency.** Pre-fine-tuning, depth information is injected via a text prompt hint
209
+ (e.g. `"[Depth sensor] The car is 10.81 metres away."`). The model reads this textually.
210
+ - **Base model limitations apply.** SmolVLM2-500M is a small model; complex spatial reasoning
211
+ requires the sidecar fine-tuning stage.
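The text-hint route described in the limitations can be sketched as simple string formatting. `build_depth_hint` is a hypothetical helper; only the single-object sentence format comes from the example above, and joining several objects with spaces is an assumption.

```python
def build_depth_hint(detections):
    """Format (label, metres) pairs as a '[Depth sensor]' hint string."""
    sentences = [
        f"The {label} is {metres:.2f} metres away." for label, metres in detections
    ]
    return "[Depth sensor] " + " ".join(sentences)

hint = build_depth_hint([("car", 10.81)])
# The hint is placed ahead of the user's question in the text prompt.
prompt_text = f"{hint}\nHow far is the car?"
print(prompt_text)
```

Because the model reads the hint as ordinary text, the answer quality before fine-tuning depends on the detector and depth estimator, not on the DepthBridge weights.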
212
+
213
+ ---
214
+
215
+ ## Citation
216
+
217
+ ```bibtex
218
+ @misc{smolvlm2-depthawarevlm,
219
+ title = {SmolVLM2-500M-DepthAwareVLM: Sidecar Depth and Object Grounding for Vision-Language Models},
220
+ author = {Anurag Pradhan},
221
+ year = {2025},
222
+ url = {https://huggingface.co/anuragpradhan/SmolVLM2-500M-DepthAwareVLM},
223
+ note = {Built on SmolVLM2-500M-Video-Instruct with DepthBridge and ObjectAnchorProjector sidecar modules}
224
+ }
225
+ ```
226
+
227
+ ---
228
+
229
+ ## Acknowledgements
230
 
231
+ - [SmolVLM2](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) by HuggingFace
232
+ - [Depth Anything V2](https://huggingface.co/depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf)
233
+ - [YOLOv8-World](https://github.com/ultralytics/ultralytics)
model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:fad09e4f72fe9b68a9d96c49ce3076346137f8a6e996973cbe73a1c3cab241bf
3
  size 2033036156
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:52d1a4ba171ce0ea9df9f831a6dc43ad06e0bd34d1a28d5526f52b720a781a1a
3
  size 2033036156