	---
	license: apache-2.0
	base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
	language:
	- en
	tags:
	- smolvlm
	- vision-language-model
	- depth-estimation
	- object-detection
	- spatial-reasoning
	- multimodal
	- depth-aware
	- metric-depth
	pipeline_tag: image-text-to-text
	---

	# SmolVLM2-500M-DepthAwareVLM

	SmolVLM2-500M-DepthAwareVLM extends [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
	with a lightweight sidecar pipeline that fuses metric depth maps (from Depth-Anything-V2) and
	object detection anchors (from YOLOv8-World) into the vision-language forward pass. The goal is
	grounded spatial reasoning such as "How far is the car?". Basic depth-hint prompting works without
	fine-tuning; the learned depth and object fusion paths need fine-tuning before they contribute.

	---

	## Architecture

	```
	                     Image (RGB)
	                          |
	              +-----------+-----------+
	              |                       |
	      SigLIP ViT-B/16         Depth-Anything-V2
	      (Vision Encoder)        Metric-Outdoor-Small
	       86.4M params           (external, not saved)
	              |                       |
	      Patch embeddings        Depth map (H x W, metres)
	              |                       |
	              +----> DepthBridge <----+        <- NEW (262 K params)
	                 Gated residual fusion
	      gate alpha = 0.0 at init, learns during fine-tuning
	                          |
	           Connector (pixel-shuffle + MLP)
	                    11.8M params
	                          |
	                  LM token sequence
	                          |
	        [Optional] ObjectAnchorProjector       <- NEW (498 K params)
	   YOLOv8-World detections -> K anchor tokens appended
	                          |
	       SmolLM2 Language Model (Llama backbone)
	                   361.9M params
	                          |
	                       Answer
	```

	---

	## Parameter Breakdown

	| Component | Parameters | % of Total |
	|---|---|---|
	| Vision encoder (SigLIP) | 86,433,024 | 17.006% |
	| Connector (pixel-shuffle MLP) | 11,796,480 | 2.321% |
	| Language model (SmolLM2) | 361,944,000 | 71.215% |
	| DepthBridge (sidecar) | 262,913 | 0.052% |
	| ObjectAnchorProjector (sidecar) | 498,240 | 0.098% |
	| Sidecar total | 761,153 | 0.150% |
	| GRAND TOTAL | 508,243,457 | 100% |

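	The two sidecar modules account for 761,153 parameters, 0.15% of the 508.2M total, on top of a 507,482,304-parameter base model that stays frozen during sidecar fine-tuning. The itemised component rows sum to 460,934,657 (90.692%); the remaining 47,308,800 parameters (9.308%) are not broken out in the table.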
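The arithmetic behind the table can be checked with a few lines of plain Python. This is only a sanity check on the published figures; the dictionary keys are illustrative labels, not identifiers from the model code.

```python
# Component counts copied from the parameter table above.
rows = {
    "vision_encoder": 86_433_024,
    "connector": 11_796_480,
    "language_model": 361_944_000,
    "depth_bridge": 262_913,
    "object_anchor_projector": 498_240,
}
grand_total = 508_243_457

sidecar_total = rows["depth_bridge"] + rows["object_anchor_projector"]
itemised = sum(rows.values())          # everything the table lists by name
unitemised = grand_total - itemised    # parameters the table does not break out


def pct(n):
    """Share of the grand total, as a percentage rounded to 3 places."""
    return round(100 * n / grand_total, 3)


print(f"sidecar total : {sidecar_total:,} ({pct(sidecar_total)}%)")
print(f"itemised rows : {itemised:,} ({pct(itemised)}%)")
print(f"not itemised  : {unitemised:,} ({pct(unitemised)}%)")
```

The remainder equals 49,280 x 960, which would be consistent with a token-embedding or output-head matrix at the language model's hidden size of 960. That reading is an inference from the numbers, not something the table states.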

	---

	## Sidecar Modules

	### 1. DepthBridge
	- Input: Metric depth map `(B, 1, H, W)` from Depth-Anything-V2-Metric-Outdoor-Small
	- Architecture: `Conv2d(1->256, k=16, s=16)` -> `LayerNorm(256)` -> `Linear(256->768)`
	- Fusion: Gated residual, `patch_emb = patch_emb + gate * depth_features`
	- Gate alpha: Initialised at 0.0, so the depth term contributes nothing at init; the gate is a learned parameter that moves away from zero during fine-tuning
	- Effect: Vision patches receive metric depth context at the embedding level, before the connector
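The layer shapes listed above can be tied back to the 262,913 figure in the parameter table. The sketch below is a plain-Python count, not the module itself, and it rests on assumptions that the card does not state: the `Conv2d` has a bias, the `Linear` has none, and the gate is a single scalar. Under those assumptions the count matches exactly. The helper names are hypothetical. It also shows why a gate of 0.0 leaves the patch embeddings untouched.

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Weights of a square-kernel Conv2d, plus an optional bias per output channel."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def layernorm_params(dim):
    """LayerNorm carries one scale and one shift per feature."""
    return 2 * dim


def linear_params(d_in, d_out, bias=True):
    return d_in * d_out + (d_out if bias else 0)


# Assumed layout: biased Conv2d, bias-free Linear, one scalar gate.
depth_bridge_params = (
    conv2d_params(1, 256, 16)                # Conv2d(1->256, k=16, s=16)
    + layernorm_params(256)                  # LayerNorm(256)
    + linear_params(256, 768, bias=False)    # Linear(256->768)
    + 1                                      # scalar gate
)
print(depth_bridge_params)


def gated_residual(patch_emb, depth_features, gate):
    """patch_emb + gate * depth_features, element by element."""
    return [p + gate * d for p, d in zip(patch_emb, depth_features)]


patch = [0.5, -1.0, 2.0]   # toy 3-dim patch embedding
depth = [3.0, 3.0, 3.0]    # toy depth features for the same patch

at_init = gated_residual(patch, depth, gate=0.0)   # identical to `patch`
after = gated_residual(patch, depth, gate=0.1)     # depth now shifts the embedding
```

With `gate=0.0` the output equals the input embedding, so the checkpoint behaves like the base model until fine-tuning moves the gate.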

	### 2. ObjectAnchorProjector
	- Input: YOLOv8-World detections: bounding boxes `(K, 4)` + CLIP class embeddings `(K, 512)` + depth `(K, 1)`, concatenated to `(K, 517)`
	- Architecture: `Linear(517->960)` -> `LayerNorm(960)`
	- Fusion: K anchor tokens appended to the LM input sequence after image-text merging
	- Note: Enable only after fine-tuning. Before training the projector weights are random, so the anchor tokens add noise; disable with `config.object_integration = False`
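The 517-dimensional input and the 498,240 parameter count both follow from the shapes listed above. The sketch below assembles one detection's feature vector and counts parameters in plain Python. It assumes a bias-free `Linear`, which is the layout that reproduces the published count, and the box values are made up for illustration. The helper names are hypothetical.

```python
BOX_DIM, CLIP_DIM, DEPTH_DIM = 4, 512, 1
LM_HIDDEN = 960  # width of each anchor token, per Linear(517->960)


def anchor_features(box, class_emb, depth):
    """Concatenate one detection into a single projector input vector."""
    assert len(box) == BOX_DIM and len(class_emb) == CLIP_DIM
    return list(box) + list(class_emb) + [depth]


# One toy detection: illustrative box, zero class embedding, depth in metres.
feat = anchor_features([0.1, 0.2, 0.4, 0.6], [0.0] * CLIP_DIM, 10.81)
in_dim = len(feat)

# Assumed layout: bias-free Linear(517->960) followed by LayerNorm(960).
projector_params = in_dim * LM_HIDDEN + 2 * LM_HIDDEN


def extended_length(seq_len, k):
    """Sequence length once K anchor tokens are appended."""
    return seq_len + k


print(in_dim, projector_params, extended_length(1200, 5))
```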

	---

	## Inference Pipeline

	```
	Input image
	  |--- Depth-Anything-V2-Metric-Outdoor-Small ---> depth_map (H x W, metres)
	  |--- YOLOv8-World (open-vocab) ----------------> boxes, class_emb, depth_vals
	  |
	  +-> SmolVLM2-500M-DepthAwareVLM
	        (depth_map fused via DepthBridge)
	        (detections passed as text hint pre-fine-tuning)
	        |
	      Answer: "The car is 10.81 metres away."
	```
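Before fine-tuning, the detections reach the model as a text hint. A minimal sketch of that step follows. The hint wording is the one this card quotes under Limitations. Taking the median depth inside each box and prefixing the hint to the question are assumptions for illustration, and `box_depth` and `depth_hint` are hypothetical helpers, not part of the released code.

```python
from statistics import median


def box_depth(depth_map, box):
    """Median depth in metres inside an (x1, y1, x2, y2) pixel box (x2, y2 exclusive)."""
    x1, y1, x2, y2 = box
    values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return median(values)


def depth_hint(label, metres):
    """Format one detection as a textual depth hint."""
    return f"[Depth sensor] The {label} is {metres:.2f} metres away."


# Toy 4 x 4 depth map in metres; one background pixel leaks into the box.
depth_map = [
    [30.0, 30.0, 30.0, 30.0],
    [30.0, 10.81, 10.81, 30.0],
    [30.0, 10.81, 30.0, 30.0],
    [30.0, 30.0, 30.0, 30.0],
]

metres = box_depth(depth_map, (1, 1, 3, 3))
hint = depth_hint("car", metres)
prompt_text = f"{hint} How far is the car?"
print(prompt_text)
```

The median is used here because a box usually includes some background pixels, and a mean would pull the estimate towards them.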

	---

	## Usage

	### Basic inference (PyTorch)

	```python
	import torch
	from PIL import Image
	from transformers import AutoProcessor, AutoModelForImageTextToText
	# Importing DepthBridge fails fast if the installed transformers build lacks the sidecar code
	from transformers.models.smolvlm.modeling_smolvlm import DepthBridge  # noqa: F401

	MODEL_ID = "anuragpradhan/SmolVLM2-500M-DepthAwareVLM"

	model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype=torch.float32)
	processor = AutoProcessor.from_pretrained(MODEL_ID)

	# depth_integration=True is already in the saved config
	# DepthBridge is reconstructed automatically by SmolVLMModel.__init__

	image = Image.open("your_image.jpg").convert("RGB")

	messages = [
	    {"role": "user", "content": [
	        {"type": "image"},
	        {"type": "text", "text": "What is happening in this scene?"},
	    ]}
	]
	# Render the chat template to a prompt string; the processor tokenises it below
	prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
	inputs = processor(images=image, text=prompt, return_tensors="pt")

	# Optional depth input: if the processor returned a depth map under "depth_pixel_values"
	# (from Depth-Anything-V2, normalised to [0, 1]), pop it so it is passed explicitly below.
	# Otherwise depth_map is None.
	depth_map = inputs.pop("depth_pixel_values", None)

	with torch.no_grad():
	    output_ids = model.generate(**inputs, depth_maps=depth_map, max_new_tokens=200)

	# Drop the prompt tokens so only the newly generated answer is decoded
	n = inputs["input_ids"].shape[1]
	answer = processor.batch_decode(output_ids[:, n:], skip_special_tokens=True)[0].strip()
	print(answer)
	```

	### Full sidecar demo

	```bash
	# Clone repo and install editable transformers
	git clone https://github.com/huggingface/transformers
	cd transformers && pip install -e ".[dev]"
	pip install ultralytics num2words

	# Run the sidecar demo
	cd examples
	python sidecar_depth_demo.py your_image.jpg "What is the depth of the car?"
	```

	### Fine-tuning (sidecar modules only)

	```python
	from transformers import AutoModelForImageTextToText

	model = AutoModelForImageTextToText.from_pretrained("anuragpradhan/SmolVLM2-500M-DepthAwareVLM")

	# Freeze the 508M base model, train only the 761K sidecar params
	model.freeze_base_models()

	trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
	print(f"Trainable params: {trainable:,}") # ~761,153
	```

	---

	## External Models Required

	| Model | Purpose | HF ID |
	|---|---|---|
	| Depth-Anything-V2-Metric-Outdoor-Small | Metric depth map generation | `depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf` |
	| YOLOv8-World | Open-vocabulary object detection | `yolov8s-world.pt` (ultralytics) |

	---

	## Config Flags

	| Flag | Default | Effect |
	|---|---|---|
	| `depth_integration` | `True` | Instantiates DepthBridge; passes depth maps through gated residual |
	| `object_integration` | `True` | Instantiates ObjectAnchorProjector; appends anchor tokens to sequence |
	| `depth_hidden_dim` | `256` | Intermediate channels in DepthBridge Conv2d |
	| `object_feature_dim` | `512` | CLIP embedding dimension from YOLOv8-World |
	| `max_objects` | `20` | Max YOLO detections per image |
	| `depth_gate_init` | `0.0` | Initial value of DepthBridge gate (0 = depth inactive at init) |
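To make the flags concrete, the sketch below mirrors the table as a small dataclass. `SidecarConfig` is a hypothetical stand-in for illustration, not the model's actual config class. Truncating to `max_objects` by descending confidence is an assumed policy, since the card only gives the cap.

```python
from dataclasses import dataclass


@dataclass
class SidecarConfig:
    """Hypothetical container holding the flags and defaults from the table above."""
    depth_integration: bool = True
    object_integration: bool = True
    depth_hidden_dim: int = 256
    object_feature_dim: int = 512
    max_objects: int = 20
    depth_gate_init: float = 0.0

    @property
    def anchor_input_dim(self) -> int:
        # 4 box coordinates + class embedding + 1 depth value
        return 4 + self.object_feature_dim + 1


def clip_detections(detections, cfg):
    """Keep at most cfg.max_objects detections, highest confidence first (assumed policy)."""
    ranked = sorted(detections, key=lambda d: d["conf"], reverse=True)
    return ranked[: cfg.max_objects]


# Pre-fine-tuning inference: keep depth on, switch the untrained projector off.
inference_cfg = SidecarConfig(object_integration=False)
print(inference_cfg.anchor_input_dim, inference_cfg.object_integration)
```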

	---

	## Limitations

	- Not fine-tuned for depth tasks. DepthBridge gate alpha = 0.0 at initialisation; depth fusion is inactive
	until fine-tuned on metric-depth QA data.
	- ObjectAnchorProjector is randomly initialised. Enabling it before fine-tuning adds noise; keep it
	disabled for inference (`config.object_integration = False`) until it has been trained.
	- Text hint dependency. Pre-fine-tuning, depth information is injected via a text prompt hint
	(e.g. `"[Depth sensor] The car is 10.81 metres away."`). The model reads this textually.
	- Base model limitations apply. SmolVLM2-500M is a small model; complex spatial reasoning
	requires the sidecar fine-tuning stage.

	---

	## Citation

	```bibtex
	@misc{smolvlm2-depthawarevlm,
	title = {SmolVLM2-500M-DepthAwareVLM: Sidecar Depth and Object Grounding for Vision-Language Models},
	author = {Anurag Pradhan},
	year = {2025},
	url = {https://huggingface.co/anuragpradhan/SmolVLM2-500M-DepthAwareVLM},
	note = {Built on SmolVLM2-500M-Video-Instruct with DepthBridge and ObjectAnchorProjector sidecar modules}
	}
	```

	---

	## Acknowledgements

	- [SmolVLM2](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) by Hugging Face
	- [Depth Anything V2](https://huggingface.co/depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf)
	- [YOLOv8-World](https://github.com/ultralytics/ultralytics)