| ---
|
| license: apache-2.0
|
| base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
|
| language:
|
| - en
|
| tags:
|
| - smolvlm
|
| - vision-language-model
|
| - depth-estimation
|
| - object-detection
|
| - spatial-reasoning
|
| - multimodal
|
| - depth-aware
|
| - metric-depth
|
| pipeline_tag: image-text-to-text
|
| ---
|
|
|
| # SmolVLM2-500M-DepthAwareVLM
|
|
|
| **SmolVLM2-500M-DepthAwareVLM** extends [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
|
| with a lightweight sidecar pipeline that fuses **metric depth maps** (from Depth-Anything-V2) and
|
| **object detection anchors** (from YOLOv8-World) directly into the vision-language forward pass.
|
| This enables grounded spatial reasoning such as *"How far is the car?"*. Basic depth-hint prompting
|
| works without any fine-tuning.
|
|
|
| ---
|
|
|
| ## Architecture
|
|
|
| ```
|
|                        Image (RGB)
|
|                             |
|
|                  +----------+----------+
|
|                  |                     |
|
|          SigLIP ViT-B/16       Depth-Anything-V2
|
|          (Vision Encoder)      Metric-Outdoor-Small
|
|          86.4M params          (external, not saved)
|
|                  |                     |
|
|          Patch embeddings      Depth map (H x W, metres)
|
|                  |                     |
|
|                  +----> DepthBridge <--+       <- NEW (263 K params)
|
|                  Gated residual fusion
|
|          gate alpha = 0.0 at init, learns during fine-tuning
|
|                             |
|
|               Connector (pixel-shuffle + MLP)
|
|                        11.8M params
|
|                             |
|
|                      LM token sequence
|
|                             |
|
|          [Optional] ObjectAnchorProjector      <- NEW (498 K params)
|
|          YOLOv8-World detections -> K anchor tokens appended
|
|                             |
|
|          SmolLM2 Language Model (Llama backbone)
|
|                        361.9M params
|
|                             |
|
|                           Answer
|
| ```
|
|
|
| ---
|
|
|
| ## Parameter Breakdown
|
|
|
| | Component | Parameters | % of Total |
|
| |---|---|---|
|
| | Vision encoder (SigLIP) | 86,433,024 | 17.006% |
|
| | Connector (pixel-shuffle MLP) | 11,796,480 | 2.321% |
|
| | Language model (SmolLM2) | 361,944,000 | 71.215% |
|
| | **DepthBridge** (sidecar) | 262,913 | 0.052% |
|
| | **ObjectAnchorProjector** (sidecar) | 498,240 | 0.098% |
|
| | **Sidecar total** | **761,153** | **0.150%** |
|
| | **GRAND TOTAL** | **508,243,457** | 100% |
|
|
|
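| The two sidecar modules account for only **0.15%** of the 508M total: 761,153 new parameters on top of the frozen 507,482,304-parameter base model. The itemised rows above sum to 460,934,657; the remaining 47,308,800 parameters (9.308%) are not broken out in this table.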
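
The percentages in the table can be re-derived from the raw counts. The check below is a minimal sketch that uses only the figures listed above; the dictionary keys and the `share` helper are illustrative names, not part of the released code.

```python
# Parameter counts copied from the table above.
counts = {
    "vision_encoder": 86_433_024,
    "connector": 11_796_480,
    "language_model": 361_944_000,
    "depth_bridge": 262_913,
    "object_anchor_projector": 498_240,
}
GRAND_TOTAL = 508_243_457


def share(n: int, total: int = GRAND_TOTAL) -> float:
    """Percentage of the grand total, rounded to three decimals as in the table."""
    return round(100 * n / total, 3)


sidecar_total = counts["depth_bridge"] + counts["object_anchor_projector"]
itemised_total = sum(counts.values())
not_itemised = GRAND_TOTAL - itemised_total  # parameters the table does not list

for name, n in counts.items():
    print(f"{name:<26}{n:>13,}  {share(n):.3f}%")
print(f"{'sidecar_total':<26}{sidecar_total:>13,}  {share(sidecar_total):.3f}%")
print(f"{'not_itemised':<26}{not_itemised:>13,}  {share(not_itemised):.3f}%")
```

Running it reproduces each percentage in the table and shows how many parameters of the grand total fall outside the itemised rows.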
|
|
|
| ---
|
|
|
| ## Sidecar Modules
|
|
|
| ### 1. DepthBridge
|
| - **Input:** Metric depth map `(B, 1, H, W)` from Depth-Anything-V2-Metric-Outdoor-Small
|
| - **Architecture:** `Conv2d(1->256, k=16, s=16)` -> `LayerNorm(256)` -> `Linear(256->768)`
|
| - **Fusion:** Gated residual: `patch_emb = patch_emb + gate * depth_features`
|
| - **Gate alpha:** Initialised at **0.0**, so depth has no effect at init; the gate value is learned during fine-tuning
|
| - **Effect:** Vision patches receive metric depth context at the embedding level, before the connector
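
To illustrate why a gate initialised at 0.0 leaves the base model's behaviour unchanged, here is a framework-free sketch of the fusion rule `patch_emb + gate * depth_features`. It uses plain Python lists in place of tensors. `gated_residual` and `num_depth_tokens` are hypothetical helper names, the `Conv2d` is assumed to use no padding, and the 512 x 512 input size is an assumption chosen only to make the arithmetic concrete.

```python
from typing import List

Vector = List[float]


def gated_residual(patch_emb: List[Vector], depth_feat: List[Vector], gate: float) -> List[Vector]:
    """Apply patch_emb + gate * depth_feat element-wise, token by token."""
    if len(patch_emb) != len(depth_feat):
        raise ValueError("depth tokens must align one-to-one with vision patches")
    return [
        [p + gate * d for p, d in zip(p_tok, d_tok)]
        for p_tok, d_tok in zip(patch_emb, depth_feat)
    ]


def num_depth_tokens(height: int, width: int, kernel: int = 16, stride: int = 16) -> int:
    """Token count produced by an unpadded Conv2d(k=16, s=16) over an H x W depth map."""
    rows = (height - kernel) // stride + 1
    cols = (width - kernel) // stride + 1
    return rows * cols


patches = [[0.5, -1.0], [2.0, 0.25]]   # two toy patch embeddings
depth = [[10.0, 10.0], [3.0, 3.0]]     # toy depth features for the same patches

print(gated_residual(patches, depth, gate=0.0))  # identical to `patches`
print(gated_residual(patches, depth, gate=0.1))  # depth nudges each embedding
print(num_depth_tokens(512, 512))                # size of a 32 x 32 token grid
```

With `gate=0.0` the output equals the input embeddings exactly, which is why the depth branch is inert until fine-tuning moves the gate away from zero.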
|
|
|
| ### 2. ObjectAnchorProjector
|
| - **Input:** YOLOv8-World detections: bounding boxes `(K, 4)` + CLIP class embeddings `(K, 512)` + depth `(K, 1)`, concatenated into `(K, 517)` features
|
| - **Architecture:** `Linear(517->960)` -> `LayerNorm(960)`
|
| - **Fusion:** K anchor tokens appended to the LM input sequence after image-text merging
|
| - **Note:** Enable after fine-tuning. Random weights before training add noise; disable with `config.object_integration = False`
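
The sidecar sizes in the parameter table can be reproduced from the layer shapes listed above. The sketch below is an arithmetic check, not the model's actual code. It assumes the `Conv2d` has a bias, both `Linear` layers are bias-free, each `LayerNorm` has a weight and a bias, and the DepthBridge gate is a single scalar. These assumptions are inferred only because they match the published counts (262,913 and 498,240).

```python
def conv2d_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Weights (+ optional bias) of a square-kernel Conv2d."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def linear_params(d_in: int, d_out: int, bias: bool = True) -> int:
    """Weights (+ optional bias) of a Linear layer."""
    return d_in * d_out + (d_out if bias else 0)


def layernorm_params(dim: int) -> int:
    """LayerNorm with an element-wise weight and bias."""
    return 2 * dim


# DepthBridge: Conv2d(1->256, k=16, s=16) -> LayerNorm(256) -> Linear(256->768) + scalar gate
depth_bridge = (
    conv2d_params(1, 256, 16)
    + layernorm_params(256)
    + linear_params(256, 768, bias=False)  # assumption: no bias
    + 1                                    # assumption: one scalar gate
)

# ObjectAnchorProjector: Linear(517->960) -> LayerNorm(960)
anchor_in = 4 + 512 + 1  # box coordinates + CLIP embedding + depth value
object_anchor_projector = (
    linear_params(anchor_in, 960, bias=False)  # assumption: no bias
    + layernorm_params(960)
)

print(f"DepthBridge:           {depth_bridge:,}")
print(f"ObjectAnchorProjector: {object_anchor_projector:,}")
print(f"Sidecar total:         {depth_bridge + object_anchor_projector:,}")
```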
|
|
|
| ---
|
|
|
| ## Inference Pipeline
|
|
|
| ```
|
| Input image
|
|   |--- Depth-Anything-V2-Metric-Outdoor-Small ---> depth_map (H x W, metres)
|
|   |--- YOLOv8-World (open-vocab) ----------------> boxes, class_emb, depth_vals
|
|   |
|
|   +-> SmolVLM2-500M-DepthAwareVLM
|
|         (depth_map fused via DepthBridge)
|
|         (detections passed as text hint pre-fine-tuning)
|
|         |
|
|       Answer: "The car is 10.81 metres away."
|
| ```
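
The pipeline above pairs each detection with a depth value (`depth_vals`), but this card does not say how that value is computed. One plausible approach, shown here purely as an assumption, is to take the median of the depth map inside each bounding box. `box_depth` is a hypothetical helper, and the depth map is a nested list in metres so that the sketch runs without extra dependencies.

```python
from statistics import median
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels, x2/y2 exclusive


def box_depth(depth_map: List[List[float]], box: Box) -> float:
    """Median metric depth (metres) of the pixels inside `box` (assumed reduction)."""
    height, width = len(depth_map), len(depth_map[0])
    x1, y1, x2, y2 = box
    # Clamp the box to the image so partially visible objects still work.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    if x1 >= x2 or y1 >= y2:
        raise ValueError("box does not overlap the depth map")
    values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return median(values)


# Toy 4 x 4 depth map: a near object in the centre, far background around it.
depth_map = [
    [30.0, 30.0, 30.0, 30.0],
    [30.0, 10.8, 10.9, 30.0],
    [30.0, 10.7, 10.8, 30.0],
    [30.0, 30.0, 30.0, 30.0],
]
print(box_depth(depth_map, (1, 1, 3, 3)))
```

The median is used rather than the mean because boxes usually include some background pixels, which would otherwise pull the estimate towards the far plane.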
|
|
|
| ---
|
|
|
| ## Usage
|
|
|
| ### Basic inference (PyTorch)
|
|
|
| ```python
|
| import torch
|
| from PIL import Image
|
| from transformers import AutoProcessor, AutoModelForImageTextToText
|
|
|
| MODEL_ID = "anuragpradhan/SmolVLM2-500M-DepthAwareVLM"
|
|
|
| model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype=torch.float32)
|
| processor = AutoProcessor.from_pretrained(MODEL_ID)
|
|
|
| # depth_integration=True is already in the saved config
|
| # DepthBridge is reconstructed automatically by SmolVLMModel.__init__
|
|
|
| image = Image.open("your_image.jpg").convert("RGB")
|
|
|
| messages = [
|
| {"role": "user", "content": [
|
| {"type": "image"},
|
| {"type": "text", "text": "What is happening in this scene?"},
|
| ]}
|
| ]
|
| prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
|
| inputs = processor(images=image, text=prompt, return_tensors="pt")
|
|
|
| # Optional: a depth map from Depth-Anything-V2, normalised to [0, 1]; None if the processor did not return one
|
| depth_map = inputs.pop("depth_pixel_values", None)
|
|
|
| with torch.no_grad():
|
| output_ids = model.generate(**inputs, depth_maps=depth_map, max_new_tokens=200)
|
|
|
| n = inputs["input_ids"].shape[1]
|
| answer = processor.batch_decode(output_ids[:, n:], skip_special_tokens=True)[0].strip()
|
| print(answer)
|
| ```
|
|
|
| ### Full sidecar demo
|
|
|
| ```bash
|
| # Clone repo and install editable transformers
|
| git clone https://github.com/huggingface/transformers
|
| cd transformers && pip install -e ".[dev]"
|
| pip install ultralytics num2words
|
|
|
| # Run the sidecar demo
|
| cd examples
|
| python sidecar_depth_demo.py your_image.jpg "What is the depth of the car?"
|
| ```
|
|
|
| ### Fine-tuning (sidecar modules only)
|
|
|
| ```python
|
| from transformers import AutoModelForImageTextToText
|
|
|
| model = AutoModelForImageTextToText.from_pretrained("anuragpradhan/SmolVLM2-500M-DepthAwareVLM")
|
|
|
| # Freeze the ~507.5M base parameters and train only the 761K sidecar parameters
|
| model.freeze_base_models()
|
|
|
| trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
|
| print(f"Trainable params: {trainable:,}") # ~761,153
|
| ```
|
|
|
| ---
|
|
|
| ## External Models Required
|
|
|
| | Model | Purpose | HF ID |
|
| |---|---|---|
|
| | Depth-Anything-V2-Metric-Outdoor-Small | Metric depth map generation | `depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf` |
|
| | YOLOv8-World | Open-vocabulary object detection | `yolov8s-world.pt` (ultralytics) |
|
|
|
| ---
|
|
|
| ## Config Flags
|
|
|
| | Flag | Default | Effect |
|
| |---|---|---|
|
| | `depth_integration` | `True` | Instantiates DepthBridge; passes depth maps through gated residual |
|
| | `object_integration` | `True` | Instantiates ObjectAnchorProjector; appends anchor tokens to sequence |
|
| | `depth_hidden_dim` | `256` | Intermediate channels in DepthBridge Conv2d |
|
| | `object_feature_dim` | `512` | CLIP embedding dimension from YOLOv8-World |
|
| | `max_objects` | `20` | Max YOLO detections per image |
|
| | `depth_gate_init` | `0.0` | Initial value of DepthBridge gate (0 = depth inactive at init) |
|
|
|
| ---
|
|
|
| ## Limitations
|
|
|
| - **Not fine-tuned for depth tasks.** DepthBridge gate alpha = 0.0 at initialisation; depth fusion is inactive
|
| until fine-tuned on metric-depth QA data.
|
| - **ObjectAnchorProjector is random-initialised.** Enabling it before fine-tuning adds noise, so set
|
| `config.object_integration = False` for inference until it has been trained.
|
| - **Text hint dependency.** Pre-fine-tuning, depth information is injected via a text prompt hint
|
| (e.g. `"[Depth sensor] The car is 10.81 metres away."`). The model reads this textually.
|
| - **Base model limitations apply.** SmolVLM2-500M is a small model; complex spatial reasoning
|
| requires the sidecar fine-tuning stage.
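
Because depth reaches the model as text before fine-tuning, the prompt has to carry the measurements. The helper below is a sketch of how such a hint could be assembled in the `[Depth sensor]` format quoted above. `build_depth_hint`, `with_hint`, and the `(label, distance)` input format are assumptions for illustration, not part of the released code; the cap of 20 objects simply mirrors the `max_objects` config flag.

```python
from typing import Iterable, Tuple


def build_depth_hint(detections: Iterable[Tuple[str, float]], max_objects: int = 20) -> str:
    """Format (label, distance_in_metres) pairs as a '[Depth sensor] ...' text hint."""
    sentences = [
        f"The {label} is {distance:.2f} metres away."
        for label, distance in list(detections)[:max_objects]
    ]
    if not sentences:
        return ""  # no detections, so no hint
    return "[Depth sensor] " + " ".join(sentences)


def with_hint(question: str, detections: Iterable[Tuple[str, float]]) -> str:
    """Prepend the depth hint (if any) to the user's question."""
    hint = build_depth_hint(detections)
    return f"{hint}\n{question}" if hint else question


print(build_depth_hint([("car", 10.81)]))
# [Depth sensor] The car is 10.81 metres away.
print(with_hint("How far is the car?", [("car", 10.81), ("person", 4.5)]))
```

The resulting string would replace the plain question in the `{"type": "text", ...}` entry of the chat message.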
|
|
|
| ---
|
|
|
| ## Citation
|
|
|
| ```bibtex
|
| @misc{smolvlm2-depthawarevlm,
|
| title = {SmolVLM2-500M-DepthAwareVLM: Sidecar Depth and Object Grounding for Vision-Language Models},
|
| author = {Anurag Pradhan},
|
| year = {2025},
|
| url = {https://huggingface.co/anuragpradhan/SmolVLM2-500M-DepthAwareVLM},
|
| note = {Built on SmolVLM2-500M-Video-Instruct with DepthBridge and ObjectAnchorProjector sidecar modules}
|
| }
|
| ```
|
|
|
| ---
|
|
|
| ## Acknowledgements
|
|
|
| - [SmolVLM2](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) by Hugging Face
|
| - [Depth Anything V2](https://huggingface.co/depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf)
|
| - [YOLOv8-World](https://github.com/ultralytics/ultralytics)
|
|
|