---
license: apache-2.0
base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
language:
- en
tags:
- smolvlm
- vision-language-model
- depth-estimation
- object-detection
- spatial-reasoning
- multimodal
- depth-aware
- metric-depth
pipeline_tag: image-text-to-text
---

# SmolVLM2-500M-DepthAwareVLM

**SmolVLM2-500M-DepthAwareVLM** extends [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) with a lightweight sidecar pipeline that fuses **metric depth maps** (from Depth-Anything-V2) and **object detection anchors** (from YOLOv8-World) into the vision-language forward pass. The goal is grounded spatial reasoning such as *"How far is the car?"*. Before any fine-tuning, basic depth-aware answers rely on a text depth hint in the prompt (see Limitations); the fused depth path becomes active only once the sidecar modules are trained.

---

## Architecture

```
                 Image (RGB)
                      |
           +----------+----------+
           |                     |
   SigLIP ViT-SO/14       Depth-Anything-V2
   (Vision Encoder)       Metric-Outdoor-Small
   86.4M params           (external, not saved)
           |                     |
   Patch embeddings       Depth map (H x W, metres)
           |                     |
           +----> DepthBridge <--+        <- NEW (262 K params)
                  Gated residual fusion
                  gate alpha = 0.0 at init, learns during fine-tuning
                      |
           Connector (pixel-shuffle + MLP)
           11.8M params
                      |
              LM token sequence
                      |
     [Optional] ObjectAnchorProjector     <- NEW (498 K params)
     YOLOv8-World detections -> K anchor tokens appended
                      |
       SmolLM2 Language Model (Llama backbone)
       361.9M params
                      |
                   Answer
```

---

## Parameter Breakdown

| Component | Parameters | % of Total |
|---|---|---|
| Vision encoder (SigLIP) | 86,433,024 | 17.006% |
| Connector (pixel-shuffle MLP) | 11,796,480 | 2.321% |
| Language model (SmolLM2) | 361,944,000 | 71.215% |
| **DepthBridge** (sidecar) | 262,913 | 0.052% |
| **ObjectAnchorProjector** (sidecar) | 498,240 | 0.098% |
| **Sidecar total** | **761,153** | **0.150%** |
| **GRAND TOTAL** | **508,243,457** | 100% |

The itemised component rows sum to 460,934,657 parameters; the remaining 47,308,800 (9.308% of the total) are not broken out in this table.

The two sidecar modules account for only **0.15%** of the total (761,153 of 508,243,457 parameters); everything else belongs to the frozen base model.

---

## Sidecar Modules

### 1. DepthBridge

- **Input:** Metric depth map `(B, 1, H, W)` from Depth-Anything-V2-Metric-Outdoor-Small
- **Architecture:** `Conv2d(1->256, k=16, s=16)` -> `LayerNorm(256)` -> `Linear(256->768)`
- **Fusion:** Gated residual: `patch_emb = patch_emb + gate * depth_features`
- **Gate alpha:** Initialised at **0.0**, so depth contributes nothing at initialisation; the gate is learned during fine-tuning
- **Effect:** Vision patches receive metric depth context at the embedding level, before the connector

### 2. ObjectAnchorProjector

- **Input:** YOLOv8-World detections: bounding boxes `(K, 4)` + CLIP class embeddings `(K, 512)` + depth `(K, 1)`, concatenated to 517 features per detection
- **Architecture:** `Linear(517->960)` -> `LayerNorm(960)`
- **Fusion:** K anchor tokens appended to the LM input sequence after image-text merging
- **Note:** Enable only after fine-tuning. Before training, the randomly initialised weights add noise; disable the module with `config.object_integration = False`

---

## Inference Pipeline

```
Input image
  |--- Depth-Anything-V2-Metric-Outdoor-Small ---> depth_map (H x W, metres)
  |--- YOLOv8-World (open-vocab) ----------------> boxes, class_emb, depth_vals
  |
  +-> SmolVLM2-500M-DepthAwareVLM
        (depth_map fused via DepthBridge)
        (detections passed as text hint pre-fine-tuning)
        |
      Answer: "The car is 10.81 metres away."
```

---

## Usage

### Basic inference (PyTorch)

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "anuragpradhan/SmolVLM2-500M-DepthAwareVLM"

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype=torch.float32)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# depth_integration=True is already in the saved config;
# DepthBridge is reconstructed automatically by SmolVLMModel.__init__

image = Image.open("your_image.jpg").convert("RGB")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is happening in this scene?"},
    ]}
]

# Render the chat template to a prompt string (no tokenisation yet)
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt")

# Optional: pass a metric depth map (normalised to [0,1]) from Depth-Anything-V2.
# If the processor did not produce "depth_pixel_values", depth_map is None.
depth_map = inputs.pop("depth_pixel_values", None)

with torch.no_grad():
    output_ids = model.generate(**inputs, depth_maps=depth_map, max_new_tokens=200)

# Strip the prompt tokens and decode only the newly generated ones
n = inputs["input_ids"].shape[1]
answer = processor.batch_decode(output_ids[:, n:], skip_special_tokens=True)[0].strip()
print(answer)
```

### Full sidecar demo

```bash
# Clone repo and install editable transformers
git clone https://github.com/huggingface/transformers
cd transformers && pip install -e ".[dev]"
pip install ultralytics num2words

# Run the sidecar demo
cd examples
python sidecar_depth_demo.py your_image.jpg "What is the depth of the car?"
```

### Fine-tuning (sidecar modules only)

```python
from transformers import AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained("anuragpradhan/SmolVLM2-500M-DepthAwareVLM")

# Freeze the 508M base model, train only the 761K sidecar params
model.freeze_base_models()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable params: {trainable:,}")  # ~761,153
```

---

## External Models Required

| Model | Purpose | HF ID |
|---|---|---|
| Depth-Anything-V2-Metric-Outdoor-Small | Metric depth map generation | `depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf` |
| YOLOv8-World | Open-vocabulary object detection | `yolov8s-world.pt` (ultralytics) |

---

## Config Flags

| Flag | Default | Effect |
|---|---|---|
| `depth_integration` | `True` | Instantiates DepthBridge; passes depth maps through gated residual |
| `object_integration` | `True` | Instantiates ObjectAnchorProjector; appends anchor tokens to sequence |
| `depth_hidden_dim` | `256` | Intermediate channels in DepthBridge Conv2d |
| `object_feature_dim` | `512` | CLIP embedding dimension from YOLOv8-World |
| `max_objects` | `20` | Max YOLO detections per image |
| `depth_gate_init` | `0.0` | Initial value of DepthBridge gate (0 = depth inactive at init) |

---

## Limitations

- **Not fine-tuned for depth tasks.** The DepthBridge gate alpha is 0.0 at initialisation, so depth fusion is inactive until the model is fine-tuned on metric-depth QA data.
- **ObjectAnchorProjector is randomly initialised.** Enabling it before fine-tuning adds noise, so disable it for inference (`config.object_integration = False`) until it has been trained.
- **Text hint dependency.** Before fine-tuning, depth information is injected via a text hint in the prompt (e.g. `"[Depth sensor] The car is 10.81 metres away."`), which the model reads as ordinary text.
- **Base model limitations apply.** SmolVLM2-500M is a small model; complex spatial reasoning requires the sidecar fine-tuning stage.
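
The text-hint workaround described in the limitations above could be sketched roughly as follows. This is a minimal illustration, not the code used by `sidecar_depth_demo.py`: the helper name `build_depth_hint`, the `(label, box)` detection format, and the choice of the median depth inside each box are all assumptions made for the example. Only the `[Depth sensor]` hint format comes from this card.

```python
from statistics import median


def build_depth_hint(depth_map, detections):
    """Format a text depth hint from a metric depth map and labelled boxes.

    Hypothetical helper (not part of the released code).
    depth_map:  list of rows of depths in metres (H x W).
    detections: list of (label, (x1, y1, x2, y2)) in pixel coordinates,
                with x2 and y2 exclusive.
    """
    height = len(depth_map)
    width = len(depth_map[0]) if height else 0
    sentences = []
    for label, (x1, y1, x2, y2) in detections:
        # Clamp the box to the image so out-of-range boxes cannot index past the map.
        x1, y1 = max(0, x1), max(0, y1)
        x2, y2 = min(width, x2), min(height, y2)
        values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
        if not values:
            continue  # empty or fully out-of-frame box
        # Assumption: use the median, which is less sensitive to background
        # pixels inside the box than the mean.
        sentences.append(f"The {label} is {median(values):.2f} metres away.")
    return "[Depth sensor] " + " ".join(sentences) if sentences else ""


# Toy 4 x 4 depth map in which every pixel is 10.81 m away.
depth_map = [[10.81] * 4 for _ in range(4)]
hint = build_depth_hint(depth_map, [("car", (1, 1, 3, 3))])
print(hint)  # [Depth sensor] The car is 10.81 metres away.
```

The resulting string would then be placed alongside the user's question in the `text` entry of the chat message, so that the model can read the depth value as plain text.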

---

## Citation

```bibtex
@misc{smolvlm2-depthawarevlm,
  title  = {SmolVLM2-500M-DepthAwareVLM: Sidecar Depth and Object Grounding for Vision-Language Models},
  author = {Anurag Pradhan},
  year   = {2025},
  url    = {https://huggingface.co/anuragpradhan/SmolVLM2-500M-DepthAwareVLM},
  note   = {Built on SmolVLM2-500M-Video-Instruct with DepthBridge and ObjectAnchorProjector sidecar modules}
}
```

---

## Acknowledgements

- [SmolVLM2](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) by Hugging Face
- [Depth Anything V2](https://huggingface.co/depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf)
- [YOLOv8-World](https://github.com/ultralytics/ultralytics)