
@@ -1,41 +1,233 @@
 ---
 license: apache-2.0
 base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
 tags:
   - smolvlm
   - depth-estimation
   - object-detection
   - multimodal
-  - sidecar-pipeline
 ---
-# SmolVLM2-500M-Video-Instruct + Sidecar Pipeline
-This model extends [SmolVLM2-500M-Video-Instruct](HuggingFaceTB/SmolVLM2-500M-Video-Instruct) with two
-lightweight sidecar modules for grounded spatial reasoning:
-| Module | Params | Purpose |
 |---|---|---|
-| **DepthBridge** | ~760 K | Fuses Depth-Anything-V2-Metric depth maps into SigLIP patch tokens via a gated residual |
-| **ObjectAnchorProjector** | ~1.3 K | Projects YOLOv8-World CLIP embeddings into LM anchor tokens |
-## Sidecar config flags
-```python
-config.depth_integration  = True   # enables DepthBridge
-config.object_integration = True   # enables ObjectAnchorProjector (train first)
 ```
-## Loading
 ```python
 from transformers import AutoProcessor, AutoModelForImageTextToText
-model     = AutoModelForImageTextToText.from_pretrained("anuragpradhan/SmolVLM2-500M-DepthAwareVLM")
-processor = AutoProcessor.from_pretrained("anuragpradhan/SmolVLM2-500M-DepthAwareVLM")
 ```
-## Fine-tuning note
-DepthBridge gate α is initialised at 0.0 (depth is inactive until trained).
-Run `model.freeze_base_models()` to train only the sidecar modules.

 ---
 license: apache-2.0
 base_model: HuggingFaceTB/SmolVLM2-500M-Video-Instruct
+language:
+  - en
 tags:
   - smolvlm
+  - vision-language-model
   - depth-estimation
   - object-detection
+  - spatial-reasoning
   - multimodal
+  - depth-aware
+  - metric-depth
+pipeline_tag: image-text-to-text
 ---
+# SmolVLM2-500M-DepthAwareVLM
+**SmolVLM2-500M-DepthAwareVLM** extends [SmolVLM2-500M-Video-Instruct](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct)
+with a lightweight sidecar pipeline that fuses **metric depth maps** (from Depth-Anything-V2) and
+**object detection anchors** (from YOLOv8-World) directly into the vision-language forward pass.
+The goal is grounded spatial reasoning such as *"How far is the car?"*. Basic depth-hint prompting,
+where depth is supplied as text in the prompt, works without any fine-tuning.
+---
+## Architecture
+```
+                  Image (RGB)
+                      |
+           +----------+----------+
+           |                     |
+    SigLIP ViT-B/16          Depth-Anything-V2
+    (Vision Encoder)         Metric-Outdoor-Small
+    86.4M params             (external, not saved)
+           |                     |
+    Patch embeddings         Depth map (H x W, metres)
+           |                     |
+           +----> DepthBridge <--+   <- NEW (262 K params)
+                  Gated residual fusion
+                  gate alpha = 0.0 at init, learns during fine-tuning
+           |
+    Connector (pixel-shuffle + MLP)
+    11.8M params
+           |
+    LM token sequence
+           |
+    [Optional] ObjectAnchorProjector  <- NEW (498 K params)
+               YOLOv8-World detections -> K anchor tokens appended
+           |
+    SmolLM2 Language Model (Llama backbone)
+    361.9M params
+           |
+         Answer
+```
+---
+## Parameter Breakdown
+| Component | Parameters | % of Total |
 |---|---|---|
+| Vision encoder (SigLIP) | 86,433,024 | 17.006% |
+| Connector (pixel-shuffle MLP) | 11,796,480 | 2.321% |
+| Language model (SmolLM2) | 361,944,000 | 71.215% |
+| **DepthBridge** (sidecar) | 262,913 | 0.052% |
+| **ObjectAnchorProjector** (sidecar) | 498,240 | 0.098% |
+| **Sidecar total** | **761,153** | **0.150%** |
+| **GRAND TOTAL** | **508,243,457** | 100% |
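+The two sidecar modules account for only **0.15%** of the 508M total; the base model stays frozen. The five itemised components sum to 460,934,657 parameters, so the remaining 47,308,800 (9.308% of the grand total) are not broken out in this table.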
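As a quick arithmetic check, the sketch below recomputes the shares from the counts listed in the table. The dictionary keys are illustrative labels chosen for this example; only the numbers come from the table.

```python
# Parameter counts copied from the table above; key names are illustrative.
components = {
    "vision_encoder": 86_433_024,
    "connector": 11_796_480,
    "language_model": 361_944_000,
    "depth_bridge": 262_913,
    "object_anchor_projector": 498_240,
}
grand_total = 508_243_457

sidecar_total = components["depth_bridge"] + components["object_anchor_projector"]
itemised_total = sum(components.values())
# Parameters included in the grand total but not listed as a table row.
not_itemised = grand_total - itemised_total

for name, count in components.items():
    print(f"{name:26s} {count:>12,d} {100 * count / grand_total:7.3f}%")
print(f"{'sidecar_total':26s} {sidecar_total:>12,d} {100 * sidecar_total / grand_total:7.3f}%")
print(f"{'not_itemised':26s} {not_itemised:>12,d} {100 * not_itemised / grand_total:7.3f}%")
```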
+---
+## Sidecar Modules
+### 1. DepthBridge
+- **Input:** Metric depth map `(B, 1, H, W)` from Depth-Anything-V2-Metric-Outdoor-Small
+- **Architecture:** `Conv2d(1->256, k=16, s=16)` -> `LayerNorm(256)` -> `Linear(256->768)`
+- **Fusion:** Gated residual: `patch_emb = patch_emb + gate * depth_features`
+- **Gate alpha:** Initialised at **0.0**, so depth fusion is inactive at init; the gate value is learned during fine-tuning
+- **Effect:** Vision patches receive metric depth context at the embedding level, before the connector
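The gated residual above can be illustrated with plain Python lists. This is a minimal sketch, not the model's code: `gated_residual` and `depth_token_count` are hypothetical helpers, the 3-dimensional tokens are toy values, and the 512 x 512 input size is an assumption used only to show how a stride-16 convolution maps a depth map to a patch grid.

```python
def gated_residual(patch_emb, depth_features, gate):
    """Return patch_emb + gate * depth_features, element-wise per token."""
    return [
        [p + gate * d for p, d in zip(p_tok, d_tok)]
        for p_tok, d_tok in zip(patch_emb, depth_features)
    ]


def depth_token_count(height, width, patch=16):
    """Number of depth tokens produced by a Conv2d with kernel = stride = patch."""
    return (height // patch) * (width // patch)


# Two toy tokens with three features each.
patch_emb = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
depth_features = [[10.0, 10.0, 10.0], [-4.0, 2.0, 8.0]]

# gate = 0.0 reproduces the input exactly, so depth has no effect at init.
at_init = gated_residual(patch_emb, depth_features, gate=0.0)
# A non-zero gate (arbitrary value here) mixes depth features into each token.
after_training = gated_residual(patch_emb, depth_features, gate=0.1)

print(at_init == patch_emb)
print(depth_token_count(512, 512))
```

Because the gate multiplies the whole depth branch, a zero gate means the sidecar cannot change the base model's behaviour until training moves it away from zero.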
+### 2. ObjectAnchorProjector
+- **Input:** YOLOv8-World detections — bounding boxes `(K, 4)` + CLIP class embeddings `(K, 512)` + depth `(K, 1)`
+- **Architecture:** `Linear(517->960)` -> `LayerNorm(960)`
+- **Fusion:** K anchor tokens appended to the LM input sequence after image-text merging
+- **Note:** Enable only after fine-tuning. Randomly initialised weights inject noise into the LM input sequence; until the projector is trained, set `config.object_integration = False`
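The reported sidecar sizes can be reproduced from the layer shapes listed above. The sketch below rests on two assumptions inferred from the totals rather than stated in the card: both `Linear` layers have no bias term, and the DepthBridge gate is a single scalar. The helper functions are illustrative.

```python
def conv2d_params(c_in, c_out, kernel, bias=True):
    """Weights of a square-kernel Conv2d, plus one bias per output channel."""
    return c_in * c_out * kernel * kernel + (c_out if bias else 0)


def linear_params(d_in, d_out, bias=True):
    """Weights of a Linear layer, plus one bias per output feature."""
    return d_in * d_out + (d_out if bias else 0)


def layernorm_params(dim):
    """LayerNorm with learnable scale and shift."""
    return 2 * dim


# DepthBridge: Conv2d(1->256, k=16, s=16) -> LayerNorm(256) -> Linear(256->768) + gate
depth_bridge = (
    conv2d_params(1, 256, 16)
    + layernorm_params(256)
    + linear_params(256, 768, bias=False)  # assumption: bias-free projection
    + 1                                    # assumption: scalar gate alpha
)

# ObjectAnchorProjector: Linear(517->960) -> LayerNorm(960)
anchor_in = 4 + 512 + 1  # box coordinates + CLIP class embedding + depth value
object_anchor = (
    linear_params(anchor_in, 960, bias=False)  # assumption: bias-free projection
    + layernorm_params(960)
)

print(depth_bridge, object_anchor, depth_bridge + object_anchor)
```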
+---
+## Inference Pipeline
 ```
+Input image
+   |--- Depth-Anything-V2-Metric-Outdoor-Small ---> depth_map  (H x W, metres)
+   |--- YOLOv8-World (open-vocab) ----------------> boxes, class_emb, depth_vals
+   |
+   +-> SmolVLM2-500M-DepthAwareVLM
+           (depth_map fused via DepthBridge)
+           (detections passed as text hint pre-fine-tuning)
+           |
+        Answer: "The car is 10.81 metres away."
+```
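Before fine-tuning, the pipeline above passes detections to the model as a text hint. A minimal sketch of building such a hint follows. The helper names `box_depth_metres` and `depth_hint` are hypothetical, and taking the median depth inside the box is an assumption; the card does not say how per-object depth values are pooled.

```python
from statistics import median


def box_depth_metres(depth_map, box):
    """Median depth inside an (x1, y1, x2, y2) pixel box; x2 and y2 are exclusive."""
    x1, y1, x2, y2 = box
    values = [depth_map[y][x] for y in range(y1, y2) for x in range(x1, x2)]
    return median(values)


def depth_hint(label, metres):
    """Format one detection as a sentence to prepend to the user prompt."""
    return f"[Depth sensor] The {label} is {metres:.2f} metres away."


# Toy 4 x 4 depth map in metres: a near object on a far background.
depth_map = [
    [30.0, 30.0, 30.0, 30.0],
    [30.0, 10.75, 10.81, 30.0],
    [30.0, 10.81, 10.90, 30.0],
    [30.0, 30.0, 30.0, 30.0],
]

hint = depth_hint("car", box_depth_metres(depth_map, (1, 1, 3, 3)))
print(hint)
```

The median is used here because it is less sensitive than the mean to background pixels that fall inside a loose bounding box.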
+---
+## Usage
+### Basic inference (PyTorch)
 ```python
+import torch
+from PIL import Image
 from transformers import AutoProcessor, AutoModelForImageTextToText
+MODEL_ID = "anuragpradhan/SmolVLM2-500M-DepthAwareVLM"
+model     = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype=torch.float32)
+processor = AutoProcessor.from_pretrained(MODEL_ID)
+# depth_integration=True is already in the saved config
+# DepthBridge is reconstructed automatically by SmolVLMModel.__init__
+image = Image.open("your_image.jpg").convert("RGB")
+messages = [
+    {"role": "user", "content": [
+        {"type": "image"},
+        {"type": "text", "text": "What is happening in this scene?"},
+    ]}
+]
+prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
+inputs = processor(images=image, text=prompt, return_tensors="pt")
+# Optional: pass a metric depth map (normalised to [0,1]) from Depth-Anything-V2
+depth_map = inputs.pop("depth_pixel_values", None)
+with torch.no_grad():
+    output_ids = model.generate(**inputs, depth_maps=depth_map, max_new_tokens=200)
+n = inputs["input_ids"].shape[1]
+answer = processor.batch_decode(output_ids[:, n:], skip_special_tokens=True)[0].strip()
+print(answer)
 ```
+### Full sidecar demo
+```bash
+# Clone repo and install editable transformers
+git clone https://github.com/huggingface/transformers
+cd transformers && pip install -e ".[dev]"
+pip install ultralytics num2words
+# Run the sidecar demo
+cd examples
+python sidecar_depth_demo.py your_image.jpg "What is the depth of the car?"
+```
+### Fine-tuning (sidecar modules only)
+```python
+from transformers import AutoModelForImageTextToText
+model = AutoModelForImageTextToText.from_pretrained("anuragpradhan/SmolVLM2-500M-DepthAwareVLM")
+# Freeze the 508M base model, train only the 761K sidecar params
+model.freeze_base_models()
+trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
+print(f"Trainable params: {trainable:,}")  # ~761,153
+```
+---
+## External Models Required
+| Model | Purpose | HF ID |
+|---|---|---|
+| Depth-Anything-V2-Metric-Outdoor-Small | Metric depth map generation | `depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf` |
+| YOLOv8-World | Open-vocabulary object detection | `yolov8s-world.pt` (ultralytics) |
+---
+## Config Flags
+| Flag | Default | Effect |
+|---|---|---|
+| `depth_integration` | `True` | Instantiates DepthBridge; passes depth maps through gated residual |
+| `object_integration` | `True` | Instantiates ObjectAnchorProjector; appends anchor tokens to sequence |
+| `depth_hidden_dim` | `256` | Intermediate channels in DepthBridge Conv2d |
+| `object_feature_dim` | `512` | CLIP embedding dimension from YOLOv8-World |
+| `max_objects` | `20` | Max YOLO detections per image |
+| `depth_gate_init` | `0.0` | Initial value of DepthBridge gate (0 = depth inactive at init) |
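To show how these flags relate to each other, here is a small sketch using a stand-in dataclass. `SidecarConfig` is a hypothetical container written for this example, not the model's actual config class; the defaults are copied from the table, and the derived `anchor_input_dim` assumes the projector input is the concatenation of box coordinates, class embedding and one depth value.

```python
from dataclasses import dataclass


@dataclass
class SidecarConfig:
    """Hypothetical stand-in mirroring the flags and defaults in the table."""
    depth_integration: bool = True
    object_integration: bool = True
    depth_hidden_dim: int = 256
    object_feature_dim: int = 512
    max_objects: int = 20
    depth_gate_init: float = 0.0

    @property
    def anchor_input_dim(self) -> int:
        # 4 box coordinates + class embedding + 1 depth value per detection.
        return 4 + self.object_feature_dim + 1


default_cfg = SidecarConfig()
# Before the projector is trained, anchor tokens are switched off for inference.
inference_cfg = SidecarConfig(object_integration=False)

print(default_cfg.anchor_input_dim)
print(inference_cfg.object_integration)
```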
+---
+## Limitations
+- **Not fine-tuned for depth tasks.** DepthBridge gate alpha = 0.0 at initialisation; depth fusion is inactive
+  until fine-tuned on metric-depth QA data.
+- **ObjectAnchorProjector is randomly initialised.** Enabling it before fine-tuning adds noise, so set
+  `config.object_integration = False` for inference until it has been trained (the saved default is `True`).
+- **Text hint dependency.** Pre-fine-tuning, depth information is injected via a text prompt hint
+  (e.g. `"[Depth sensor] The car is 10.81 metres away."`). The model reads this textually.
+- **Base model limitations apply.** SmolVLM2-500M is a small model; complex spatial reasoning
+  requires the sidecar fine-tuning stage.
+---
+## Citation
+```bibtex
+@misc{smolvlm2-depthawarevlm,
+  title   = {SmolVLM2-500M-DepthAwareVLM: Sidecar Depth and Object Grounding for Vision-Language Models},
+  author  = {Anurag Pradhan},
+  year    = {2025},
+  url     = {https://huggingface.co/anuragpradhan/SmolVLM2-500M-DepthAwareVLM},
+  note    = {Built on SmolVLM2-500M-Video-Instruct with DepthBridge and ObjectAnchorProjector sidecar modules}
+}
+```
+---
+## Acknowledgements
+- [SmolVLM2](https://huggingface.co/HuggingFaceTB/SmolVLM2-500M-Video-Instruct) by Hugging Face
+- [Depth Anything V2](https://huggingface.co/depth-anything/Depth-Anything-V2-Metric-Outdoor-Small-hf)
+- [YOLOv8-World](https://github.com/ultralytics/ultralytics)

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fad09e4f72fe9b68a9d96c49ce3076346137f8a6e996973cbe73a1c3cab241bf
 size 2033036156

 version https://git-lfs.github.com/spec/v1
+oid sha256:52d1a4ba171ce0ea9df9f831a6dc43ad06e0bd34d1a28d5526f52b720a781a1a
 size 2033036156