---
license: apache-2.0
library_name: transformers
pipeline_tag: robotics
tags:
- robotics
- vision-language-action
- vla
- manipulation
- qwen3-vl
- depth
- 3d-trajectory
base_model:
- Qwen/Qwen3-VL-8B-Instruct
---

# 3D HAMSTER: Bridging Planning and Control in Hierarchical Vision Language Action Models through 3D Trajectory Guidance

🎉 **Accepted to IROS 2026** — IEEE/RSJ International Conference on Intelligent Robots and Systems

📄 [arXiv](https://arxiv.org/abs/2606.31329) &nbsp;·&nbsp; 🌐 [Project Page](https://davian-robotics.github.io/3D_HAMSTER/) &nbsp;·&nbsp; 💻 [Code](https://github.com/DAVIAN-Robotics/3D_HAMSTER)

3D HAMSTER is a **depth-aware VLM planner** that predicts metrically grounded **3D end-effector trajectories** directly from a single RGB-D observation and a language instruction. Unlike 2D planners whose pixel waypoints inherit whatever depth lies beneath them, 3D HAMSTER plans in metric 3D space, so the trajectory stays geometrically grounded and can feed straight into a point-cloud low-level policy.

This repository hosts the **planner checkpoint** — a single self-contained checkpoint (9B, bf16) that bundles the Qwen3-VL LLM, the vision encoder, the geometry merger, **and** the frozen LingBot-Depth geometry encoder weights.

## Usage

This is a **custom architecture** (`Qwen3VLGeometryForConditionalGeneration`) and requires the
[`hamster3d`](https://github.com/DAVIAN-Robotics/3D_HAMSTER) package (which vendors the geometry-encoder code). No separate LingBot-Depth download is needed — the encoder code is in the package and its weights are in this checkpoint.

```bash
# 1. Install the inference code
git clone https://github.com/DAVIAN-Robotics/3D_HAMSTER.git
cd 3D_HAMSTER && pip install -e .

# 2. Download this checkpoint into ./ckpt
hf download DAVIAN-Robotics/3D_HAMSTER --local-dir ckpt
```

```python
from hamster3d.inference import Hamster3DPredictor
import numpy as np
from PIL import Image

predictor = Hamster3DPredictor("ckpt/")          # device="cuda:0", bf16 by default

rgb = Image.open("examples/sample_0_rgb.png")
depth = np.load("examples/sample_0_depth.npy")   # float32, meters, shape (H, W)
with open("examples/sample_0_instruction.txt") as f:   # close the file after reading
    instruction = f.read().strip()

result = predictor.predict(rgb, depth, instruction)   # 3D trajectory prediction
print(result["waypoints"])   # [[u, v, depth], ...]  pixel u,v (0-1000) + metric depth (m)
print(result["actions"])     # ["Close Gripper", None, ..., "Open Gripper"]
```
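
RGB-D cameras often save depth as 16-bit millimeter images, while the predictor expects `float32` meters with shape `(H, W)`. The following is a minimal conversion sketch, assuming a `uint16` millimeter source and a depth map at the same resolution as the RGB image. The helper name `depth_mm_to_meters` is hypothetical and not part of `hamster3d`; how the model treats invalid (zero) depth pixels is not specified here, so they are left unchanged.

```python
import numpy as np


def depth_mm_to_meters(depth_raw: np.ndarray, rgb_size: tuple[int, int]) -> np.ndarray:
    """Convert a raw millimeter depth image to float32 meters.

    Hypothetical helper (not part of hamster3d).
    rgb_size is (width, height), the order used by PIL's Image.size.
    """
    if depth_raw.ndim != 2:
        raise ValueError(f"expected a 2D depth map, got shape {depth_raw.shape}")
    width, height = rgb_size
    # Assumption: depth is pixel-aligned with the RGB frame at the same resolution.
    if depth_raw.shape != (height, width):
        raise ValueError(
            f"depth shape {depth_raw.shape} does not match RGB size {(height, width)}"
        )
    # Millimeters -> meters; zero (invalid) pixels stay zero.
    return depth_raw.astype(np.float32) / 1000.0
```

With a PIL image `rgb`, the call would be `depth = depth_mm_to_meters(depth_raw, rgb.size)`.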

**Inputs:** an RGB image (any resolution; auto-resized to 640 px on the longest edge), a **metric depth map** (`float32`, meters, aligned to the RGB frame), and a language instruction.

**Output:** a metric 3D end-effector trajectory: `[u, v, depth]` waypoints with per-waypoint gripper actions.
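
To hand the `[u, v, depth]` waypoints to a point-cloud policy, they can be back-projected into camera-frame XYZ with a standard pinhole model. The sketch below rests on several assumptions that are not confirmed by this card: `u` and `v` are normalized to 0-1000 across the image width and height, `depth` is the distance along the optical axis, and the camera intrinsics `(fx, fy, cx, cy)` are known in pixels. The function name `waypoints_to_camera_xyz` is hypothetical and not part of `hamster3d`.

```python
import numpy as np


def waypoints_to_camera_xyz(waypoints, image_size, intrinsics):
    """Back-project [u, v, depth] waypoints to camera-frame XYZ in meters.

    Hypothetical helper (not part of hamster3d). Assumptions:
      - u, v are normalized to [0, 1000] over image width / height
      - depth is z along the optical axis, in meters
      - intrinsics = (fx, fy, cx, cy) in pixels for image_size = (width, height)
    """
    wp = np.asarray(waypoints, dtype=np.float64).reshape(-1, 3)
    width, height = image_size
    fx, fy, cx, cy = intrinsics

    # Normalized 0-1000 coordinates -> pixel coordinates.
    u_px = wp[:, 0] / 1000.0 * width
    v_px = wp[:, 1] / 1000.0 * height
    z = wp[:, 2]

    # Pinhole back-projection.
    x = (u_px - cx) * z / fx
    y = (v_px - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```

For example, `waypoints_to_camera_xyz(result["waypoints"], rgb.size, (fx, fy, cx, cy))` would return an `(N, 3)` array, with the intrinsics taken from your own camera calibration.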

> The Gradio demo in the [code repo](https://github.com/DAVIAN-Robotics/3D_HAMSTER) additionally supports **2D Trajectory, 2D/3D Pointing, 2D Bounding Box, and General VQA** task styles.
>
> **Download tip:** if `hf download` stalls, disable the xet backend: `HF_HUB_DISABLE_XET=1 hf download DAVIAN-Robotics/3D_HAMSTER --local-dir ckpt`.

## Model Details

| Component | Details |
|---|---|
| Base VLM | Qwen3-VL-8B (Stage-1 pretrained) |
| Geometry encoder | LingBot-Depth — DINOv2 ViT-L/14, **frozen** (~306M params) |
| Fusion | `resize_and_add` (element-wise add after spatial alignment) |
| Training | LoRA (rank 64, α 128) on the LLM + fully trained merger/decoder, with a dense depth-reconstruction loss |
| Precision | bfloat16 |

See the [project page](https://davian-robotics.github.io/3D_HAMSTER/) for benchmarks and qualitative results.
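
The `resize_and_add` fusion named in the table can be illustrated with a small NumPy sketch: the geometry feature grid is resized to the vision feature grid, then added element-wise. This is only an illustration of the idea, not the package's implementation. The nearest-neighbor resize, the `(H, W, C)` layout, and the assumption that both inputs already share a channel width are all choices made for this example.

```python
import numpy as np


def resize_nearest(feat: np.ndarray, out_hw: tuple[int, int]) -> np.ndarray:
    """Nearest-neighbor resize of an (H, W, C) feature grid to (out_h, out_w, C)."""
    in_h, in_w = feat.shape[:2]
    out_h, out_w = out_hw
    # Map each output row / column back to a source index.
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return feat[rows[:, None], cols[None, :]]


def resize_and_add(vision_feat: np.ndarray, geometry_feat: np.ndarray) -> np.ndarray:
    """Illustrative fusion: align geometry features spatially, then add.

    Assumes both grids are (H, W, C) with the same channel count C.
    """
    if vision_feat.shape[-1] != geometry_feat.shape[-1]:
        raise ValueError("channel dimensions must match before the element-wise add")
    aligned = resize_nearest(geometry_feat, vision_feat.shape[:2])
    return vision_feat + aligned
```

Because the fusion is a plain addition, the fused grid keeps the vision grid's shape, so the number of visual tokens passed to the LLM would not change under this scheme.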

## Acknowledgments & Licensing

Released under the **Apache License 2.0**. 3D HAMSTER builds on:

- **[Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)** — base vision-language model.
- **[LingBot-Depth](https://huggingface.co/robbyant/lingbot-depth-pretrain-vitl-14)** (Robbyant, Apache-2.0) — the geometry encoder; its frozen weights are bundled in this checkpoint and its code is vendored in the [`hamster3d`](https://github.com/DAVIAN-Robotics/3D_HAMSTER) package.
- **[DINOv2](https://github.com/facebookresearch/dinov2)** (Meta AI, Apache-2.0) — backbone of the LingBot-Depth encoder.

All bundled components are Apache-2.0; their attributions are retained.

## Citation

```bibtex
@INPROCEEDINGS{hwang20263dhamster,
  author={Hwang, Dongyoon and Lee, Byungkun and Kim, Dongjin and Jang, Hyojin and Jin, Hoiyeong and Mun, Jueun and Park, Minho and Lee, Hojoon and Kim, Hyunseung and Choo, Jaegul},
  booktitle={2026 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
  title={{3D HAMSTER}: Bridging Planning and Control in Hierarchical Vision Language Action Models through {3D} Trajectory Guidance},
  year={2026}}
```