---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- video-understanding
- video-llm
- streaming-video
- arxiv:2603.12262
---

# VST-3B

**Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously**

[📄 Paper](https://arxiv.org/abs/2603.12262) | [🌐 Project Page](https://1ranguan.github.io/VST/) | [💻 Code](https://github.com/1ranguan/VST) | [🤗 Training Data](https://huggingface.co/datasets/Catalan258/VST-Training-Data)

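This is the **3B** variant of **Video Streaming Thinking (VST)**, based on [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct). VST is a paradigm for streaming video understanding that interleaves active reasoning with continuous video consumption: the model thinks while it watches, so test-time compute is amortized over the stream and responses remain real-time.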
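To make the interleaving idea concrete, here is a minimal toy sketch of a "watch, think, answer" loop. It is not the authors' implementation and does not call the VST model: `StreamingThinker`, `run_stream`, `stub_think`, and `stub_answer` are hypothetical names introduced only for illustration, and clips are represented as plain strings. It only shows how reasoning steps can run as clips arrive, so that a later question is answered from notes that already exist.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class StreamingThinker:
    """Toy loop: think after each incoming clip, answer from accumulated notes."""

    # Hypothetical callables; a real system would invoke a VideoLLM here.
    think: Callable[[str, List[str]], str]   # (clip, notes so far) -> new note
    answer: Callable[[str, List[str]], str]  # (question, notes) -> reply
    notes: List[str] = field(default_factory=list)

    def watch(self, clip: str) -> None:
        # Reasoning runs while the stream is playing, so its cost is spread
        # over the video instead of being paid when a question arrives.
        self.notes.append(self.think(clip, self.notes))

    def ask(self, question: str) -> str:
        # At query time only the answer step runs, using the stored notes.
        return self.answer(question, self.notes)


def run_stream(thinker: StreamingThinker,
               events: Iterable[Tuple[str, str]]) -> List[str]:
    """Process ("clip", text) and ("question", text) events in arrival order."""
    replies = []
    for kind, payload in events:
        if kind == "clip":
            thinker.watch(payload)
        elif kind == "question":
            replies.append(thinker.ask(payload))
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return replies


# Stand-in functions so the sketch runs without any model.
def stub_think(clip: str, notes: List[str]) -> str:
    return f"note {len(notes) + 1}: saw {clip}"


def stub_answer(question: str, notes: List[str]) -> str:
    return f"{question} -> based on {len(notes)} notes"


if __name__ == "__main__":
    thinker = StreamingThinker(think=stub_think, answer=stub_answer)
    events = [
        ("clip", "a door opens"),
        ("clip", "a person enters"),
        ("question", "Who entered?"),
    ]
    print(run_stream(thinker, events))
```

In this sketch the question is answered from two notes written before it arrived, which is the amortization idea in miniature. For actual inference with VST-3B, refer to the code repository linked above.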

## Performance

| Model | OVO-Bench | StreamingBench | VideoMME | LongVideoBench | VideoHolmes |
|---|---|---|---|---|---|
| **VST-3B** | 56.2 | 75.5 | 59.5 | 54.1 | 36.1 |

## Citation

```bibtex
@article{guan2026videostreamingthinking,
      title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously},
      author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
      journal={arXiv preprint arXiv:2603.12262},
      year={2026},
}
```