---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
tags:
- video-understanding
- video-llm
- streaming-video
- arxiv:2603.12262
---

# VST-3B

**Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously**

[📄 Paper](https://arxiv.org/abs/2603.12262) | [🌐 Project Page](https://1ranguan.github.io/VST/) | [💻 Code](https://github.com/1ranguan/VST) | [🤗 Training Data](https://huggingface.co/datasets/Catalan258/VST-Training-Data)

This is the **3B** variant of **Video Streaming Thinking (VST)**, a new paradigm for streaming video understanding that interleaves active reasoning with continuous video consumption, enabling amortized test-time scaling with real-time responsiveness.

## Performance

| Model | OVO-Bench | StreamingBench | VideoMME | LongVideoBench | VideoHolmes |
|---|---|---|---|---|---|
| **VST-3B** | 56.2 | 75.5 | 59.5 | 54.1 | 36.1 |

## Citation

```bibtex
@article{guan2026videostreamingthinking,
  title={Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously},
  author={Yiran Guan and Liang Yin and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Jian Luan and Yuliang Liu and Xiang Bai},
  journal={arXiv preprint arXiv:2603.12262},
  year={2026},
}
```