---
license: other
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- visual-chain-of-thought
- visual-reasoning
- multimodal
- grounding
- spatial-reasoning
- qwen2_5_vl
datasets:
- Y-Research-Group/VisReason
---
# VisReason-Pro-Qwen2.5-VL-7B
This is the **main VisReason model** from our ECCV 2026 paper. It is built on
**[VisReason-Qwen2.5-VL-7B](https://huggingface.co/Y-Research-Group/VisReason-Qwen2.5-VL-7B)**
and further trained on **VisReason-Pro**, the high-fidelity subset (~165K, the GQA portion)
produced under a stronger GPT-4.1-series annotator with **depth-informed 3D grounding**.
This additional training is intended to strengthen spatially grounded, multi-round visual
Chain-of-Thought reasoning over small objects and complex 2D/3D relations.
This checkpoint is the primary model evaluated across our benchmark suite (fine-grained
grounding, multi-round visual CoT, MME, POPE, V*).
## Training
- **Base model:** `Qwen/Qwen2.5-VL-7B-Instruct`
- **Method:** LoRA supervised fine-tuning, continued from the VisReason base model and
further trained on the VisReason-Pro subset; the LoRA weights are merged into the base weights
- **Data:** [VisReason](https://huggingface.co/datasets/Y-Research-Group/VisReason) +
VisReason-Pro (depth-grounded GQA subset)
- **Framework:** LLaMA-Factory
## Usage
The model is trained in a tool-calling chat format: it wraps reasoning in `...`,
optionally emits a single `image_zoom_in_tool` call with a **ratio-based** `bbox_2d`
(`[x1,y1,x2,y2]` in `[0,1]`) to crop the current view, and outputs the final answer in
`...`. Load with `transformers` (`Qwen2_5_VLForConditionalGeneration`) or
serve with vLLM, using the standard Qwen2.5-VL processor.
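
Because `bbox_2d` is expressed as ratios rather than pixels, a client that executes the `image_zoom_in_tool` call has to map the box onto the current view before cropping. The helper below is a minimal sketch of that mapping, not the reference tool implementation: the function name `ratio_bbox_to_pixels`, the clamping of out-of-range values, and the floor/ceil rounding are all assumptions made for illustration.

```python
import math


def ratio_bbox_to_pixels(bbox_2d, width, height):
    """Map a ratio-based [x1, y1, x2, y2] box in [0, 1] to integer pixel coordinates.

    Hypothetical helper: clamping and rounding conventions are assumptions,
    not the behavior of the original tool.
    """
    if len(bbox_2d) != 4:
        raise ValueError("bbox_2d must have four values: [x1, y1, x2, y2]")

    # Clamp each ratio into [0, 1] so a slightly out-of-range prediction stays usable.
    x1, y1, x2, y2 = (min(max(float(v), 0.0), 1.0) for v in bbox_2d)
    if x2 <= x1 or y2 <= y1:
        raise ValueError("bbox_2d must satisfy x1 < x2 and y1 < y2")

    # Round outward (floor the top-left, ceil the bottom-right) so the crop
    # never cuts into the region the model asked for.
    left = math.floor(x1 * width)
    upper = math.floor(y1 * height)
    right = min(math.ceil(x2 * width), width)
    lower = min(math.ceil(y2 * height), height)
    return left, upper, right, lower
```

The returned tuple follows the `(left, upper, right, lower)` order that Pillow's `Image.crop` accepts, so for a `PIL.Image` named `image` the crop would be `image.crop(ratio_bbox_to_pixels(bbox, *image.size))`.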
## Citation
```bibtex
@inproceedings{visreason2026,
title = {VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning},
author = {Lingxiao Li and Yifan Wang and Xinyan Gao and Chen Tang and Xiangyu Yue and Chenyu You},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}
```