---
license: other
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- visual-chain-of-thought
- visual-reasoning
- multimodal
- grounding
- spatial-reasoning
- qwen2_5_vl
datasets:
- Y-Research-Group/VisReason
---

# VisReason-Pro-Qwen2.5-VL-7B

The **main VisReason model** from our ECCV 2026 paper. It builds on
**[VisReason-Qwen2.5-VL-7B](https://huggingface.co/Y-Research-Group/VisReason-Qwen2.5-VL-7B)**
and is further trained on **VisReason-Pro**, the high-fidelity subset (~165K, the GQA portion)
produced under a stronger GPT-4.1-series annotator with **depth-informed 3D grounding**. This
additional training strengthens spatially grounded, multi-round visual Chain-of-Thought
reasoning over small objects and complex 2D/3D relations.

This checkpoint is the primary model evaluated across our benchmark suite (fine-grained
grounding, multi-round visual CoT, MME, POPE, V*).

## Training

- **Base model:** `Qwen/Qwen2.5-VL-7B-Instruct`
- **Method:** LoRA supervised fine-tuning, continued from the VisReason base model and
  further trained on the VisReason-Pro subset; the result is merged into the base weights
- **Data:** [VisReason](https://huggingface.co/datasets/Y-Research-Group/VisReason) +
  VisReason-Pro (depth-grounded GQA subset)
- **Framework:** LLaMA-Factory

## Usage

The model is trained in a tool-calling chat format: it wraps reasoning in `<think>...</think>`,
optionally emits a single `image_zoom_in_tool` call with a **ratio-based** `bbox_2d`
(`[x1,y1,x2,y2]` in `[0,1]`) to crop the current view, and outputs the final answer in
`<answer>...</answer>`. Load with `transformers` (`Qwen2_5_VLForConditionalGeneration`) or
serve with vLLM, using the standard Qwen2.5-VL processor.
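
The output format above can be post-processed with a few small helpers. The following is a minimal sketch, not the reference implementation. It assumes the tool call is emitted Qwen-style as JSON inside `<tool_call>...</tool_call>` with `name` and `arguments` keys; this card does not specify the exact wrapper, so adjust the parsing to match the chat template shipped with the checkpoint. The helper names (`extract_tag`, `extract_zoom_bbox`, `ratio_bbox_to_pixels`) are hypothetical.

```python
import json
import re
from typing import List, Optional, Tuple


def extract_tag(text: str, tag: str) -> Optional[str]:
    """Return the stripped contents of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def extract_zoom_bbox(text: str) -> Optional[List[float]]:
    """Return the ratio-based bbox_2d of an image_zoom_in_tool call, if any.

    ASSUMPTION: the call is JSON inside <tool_call>...</tool_call> with
    "name" and "arguments" keys (Qwen-style). Not confirmed by this card.
    """
    payload = extract_tag(text, "tool_call")
    if payload is None:
        return None
    try:
        call = json.loads(payload)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or call.get("name") != "image_zoom_in_tool":
        return None
    arguments = call.get("arguments")
    bbox = arguments.get("bbox_2d") if isinstance(arguments, dict) else None
    if isinstance(bbox, list) and len(bbox) == 4:
        return [float(v) for v in bbox]
    return None


def ratio_bbox_to_pixels(
    bbox: List[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Map a ratio bbox [x1, y1, x2, y2] in [0, 1] to pixel coordinates."""
    # Clamp to [0, 1] so slightly out-of-range predictions stay inside the image.
    x1, y1, x2, y2 = (min(max(v, 0.0), 1.0) for v in bbox)
    # Order the corners so the box is always (left, top, right, bottom).
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )
```

The tuple returned by `ratio_bbox_to_pixels` follows the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so the zoomed view can be produced with `image.crop(box)` and passed back to the model for the next round. When the response contains no tool call, `extract_tag(response, "answer")` yields the final answer.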

## Citation

```bibtex
@inproceedings{visreason2026,
  title     = {VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning},
  author    = {Lingxiao Li and Yifan Wang and Xinyan Gao and Chen Tang and Xiangyu Yue and Chenyu You},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
```