---
license: other
language:
- en
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- visual-chain-of-thought
- visual-reasoning
- multimodal
- grounding
- spatial-reasoning
- qwen2_5_vl
datasets:
- Y-Research-Group/VisReason
---

# VisReason-Pro-Qwen2.5-VL-7B

The **main VisReason model** from our ECCV 2026 paper.

Built on **[VisReason-Qwen2.5-VL-7B](https://huggingface.co/Y-Research-Group/VisReason-Qwen2.5-VL-7B)** and further trained on **VisReason-Pro**, the high-fidelity subset (~165K, the GQA portion) produced under a stronger GPT-4.1-series annotator with **depth-informed 3D grounding**, to strengthen spatially-grounded, multi-round visual Chain-of-Thought reasoning over small objects and complex 2D/3D relations. This checkpoint is the primary model evaluated across our benchmark suite (fine-grained grounding, multi-round visual CoT, MME, POPE, V*).

## Training

- **Base model:** `Qwen/Qwen2.5-VL-7B-Instruct`
- **Method:** LoRA supervised fine-tuning, continued from the VisReason base model and further trained on the VisReason-Pro subset; merged into the base weights
- **Data:** [VisReason](https://huggingface.co/datasets/Y-Research-Group/VisReason) + VisReason-Pro (depth-grounded GQA subset)
- **Framework:** LLaMA-Factory

## Usage

The model is trained in a tool-calling chat format: it wraps reasoning in `...`, optionally emits a single `image_zoom_in_tool` call with a **ratio-based** `bbox_2d` (`[x1,y1,x2,y2]` in `[0,1]`) to crop the current view, and outputs the final answer in `...`.

Load with `transformers` (`Qwen2_5_VLForConditionalGeneration`) or serve with vLLM, using the standard Qwen2.5-VL processor.

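Because `bbox_2d` is expressed as ratios rather than pixels, the caller has to map it onto the current view before cropping. The sketch below shows one way to do that. The helper name `ratio_bbox_to_pixels`, the clamping of out-of-range values, and the rounding policy are illustrative assumptions, not the authors' tool implementation. It also assumes `x` runs along the image width and `y` along the height, with `(x1, y1)` as the top-left corner.

```python
import math
from typing import Sequence, Tuple


def ratio_bbox_to_pixels(
    bbox_2d: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Map a ratio-based [x1, y1, x2, y2] box in [0, 1] to a pixel crop box.

    Hypothetical helper: the real image_zoom_in_tool may round or pad differently.
    """
    if len(bbox_2d) != 4:
        raise ValueError("bbox_2d must have four values: [x1, y1, x2, y2]")

    # Assumption: clamp to [0, 1] so slightly out-of-range predictions still work.
    x1, y1, x2, y2 = (min(max(float(v), 0.0), 1.0) for v in bbox_2d)
    if x2 <= x1 or y2 <= y1:
        raise ValueError("bbox_2d must satisfy x1 < x2 and y1 < y2")

    # Round the near edge down and the far edge up so the crop covers the box.
    left, top = int(x1 * width), int(y1 * height)
    # Keep the crop at least one pixel wide and tall, and inside the image.
    right = max(left + 1, min(width, math.ceil(x2 * width)))
    bottom = max(top + 1, min(height, math.ceil(y2 * height)))
    return left, top, right, bottom


# Example: a box over the lower-middle part of an 800x600 view.
box = ratio_bbox_to_pixels([0.25, 0.5, 0.75, 1.0], width=800, height=600)
print(box)
```

The returned tuple follows the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so with a Pillow image it can be passed as `image.crop(box)` to produce the zoomed-in view for the next round.
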
## Citation

```bibtex
@inproceedings{visreason2026,
  title     = {VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning},
  author    = {Lingxiao Li and Yifan Wang and Xinyan Gao and Chen Tang and Xiangyu Yue and Chenyu You},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
```