How to use from the Transformers library
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Y-Research-Group/VisReason-Qwen2.5-VL-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Y-Research-Group/VisReason-Qwen2.5-VL-7B")
model = AutoModelForMultimodalLM.from_pretrained("Y-Research-Group/VisReason-Qwen2.5-VL-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

VisReason-Qwen2.5-VL-7B

Qwen2.5-VL-7B-Instruct fine-tuned on the VisReason dataset to perform human-like, global-to-local visual Chain-of-Thought reasoning: the model forms a holistic hypothesis, then iteratively zooms into salient regions (areas of interest) to gather fine-grained visual evidence before producing a grounded final answer.

This is the base VisReason model (the baseline checkpoint in our experiments) used in the ECCV 2026 paper. For the depth-grounded variant with stronger spatial reasoning, see VisReason-Pro-Qwen2.5-VL-7B.

Training

  • Base model: Qwen/Qwen2.5-VL-7B-Instruct
  • Method: LoRA supervised fine-tuning (2 epochs), then merged into the base weights
  • Data: VisReason training set (~489K multi-round visual-CoT examples)
  • Framework: LLaMA-Factory
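As a rough illustration of what "merged into the base weights" means for a LoRA adapter, the sketch below folds a low-rank update into a dense weight matrix using the standard LoRA formula W' = W + (alpha / r) * B A. The matrices, rank, and alpha are toy values chosen for illustration; the rank, alpha, and target modules actually used for this checkpoint are not stated here, and the function names are hypothetical.

```python
# Toy sketch of merging a LoRA update into a base weight matrix.
# All values below are illustrative assumptions, not this model's hyperparameters.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A).

    w:      out x in base weight
    lora_a: rank x in  down-projection (A)
    lora_b: out x rank up-projection (B)
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # out x in low-rank update
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]   # base weight (2 x 2)
lora_a = [[1.0, 2.0]]          # A: rank 1 x in 2
lora_b = [[0.5], [1.0]]        # B: out 2 x rank 1
merged = merge_lora(w, lora_a, lora_b, alpha=2.0, rank=1)
print(merged)
```

After merging, a single matrix multiply with `merged` gives the same result as applying the base weight and the scaled adapter separately, which is why the merged checkpoint can be loaded without any adapter library.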

Usage

The model is trained in a tool-calling chat format: it wraps reasoning in <think>...</think>, optionally emits a single image_zoom_in_tool call with a ratio-based bbox_2d ([x1,y1,x2,y2] in [0,1]) to crop the current view, and outputs the final answer in <answer>...</answer>. Load with transformers (Qwen2_5_VLForConditionalGeneration) or serve with vLLM, using the standard Qwen2.5-VL processor.
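The output format described above can be post-processed along these lines. This is a minimal sketch, not the authors' inference code: the `<think>` and `<answer>` tags, the tool name `image_zoom_in_tool`, and the ratio-based `bbox_2d` come from the description, while the `<tool_call>` JSON wrapper with `name` and `arguments` keys is an assumption borrowed from the usual Qwen tool-calling convention. The helper names `parse_response` and `ratio_bbox_to_pixels` are hypothetical.

```python
import json
import re


def _extract(tag, text):
    """Return the stripped contents of the first <tag>...</tag> pair, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_response(text):
    """Split one model turn into reasoning, optional zoom box, and final answer."""
    bbox = None
    raw_call = _extract("tool_call", text)  # ASSUMPTION: Qwen-style wrapper tag
    if raw_call is not None:
        payload = json.loads(raw_call)
        if payload.get("name") == "image_zoom_in_tool":
            # ASSUMPTION: arguments are nested under an "arguments" key.
            bbox = payload.get("arguments", {}).get("bbox_2d")
    return {
        "think": _extract("think", text),
        "bbox_2d": bbox,
        "answer": _extract("answer", text),
    }


def ratio_bbox_to_pixels(bbox, width, height):
    """Convert a ratio-based [x1, y1, x2, y2] box in [0, 1] to pixel coordinates."""
    x1, y1, x2, y2 = bbox
    if not (0 <= x1 < x2 <= 1 and 0 <= y1 < y2 <= 1):
        raise ValueError(f"bbox_2d must be ordered ratios in [0, 1], got {bbox}")
    return (round(x1 * width), round(y1 * height), round(x2 * width), round(y2 * height))


# Hand-written example turn, not real model output.
turn = (
    "<think>The label is small; zoom into the lower half.</think>"
    '<tool_call>{"name": "image_zoom_in_tool", '
    '"arguments": {"bbox_2d": [0.25, 0.5, 0.75, 1.0]}}</tool_call>'
)
parsed = parse_response(turn)
crop_box = ratio_bbox_to_pixels(parsed["bbox_2d"], width=640, height=480)
print(parsed["think"], crop_box)
```

The resulting `(left, upper, right, lower)` tuple matches the box argument of Pillow's `Image.crop`, so the cropped view can be fed back to the model as the next image. When a turn contains `<answer>` and no tool call, `parsed["answer"]` holds the final answer and `parsed["bbox_2d"]` is `None`.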

Citation

@inproceedings{visreason2026,
  title     = {VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning},
  author    = {Lingxiao Li and Yifan Wang and Xinyan Gao and Chen Tang and Xiangyu Yue and Chenyu You},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
Safetensors · Model size: 8B params · Tensor type: BF16