Instructions to use Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B with libraries, inference providers, notebooks, and local apps.

Libraries

How to use Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Print the result so it is visible when run as a script
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B")
model = AutoModelForMultimodalLM.from_pretrained("Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
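
The curl request above can also be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_chat_payload` and `chat`, the `max_tokens` default, and the example image URL in the usage comment are illustrative assumptions, not part of the model card. It assumes a server started as shown above is listening on `http://localhost:8000`.

```python
import json
import urllib.request

MODEL_ID = "Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B"


def build_chat_payload(prompt, image_url, model=MODEL_ID, max_tokens=512):
    """Build an OpenAI-style chat payload with one text part and one image part.

    Hypothetical helper: it mirrors the JSON body of the curl example.
    max_tokens=512 is an assumed default, not a recommended setting.
    """
    return {
        "model": model,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(prompt, image_url, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the reply text."""
    payload = build_chat_payload(prompt, image_url)
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(chat("Describe this image in one sentence.", "https://example.com/image.jpg"))
```

Passing `base_url="http://localhost:30000"` should target an SGLang server launched on port 30000 in the same way, since both expose the OpenAI-compatible endpoint.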

Use Docker

docker model run hf.co/Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B

SGLang

How to use Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B with Docker Model Runner:
```
docker model run hf.co/Y-Research-Group/VisReason-Pro-Qwen2.5-VL-7B
```

lingxiao2049 committed 6 days ago

Commit

f04ec27

verified

1 Parent(s): 91db332

Add model card

README.md +58 -0

README.md ADDED

@@ -0,0 +1,58 @@

+---
+license: other
+language:
+- en
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
+pipeline_tag: image-text-to-text
+library_name: transformers
+tags:
+- visual-chain-of-thought
+- visual-reasoning
+- multimodal
+- grounding
+- spatial-reasoning
+- qwen2_5_vl
+datasets:
+- Y-Research-Group/VisReason
+---
+# VisReason-Pro-Qwen2.5-VL-7B
+This is the **main VisReason model** from our ECCV 2026 paper. It builds on
+**[VisReason-Qwen2.5-VL-7B](https://huggingface.co/Y-Research-Group/VisReason-Qwen2.5-VL-7B)**
+and is further trained on **VisReason-Pro**, the high-fidelity subset (~165K, the GQA portion)
+produced under a stronger GPT-4.1-series annotator with **depth-informed 3D grounding**, to
+strengthen spatially grounded, multi-round visual Chain-of-Thought reasoning over small
+objects and complex 2D/3D relations.
+
+This checkpoint is the primary model evaluated across our benchmark suite (fine-grained
+grounding, multi-round visual CoT, MME, POPE, V*).
+## Training
+- **Base model:** `Qwen/Qwen2.5-VL-7B-Instruct`
+- **Method:** LoRA supervised fine-tuning, continued from the VisReason-Qwen2.5-VL-7B
+  checkpoint and further trained on the VisReason-Pro subset; the adapter is merged into the base weights
+- **Data:** [VisReason](https://huggingface.co/datasets/Y-Research-Group/VisReason) +
+  VisReason-Pro (depth-grounded GQA subset)
+- **Framework:** LLaMA-Factory
+## Usage
+The model is trained in a tool-calling chat format: it wraps reasoning in `<think>...</think>`,
+optionally emits a single `image_zoom_in_tool` call with a **ratio-based** `bbox_2d`
+(`[x1,y1,x2,y2]` in `[0,1]`) to crop the current view, and outputs the final answer in
+`<answer>...</answer>`. Load with `transformers` (`Qwen2_5_VLForConditionalGeneration`) or
+serve with vLLM, using the standard Qwen2.5-VL processor.
+## Citation
+```bibtex
+@inproceedings{visreason2026,
+  title     = {VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning},
+  author    = {Lingxiao Li and Yifan Wang and Xinyan Gao and Chen Tang and Xiangyu Yue and Chenyu You},
+  booktitle = {European Conference on Computer Vision (ECCV)},
+  year      = {2026}
+}
+```
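
The output format described in the Usage section of the README can be handled with a small parser. The sketch below extracts the `<think>` and `<answer>` spans and converts a ratio-based `bbox_2d` into a pixel box suitable for cropping. The README does not specify how the `image_zoom_in_tool` call is serialized, so the Qwen-style `<tool_call>{...}</tool_call>` JSON wrapper with `name` and `arguments` keys is an assumption here, and `parse_response` and `ratio_bbox_to_pixels` are hypothetical helper names.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# ASSUMPTION: the tool call is wrapped Qwen-style as <tool_call>{json}</tool_call>.
# The model card only says a single image_zoom_in_tool call may be emitted.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_response(text):
    """Split one model turn into reasoning, optional zoom bbox, and final answer."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    tool = TOOL_CALL_RE.search(text)

    bbox = None
    if tool:
        try:
            call = json.loads(tool.group(1))
        except json.JSONDecodeError:
            call = None
        if isinstance(call, dict) and call.get("name") == "image_zoom_in_tool":
            arguments = call.get("arguments")
            if isinstance(arguments, dict):
                bbox = arguments.get("bbox_2d")

    return {
        "think": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else None,
        "bbox_2d": bbox,
    }


def ratio_bbox_to_pixels(bbox, width, height):
    """Convert [x1, y1, x2, y2] ratios in [0, 1] to integer pixel coordinates."""
    # Clamp to [0, 1] so slightly out-of-range predictions stay inside the image.
    x1, y1, x2, y2 = (min(max(float(v), 0.0), 1.0) for v in bbox)
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )
```

The returned tuple follows the `(left, upper, right, lower)` order that Pillow's `Image.crop` expects, so a multi-round loop could crop the current view with it and send the crop back to the model as the next image.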