lingxiao2049 committed on
Commit f04ec27 · verified · 1 Parent(s): 91db332

Add model card

  1. README.md +58 -0
README.md ADDED
@@ -0,0 +1,58 @@
+ ---
+ license: other
+ language:
+ - en
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - visual-chain-of-thought
+ - visual-reasoning
+ - multimodal
+ - grounding
+ - spatial-reasoning
+ - qwen2_5_vl
+ datasets:
+ - Y-Research-Group/VisReason
+ ---
+
+ # VisReason-Pro-Qwen2.5-VL-7B
+
+ The **main VisReason model** from our ECCV 2026 paper. It is built on
+ **[VisReason-Qwen2.5-VL-7B](https://huggingface.co/Y-Research-Group/VisReason-Qwen2.5-VL-7B)**
+ and further trained on **VisReason-Pro**, the high-fidelity subset (~165K, the GQA portion)
+ produced under a stronger GPT-4.1-series annotator with **depth-informed 3D grounding**. The
+ goal is to strengthen spatially grounded, multi-round visual Chain-of-Thought reasoning over
+ small objects and complex 2D/3D relations.
+
+ This checkpoint is the primary model evaluated across our benchmark suite (fine-grained
+ grounding, multi-round visual CoT, MME, POPE, V*).
+
+ ## Training
+
+ - **Base model:** `Qwen/Qwen2.5-VL-7B-Instruct`
+ - **Method:** LoRA supervised fine-tuning, continued from the VisReason base model and
+   further trained on the VisReason-Pro subset; the LoRA weights are merged into the base weights
+ - **Data:** [VisReason](https://huggingface.co/datasets/Y-Research-Group/VisReason) +
+   VisReason-Pro (depth-grounded GQA subset)
+ - **Framework:** LLaMA-Factory
+
+ ## Usage
+
+ The model is trained in a tool-calling chat format: it wraps reasoning in `<think>...</think>`,
+ optionally emits a single `image_zoom_in_tool` call with a **ratio-based** `bbox_2d`
+ (`[x1,y1,x2,y2]` in `[0,1]`) to crop the current view, and outputs the final answer in
+ `<answer>...</answer>`. Load with `transformers` (`Qwen2_5_VLForConditionalGeneration`) or
+ serve with vLLM, using the standard Qwen2.5-VL processor.
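
The output format described above could be post-processed along the following lines. This is a minimal sketch, not the authors' inference code: the helper names `parse_response` and `ratio_bbox_to_pixels` are hypothetical, and the serialization of the tool call (JSON inside `<tool_call>...</tool_call>` with `name` and `arguments` keys) is an assumption borrowed from the common Qwen-style convention, which the model card does not confirm. Only the `<think>` and `<answer>` tags, the `image_zoom_in_tool` name, and the ratio-based `bbox_2d` come from the card itself.

```python
import json
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# ASSUMPTION: the tool call is emitted as JSON inside <tool_call> tags.
TOOL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_response(text):
    """Split one model turn into reasoning, optional zoom box, and optional answer."""
    think = THINK_RE.search(text)
    answer = ANSWER_RE.search(text)
    tool = TOOL_RE.search(text)

    bbox = None
    if tool:
        try:
            call = json.loads(tool.group(1))
        except json.JSONDecodeError:
            call = None
        # ASSUMPTION: the call carries "name" and "arguments" keys.
        if isinstance(call, dict) and call.get("name") == "image_zoom_in_tool":
            bbox = call.get("arguments", {}).get("bbox_2d")

    return {
        "think": think.group(1).strip() if think else None,
        "bbox_2d": bbox,
        "answer": answer.group(1).strip() if answer else None,
    }


def ratio_bbox_to_pixels(bbox_2d, width, height):
    """Convert a ratio-based [x1, y1, x2, y2] box in [0, 1] to pixel coordinates."""
    # Clamp each ratio to [0, 1] so a slightly out-of-range box stays inside the image.
    x1, y1, x2, y2 = (min(max(float(v), 0.0), 1.0) for v in bbox_2d)
    left, right = sorted((x1, x2))
    top, bottom = sorted((y1, y2))
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )
```

The returned tuple is ordered `(left, upper, right, lower)`, which matches what Pillow's `Image.crop` expects, so the cropped view can be fed back to the model for the next round if the serving loop works that way.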
+
+ ## Citation
+
+ ```bibtex
+ @inproceedings{visreason2026,
+ title = {VisReason: A Large-Scale Dataset for Visual Chain-of-Thought Reasoning},
+ author = {Lingxiao Li and Yifan Wang and Xinyan Gao and Chen Tang and Xiangyu Yue and Chenyu You},
+ booktitle = {European Conference on Computer Vision (ECCV)},
+ year = {2026}
+ }
+ ```