Commit 5e01316 (verified) by XiaofengShi · 1 parent: bc3a13a

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +10 -10
README.md CHANGED
@@ -21,7 +21,7 @@ The **SFT checkpoint** of **MechVL** — the domain-specialized multimodal model
  > **MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding** (ICML 2026)
 
  [![arXiv](https://img.shields.io/badge/arXiv-2605.30794-b31b1b.svg)](https://arxiv.org/abs/2605.30794)
- [![GitHub](https://img.shields.io/badge/Code-GitHub-181717.svg)](https://github.com/xiaofengShi/MechVQA)
+ [![Code](https://img.shields.io/badge/Code-GitHub-181717.svg)](https://github.com/xiaofengShi/MechVQA)
 
  ## Model description
 
@@ -33,21 +33,21 @@ MechVL-4B-SFT is initialized from `Qwen3-VL-4B-Instruct` and trained with **full
  | Architecture | Qwen3VLForConditionalGeneration |
  | Stage | 1 / 2 — SFT (→ RL) |
  | MechVQA Total | **76.36** |
- | RL checkpoint | [xiaofengalg/MechVL-4B-RL](https://modelscope.cn/models/xiaofengalg/MechVL-4B-RL) |
+ | RL checkpoint | [MonteXiaofeng/MechVL-4B-RL](https://huggingface.co/MonteXiaofeng/MechVL-4B-RL) |
 
- ## Usage (ModelScope)
+ ## Usage (transformers)
 
  ```python
  import torch
- from modelscope import AutoModelForCausalLM, AutoProcessor
+ from transformers import AutoProcessor, AutoModelForImageTextToText
 
- model = AutoModelForCausalLM.from_pretrained(
- "xiaofengalg/MechVL-4B-SFT", torch_dtype=torch.bfloat16, device_map="auto"
+ model = AutoModelForImageTextToText.from_pretrained(
+ "MonteXiaofeng/MechVL-4B-SFT", dtype=torch.bfloat16, device_map="auto"
  )
- processor = AutoProcessor.from_pretrained("xiaofengalg/MechVL-4B-SFT")
+ processor = AutoProcessor.from_pretrained("MonteXiaofeng/MechVL-4B-SFT")
 
  messages = [{"role": "user", "content": [
- {"type": "image", "image": "path/to/drawing.png"},
+ {"type": "image", "url": "path/to/drawing.png"},
  {"type": "text", "text": "图纸中标注的零件总长度是多少?"},
  ]}]
  inputs = processor.apply_chat_template(
@@ -58,11 +58,11 @@ out = model.generate(**inputs, max_new_tokens=1024)
  print(processor.decode(out[0], skip_special_tokens=True))
  ```
 
- Also available on [HuggingFace](https://huggingface.co/MonteXiaofeng/MechVL-4B-SFT). For batch vLLM inference, see [`scripts/batch_infer.py`](https://github.com/xiaofengShi/MechVQA/blob/main/scripts/batch_infer.py).
+ For **batch vLLM inference** (SFT/RL dual-mode), see [`scripts/batch_infer.py`](https://github.com/xiaofengShi/MechVQA/blob/main/scripts/batch_infer.py).
 
  ## Training
 
- Full-parameter SFT on the LLM module (vision tower frozen) over the MechVQA training split. See [§4.1 of the paper](https://arxiv.org/abs/2605.30794).
+ Full-parameter SFT on the LLM module (vision tower frozen) over the MechVQA training split, with a unified response schema (rationale + concise final answer). See [§4.1 of the paper](https://arxiv.org/abs/2605.30794).
 
  ## Citation
 
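
Review note: besides switching the loader from `modelscope` to `transformers`, this commit renames the key of the image content entry from `image` to `url`. The sketch below builds the message list in the new shape so the change is easy to see in isolation. `build_messages` is a hypothetical helper introduced only for illustration; the README writes the list literal inline, and nothing here loads the model.

```python
from typing import Any


def build_messages(image_ref: str, question: str) -> list[dict[str, Any]]:
    """Build a single-turn user message in the shape used by the updated README.

    Hypothetical helper for illustration only; it is not part of the README
    or of the MechVQA repository.
    """
    return [{
        "role": "user",
        "content": [
            # This commit changes the key of the image entry from "image" to "url".
            {"type": "image", "url": image_ref},
            {"type": "text", "text": question},
        ],
    }]


# The README's example question, roughly: "What is the total length of the
# part as annotated in the drawing?"
messages = build_messages("path/to/drawing.png", "图纸中标注的零件总长度是多少?")
```

The resulting `messages` would then go to `processor.apply_chat_template(...)` as in the README. The arguments of that call fall outside the hunks in this diff, so they are not reproduced here.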