Visual Question Answering
Transformers
Safetensors
Chinese
English
qwen3_vl
image-text-to-text
multimodal
vision-language
mechanical-drawing
vqa
mechvqa
XiaofengShi committed:
Upload README.md with huggingface_hub
README.md CHANGED
@@ -21,7 +21,7 @@ The **SFT checkpoint** of **MechVL** — the domain-specialized multimodal model
 > **MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding** (ICML 2026)
 
 [](https://arxiv.org/abs/2605.30794)
-[![
+[](https://github.com/xiaofengShi/MechVQA)
 
 ## Model description
 
@@ -33,21 +33,21 @@ MechVL-4B-SFT is initialized from `Qwen3-VL-4B-Instruct` and trained with **full
 | Architecture | Qwen3VLForConditionalGeneration |
 | Stage | 1 / 2 — SFT (→ RL) |
 | MechVQA Total | **76.36** |
-| RL checkpoint | [
+| RL checkpoint | [MonteXiaofeng/MechVL-4B-RL](https://huggingface.co/MonteXiaofeng/MechVL-4B-RL) |
 
-## Usage (
+## Usage (transformers)
 
 ```python
 import torch
-from
+from transformers import AutoProcessor, AutoModelForImageTextToText
 
-model =
-    "
+model = AutoModelForImageTextToText.from_pretrained(
+    "MonteXiaofeng/MechVL-4B-SFT", dtype=torch.bfloat16, device_map="auto"
 )
-processor = AutoProcessor.from_pretrained("
+processor = AutoProcessor.from_pretrained("MonteXiaofeng/MechVL-4B-SFT")
 
 messages = [{"role": "user", "content": [
-    {"type": "image", "
+    {"type": "image", "url": "path/to/drawing.png"},
     {"type": "text", "text": "图纸中标注的零件总长度是多少?"},
 ]}]
 inputs = processor.apply_chat_template(
@@ -58,11 +58,11 @@ out = model.generate(**inputs, max_new_tokens=1024)
 print(processor.decode(out[0], skip_special_tokens=True))
 ```
 
-
+For **batch vLLM inference** (SFT/RL dual-mode), see [`scripts/batch_infer.py`](https://github.com/xiaofengShi/MechVQA/blob/main/scripts/batch_infer.py).
 
 ## Training
 
-Full-parameter SFT on the LLM module (vision tower frozen) over the MechVQA training split. See [§4.1 of the paper](https://arxiv.org/abs/2605.30794).
+Full-parameter SFT on the LLM module (vision tower frozen) over the MechVQA training split, with a unified response schema (rationale + concise final answer). See [§4.1 of the paper](https://arxiv.org/abs/2605.30794).
 
 ## Citation
 
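
The hunks above skip README lines 54 to 57, so the arguments passed to `processor.apply_chat_template` are not visible in this diff. The following is a minimal end-to-end sketch of the usage snippet, not the README's exact code. It assumes the common Transformers pattern for tokenizing a multimodal chat (`add_generation_prompt`, `tokenize`, `return_dict`, `return_tensors`), and the helper names `build_messages` and `answer` are hypothetical. The default repository ID is the one used in the updated README.

```python
def build_messages(image, question):
    """Build a single-turn chat with one image and one text question."""
    return [{"role": "user", "content": [
        {"type": "image", "url": image},
        {"type": "text", "text": question},
    ]}]


def answer(image, question, repo="MonteXiaofeng/MechVL-4B-SFT", max_new_tokens=1024):
    """Load the checkpoint and generate an answer for one drawing (hypothetical helper)."""
    # Heavy imports stay local so build_messages can be used without them.
    import torch
    from transformers import AutoProcessor, AutoModelForImageTextToText

    model = AutoModelForImageTextToText.from_pretrained(
        repo, dtype=torch.bfloat16, device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(repo)

    # Assumed arguments: the README lines containing this call are not shown in the diff.
    inputs = processor.apply_chat_template(
        build_messages(image, question),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Slice off the prompt tokens so only the generated continuation is decoded.
    new_tokens = out[0][inputs["input_ids"].shape[-1]:]
    return processor.decode(new_tokens, skip_special_tokens=True)
```

Calling `answer("path/to/drawing.png", "图纸中标注的零件总长度是多少?")` would download the checkpoint on first use. The README snippet decodes `out[0]` in full, which includes the prompt; the slice here is a design choice to return only the model's reply.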