---
license: apache-2.0
language:
  - zh
  - en
tags:
  - multimodal
  - vision-language
  - mechanical-drawing
  - vqa
  - mechvqa
base_model: Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: visual-question-answering
library_name: transformers
---

# MechVL-4B-SFT

The **SFT checkpoint** of **MechVL**, a domain-specialized multimodal model for understanding mechanical engineering drawings, introduced in:

> **MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding** (ICML 2026)

[![arXiv](https://img.shields.io/badge/arXiv-2605.30794-b31b1b.svg)](https://arxiv.org/abs/2605.30794)
[![Code](https://img.shields.io/badge/Code-GitHub-181717.svg)](https://github.com/xiaofengShi/MechVQA)

## Model description

MechVL-4B-SFT is initialized from `Qwen3-VL-4B-Instruct` and trained on the MechVQA training split with **full-parameter SFT** of the LLM module; the vision encoder and projection layers stay frozen. It serves as the **reference policy (π_ref)** for the subsequent RL stage.

| | |
|---|---|
| Base model | Qwen3-VL-4B-Instruct |
| Architecture | Qwen3VLForConditionalGeneration |
| Stage | 1 of 2: SFT (followed by RL) |
| MechVQA Total | **76.36** |
| RL checkpoint | [MonteXiaofeng/MechVL-4B-RL](https://huggingface.co/MonteXiaofeng/MechVL-4B-RL) |

## Usage (transformers)

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model = AutoModelForImageTextToText.from_pretrained(
    "MonteXiaofeng/MechVL-4B-SFT", dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("MonteXiaofeng/MechVL-4B-SFT")

messages = [{"role": "user", "content": [
    {"type": "image", "url": "path/to/drawing.png"},
    # English: "What is the total length of the part as annotated in the drawing?"
    {"type": "text", "text": "图纸中标注的零件总长度是多少？"},
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
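out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))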
```

For **batch vLLM inference** (SFT/RL dual-mode), see [`scripts/batch_infer.py`](https://github.com/xiaofengShi/MechVQA/blob/main/scripts/batch_infer.py).
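
When several drawings need to be queried, the message structure from the transformers example can be built per sample with a small helper. This is a minimal sketch, not code from the MechVQA repository: `build_messages`, the file names, and the questions are hypothetical and used only for illustration.

```python
from typing import Dict, List, Tuple


def build_messages(samples: List[Tuple[str, str]]) -> List[List[Dict]]:
    """Turn (image_path, question) pairs into one single-turn conversation each."""
    conversations = []
    for image_path, question in samples:
        conversations.append([
            {"role": "user", "content": [
                {"type": "image", "url": image_path},
                {"type": "text", "text": question},
            ]}
        ])
    return conversations


# Hypothetical inputs, for illustration only.
samples = [
    ("drawing_a.png", "What is the overall length of the part?"),
    ("drawing_b.png", "Which surface roughness is specified for the bore?"),
]
batch = build_messages(samples)
print(len(batch), "conversations prepared")
```

Each element of `batch` has the same shape as `messages` in the example above, so it can be passed to `processor.apply_chat_template` in the same way.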

## Training

The LLM module is fine-tuned with full-parameter SFT on the MechVQA training split while the vision encoder and projection layers stay frozen. Training targets follow a unified response schema: a rationale plus a concise final answer. See [§4.1 of the paper](https://arxiv.org/abs/2605.30794).
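
The freezing scheme described here can be sketched as follows. This is an illustration, not the authors' training code. It assumes that the vision encoder and projection parameters carry a `visual` component in their dotted names, which should be checked against `model.named_parameters()` on the actual checkpoint. `freeze_vision_params` and the toy parameter names are hypothetical.

```python
from types import SimpleNamespace


def freeze_vision_params(named_params, frozen_markers=("visual",)):
    """Disable gradients for parameters whose dotted name contains a marker.

    Returns (frozen_names, trainable_names) so the split can be inspected.
    """
    frozen, trainable = [], []
    for name, param in named_params:
        if any(marker in name.split(".") for marker in frozen_markers):
            param.requires_grad = False  # vision side: kept fixed
            frozen.append(name)
        else:
            param.requires_grad = True   # LLM side: updated during SFT
            trainable.append(name)
    return frozen, trainable


# Toy stand-ins for torch parameters; the names are assumptions for illustration.
toy_params = [
    ("model.visual.blocks.0.attn.qkv.weight", SimpleNamespace(requires_grad=True)),
    ("model.visual.merger.linear_fc1.weight", SimpleNamespace(requires_grad=True)),
    ("model.language_model.layers.0.mlp.up_proj.weight", SimpleNamespace(requires_grad=True)),
    ("lm_head.weight", SimpleNamespace(requires_grad=True)),
]
frozen, trainable = freeze_vision_params(toy_params)
print("frozen:", frozen)
print("trainable:", trainable)
```

With a loaded model, the same helper could be called as `freeze_vision_params(model.named_parameters())` before building the optimizer, since PyTorch parameters expose a writable `requires_grad` attribute.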

## Citation

```bibtex
@misc{kou2026mechvqabenchmarkingenhancingmultimodal,
      title={MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding},
      author={Qian Kou and Xiaofeng Shi and Yulin Li and Xiaosong Qiu and Xinyang Wang and Hua Zhou and Cao Dongxing},
      year={2026},
      eprint={2605.30794},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.30794}
}
```

## License

Apache-2.0.