---
license: apache-2.0
language:
- zh
- en
tags:
- multimodal
- vision-language
- mechanical-drawing
- vqa
- mechvqa
base_model: Qwen/Qwen3-VL-4B-Instruct
pipeline_tag: visual-question-answering
library_name: transformers
---

# MechVL-4B-SFT

The **SFT checkpoint** of **MechVL**, the domain-specialized multimodal model for mechanical engineering drawing understanding, introduced in:

> **MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding** (ICML 2026)

[![arXiv](https://img.shields.io/badge/arXiv-2605.30794-b31b1b.svg)](https://arxiv.org/abs/2605.30794)
[![Code](https://img.shields.io/badge/Code-GitHub-181717.svg)](https://github.com/xiaofengShi/MechVQA)

## Model description

MechVL-4B-SFT is initialized from `Qwen3-VL-4B-Instruct` and trained on the MechVQA training split with **full-parameter SFT** of the LLM module; the vision encoder and projection are kept frozen. It serves as the **reference policy (π_ref)** for the subsequent RL stage.

| | |
|---|---|
| Base model | Qwen3-VL-4B-Instruct |
| Architecture | Qwen3VLForConditionalGeneration |
| Stage | 1 / 2: SFT (→ RL) |
| MechVQA Total | **76.36** |
| RL checkpoint | [MonteXiaofeng/MechVL-4B-RL](https://huggingface.co/MonteXiaofeng/MechVL-4B-RL) |

## Usage (transformers)

```python
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

# Load the checkpoint in bfloat16 and let transformers place it on the available devices.
model = AutoModelForImageTextToText.from_pretrained(
    "MonteXiaofeng/MechVL-4B-SFT", dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("MonteXiaofeng/MechVL-4B-SFT")

# One user turn: a drawing plus a question about it.
messages = [{"role": "user", "content": [
    {"type": "image", "url": "path/to/drawing.png"},
    # English: "What is the total length of the part as annotated in the drawing?"
    {"type": "text", "text": "图纸中标注的零件总长度是多少?"},
]}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024)

# generate() returns prompt + completion; decode only the newly generated tokens.
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```

For **batch vLLM inference** (SFT/RL dual-mode), see [`scripts/batch_infer.py`](https://github.com/xiaofengShi/MechVQA/blob/main/scripts/batch_infer.py).

## Training

Full-parameter SFT on the LLM module (vision tower frozen) over the MechVQA training split, with a unified response schema (rationale + concise final answer). See [§4.1 of the paper](https://arxiv.org/abs/2605.30794).

## Citation

```bibtex
@misc{kou2026mechvqabenchmarkingenhancingmultimodal,
  title={MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding},
  author={Qian Kou and Xiaofeng Shi and Yulin Li and Xiaosong Qiu and Xinyang Wang and Hua Zhou and Cao Dongxing},
  year={2026},
  eprint={2605.30794},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.30794}
}
```

## License

Apache-2.0.
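
## Example: building messages for several questions

When asking several questions about one or more drawings, it can help to build the chat messages programmatically. The following is a minimal sketch, not part of the MechVQA repository: `build_messages` and `build_conversations` are hypothetical helper names, and the file names and questions are placeholders. The message layout simply mirrors the one shown in the Usage section.

```python
def build_messages(image, question):
    """Return a single-turn conversation for one drawing and one question.

    `image` is a local path or URL. Hypothetical helper: the dict layout
    copies the message format from the Usage section of this card.
    """
    return [{
        "role": "user",
        "content": [
            {"type": "image", "url": str(image)},
            {"type": "text", "text": question},
        ],
    }]


def build_conversations(pairs):
    """Turn an iterable of (image, question) pairs into a list of conversations."""
    return [build_messages(image, question) for image, question in pairs]


# Placeholder inputs, for illustration only.
conversations = build_conversations([
    ("drawings/shaft.png", "What is the total length of the part?"),
    ("drawings/shaft.png", "Which surface roughness is specified?"),
])
print(len(conversations))
```

Each element of `conversations` can then be passed as `messages` to `processor.apply_chat_template(...)` in the same way as in the Usage section, one conversation per call.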