---
license: apache-2.0
base_model: Qwen/Qwen3-VL-32B-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- qwen3_vl
- vision-language-model
- multimodal
- document-understanding
- long-context
- reasoning
- synthetic-data
- internalized-reasoning
- model-merging
- transformers
---
---
[![Website](https://img.shields.io/badge/LightOn-Website-blue?logo=google-chrome)](https://lighton.ai) [![LinkedIn](https://img.shields.io/badge/LightOn-LinkedIn-0A66C2?logo=linkedin)](https://www.linkedin.com/company/lighton/) [![X](https://img.shields.io/badge/@LightOnIO-X-black?logo=x)](https://x.com/LightOnIO)

📄 [Paper](https://arxiv.org/abs/2604.02371) | 📝 [OriOn Blog](https://www.lighton.ai/lighton-blogs/introducing-orion) | 🔧 [Pipeline Code](https://github.com/lightonai/distilabel/tree/lc_sft_pipelines) | 📊 [Benchmark (MMLBD-C)](https://huggingface.co/datasets/lightonai/MMLBD-C) | 🪐 [OriOn Collection](https://huggingface.co/collections/lightonai/orion)

# OriOn-Qwen Synthetic Reasoning 1

**SOTA on MMLongBenchDoc (58.3), surpassing a 7x larger model.**

This checkpoint extends [OriOn-Qwen](https://huggingface.co/lightonai/OriOn-Qwen) with synthetic reasoning traces that are internalized via low-strength model merging, achieving frontier long-document QA performance with no increase in inference cost.

## TL;DR

We introduce a synthetic reasoning pipeline for long-document VQA: score every page for question relevance, extract evidence, keep the top-K pages sorted by relevance, and use this as a structured `` trace during SFT. Low-strength model merging (α=0.25) then **internalizes** the reasoning: the model does not generate explicit thinking tokens, yet retains the full performance benefit. A `` control token gates the capability at inference time. The result is a 32B model that beats `Qwen3-VL-235B-A22B-Instruct` on MMLongBenchDoc while producing only ~250 mean output tokens.

---

## Highlights

- **SOTA on MMLongBenchDoc** with 58.3 accuracy, surpassing `Qwen3-VL-235B-A22B-Instruct` (57.0) and `Qwen3-VL-235B-A22B-Thinking` (56.2) with **7x fewer parameters**
- **Internalized reasoning** via low-strength model merging: no `` tokens emitted, yet full performance retained
- **Controllable**: place `` in the system prompt to activate reasoning (+3.8 MMLBD when on vs. off)
- **Drop-in replacement** for `Qwen/Qwen3-VL-32B-Instruct`: same `Qwen3VLForConditionalGeneration` + `AutoProcessor` API

---

## How It Works

This checkpoint builds on [OriOn](https://arxiv.org/abs/2602.15257) and extends it with synthetic reasoning traces ([paper](https://arxiv.org/abs/2604.02371)).

### Synthetic reasoning pipeline

Given a document of N pages and a question Q:

1. **Evidence extraction & scoring**: an extractor VLM (`Qwen3-VL-32B-Instruct`) processes each page independently, producing a relevance score in [0, 10] and a natural-language evidence snippet.
2. **Top-K selection**: pages below the threshold are dropped; the top-K (default 24) are kept and sorted by relevance.
3. **Answer generation** through two parallel branches: a **visual branch** (teacher VLM receives top-ranked page images) and a **text branch** (teacher LLM receives only the extracted evidence). Training examples are drawn equally from both.

The relevance-sorted evidence is placed inside `` tags, gated by a `` control token (present in 95% of training examples).

### Internalization via model merging

The final checkpoint is produced by task arithmetic: `θ_merged = θ_base + α · (θ_SFT − θ_base)`. At α=0.25, the model does not emit thinking tokens and its mean output length is comparable to a non-reasoning baseline, yet it retains the full performance gains. Increasing α to 0.5 shifts the model to explicit reasoning with 12.4x more output tokens.

### Why trace design matters

An earlier v1 pipeline visited every page sequentially and marked the irrelevant ones, which taught the model a pathological looping algorithm. The v2 redesign (bounded top-K, relevance-ordered, no irrelevant markers) eliminates this failure mode and yields substantial gains across all primary metrics.

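The top-K selection step and the merge formula described above can be illustrated with a minimal sketch. It is not the released pipeline code: the function names, the threshold value, the tie-break rule, the descending sort order, and the plain-list "checkpoints" are all assumptions made for illustration. The Model Details table mentions CPT + SFT vectors, while this sketch applies only the single-vector formula quoted above.

```python
from typing import Dict, List


def select_top_k(scores: Dict[int, float], threshold: float, k: int = 24) -> List[int]:
    """Return up to k page numbers, most relevant first.

    `scores` maps page number -> relevance score in [0, 10]. The threshold
    value and the tie-break (lower page number first) are assumptions.
    """
    # Drop pages that score below the threshold.
    kept = [(page, score) for page, score in scores.items() if score >= threshold]
    # Sort by descending relevance, then by page number for determinism.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [page for page, _ in kept[:k]]


def merge_task_arithmetic(
    base: Dict[str, List[float]],
    sft: Dict[str, List[float]],
    alpha: float = 0.25,
) -> Dict[str, List[float]]:
    """Apply theta_merged = theta_base + alpha * (theta_sft - theta_base)."""
    if base.keys() != sft.keys():
        raise ValueError("base and SFT checkpoints must share parameter names")
    merged = {}
    for name, base_values in base.items():
        sft_values = sft[name]
        if len(base_values) != len(sft_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Move each weight a fraction alpha of the way from base to SFT.
        merged[name] = [b + alpha * (s - b) for b, s in zip(base_values, sft_values)]
    return merged


# Toy data, made up for illustration only.
page_scores = {1: 2.0, 2: 9.0, 3: 7.5, 4: 0.0, 5: 9.0}
print(select_top_k(page_scores, threshold=5.0, k=2))  # [2, 5]

theta_base = {"w": [1.0, 2.0]}
theta_sft = {"w": [2.0, 4.0]}
print(merge_task_arithmetic(theta_base, theta_sft))  # {'w': [1.25, 2.5]}
```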

---

## Related

| Resource | Description |
| --- | --- |
| **[OriOn-Qwen](https://huggingface.co/lightonai/OriOn-Qwen)** | Base OriOn checkpoint (LongPO, no reasoning) |
| **[OriOn-Mistral](https://huggingface.co/lightonai/OriOn-Mistral)** | Mistral variant with +16.8% MMLBD improvement |
| **[MMLBD-C](https://huggingface.co/datasets/lightonai/MMLBD-C)** | Manually corrected MMLongBenchDoc benchmark |
| **[Pipeline Code](https://github.com/lightonai/distilabel/tree/lc_sft_pipelines)** | Synthetic reasoning pipeline (Apache 2.0 fork of distilabel) |

---

## Benchmarks

### Official MMLongBenchDoc leaderboard

| Model | Acc | Params |
| --- | --- | --- |
| **OriOn-Qwen-SR1 (this model)** | **58.3** | 32B |
| Qwen3-VL-235B-A22B-Instruct | 57.0 | 235B (22B active) |
| Qwen3-VL-235B-A22B-Thinking | 56.2 | 235B (22B active) |
| TeleMM-2.0 | 56.1 | – |
| Qwen3-VL-32B-Instruct | 55.4 | 32B |
| GLM-4.6V | 54.9 | 106B (12B active) |
| GPT-4o | 46.3 | – |

### Full benchmark suite (Qwen3-VL family)

Deltas are relative to the `Qwen3-VL-32B-Instruct` base model.

| Model | VA | LCA | MMLBD | MMLBD-C | MMLB 128K | SlideVQA | HELMET | DUDE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 235B-A22B-Instruct | 98.4 | 98.5 | 54.8 | 56.2 | 78.6 | 84.5 | 67.6 | 59.1 |
| **OriOn-Qwen-SR1 (this model)** | **95.0** (+1.3) | **94.4** (+2.3) | **55.8** (+4.0) | **58.2** (+4.4) | 75.7 (+5.3) | 75.4 (-1.8) | **68.5** (+5.5) | 55.1 (-6.7) |
| LongPO (OriOn-Qwen) | 94.0 (+0.3) | 92.4 (+0.3) | 53.6 (+1.8) | 56.4 (+2.6) | 75.6 (+5.2) | 75.5 (-1.7) | 62.9 (-0.1) | 56.0 (-5.8) |
| 32B-Instruct (base) | 93.7 | 92.1 | 51.8 | 53.8 | 70.4 | 77.2 | 63.0 | 61.8 |

*VA = Visual-LC Average (MMLBD, MMLBD-C, MMLongBench, DUDE, SlideVQA). LCA = VA + HELMET + LongBench v2. See the [paper](https://arxiv.org/abs/2604.02371) for full results including Mistral, control-token ablations and trace-design comparisons.*

---

## Reasoning Behavior

Place `` at the beginning of the system prompt to activate internalized reasoning. This improves performance with only a slight increase in output tokens.

```
System: 
User: What is the average revenue growth across all subsidiaries mentioned in pages 12-45?
```

Without ``, the model still works but performance degrades (e.g. -3.8 MMLBD for Qwen). The model does **not** emit `` tokens at α=0.25; the reasoning is internalized.

---

## Intended Use

This checkpoint is designed for:

- **Long PDF and slide-deck question answering** (up to 250+ pages in a single pass)
- **Multi-page document reasoning** requiring cross-page synthesis
- **Long-context visual document understanding** in enterprise, legal, scientific and financial domains

This is a research checkpoint that retains most of `Qwen/Qwen3-VL-32B-Instruct`'s general capabilities while significantly improving long-document performance.


---

## Usage with Transformers

This model uses the same API as `Qwen/Qwen3-VL-32B-Instruct`:

```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_id = "lightonai/OriOn-Qwen-SR1"

# device_map="auto" lets Accelerate place the weights on the available devices
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Multi-page document QA with reasoning
messages = [
    {"role": "system", "content": ""},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "page1.png"},
            {"type": "image", "url": "page2.png"},
            # ... add all document pages
            {"type": "text", "text": "What are the key findings discussed across this document?"},
        ],
    },
]

# Move the inputs to the device that holds the model's first layers
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
# Keep only the newly generated tokens (drop the prompt)
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
print(processor.decode(generated_ids, skip_special_tokens=True))
```

---

## Usage with vLLM

```bash
vllm serve lightonai/OriOn-Qwen-SR1
```

```python
import base64
import io

import requests
import pypdfium2 as pdfium

ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/OriOn-Qwen-SR1"

# Load and render a multi-page PDF
pdf_data = requests.get("https://arxiv.org/pdf/2412.13663").content
pdf = pdfium.PdfDocument(pdf_data)

# Convert pages to base64 images
page_images = []
for i in range(min(len(pdf), 50)):  # cap at 50 pages for this example
    pil_image = pdf[i].render(scale=2.77).to_pil()
    buffer = io.BytesIO()
    pil_image.save(buffer, format="PNG")
    b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
    page_images.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{b64}"},
    })

payload = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": ""},
        {
            "role": "user",
            "content": [
                *page_images,
                {"type": "text", "text": "Summarize the main contributions of this paper."},
            ],
        },
    ],
    "max_tokens": 4096,
    "temperature": 0.2,
}

response = requests.post(ENDPOINT, json=payload)
print(response.json()["choices"][0]["message"]["content"])
```

---

## Model Details

| | |
| --- | --- |
| **Base model** | `Qwen/Qwen3-VL-32B-Instruct` |
| **Architecture** | `Qwen3VLForConditionalGeneration` |
| **Context length** | 262,144 tokens |
| **Tensor type** | `bfloat16` |
| **Processor** | `Qwen3VLProcessor` / `AutoProcessor` |
| **Image processor** | `Qwen2VLImageProcessorFast` |
| **Training** | SFT on 50K synthetic reasoning examples + external SFT data (Luth, Smoltalk2) |
| **Merge strength** | α = 0.25 (task arithmetic with CPT + SFT vectors) |
| **Compute** | ~40K H100 hours (main training), ~100K H100 hours (project total incl. eval and data gen) |

---

## License

Apache License 2.0

---

## Citation

If you use this checkpoint, please cite both papers:

```bibtex
@misc{long_document_internalized_reasoning,
  title={Internalized Reasoning for Long-Context Visual Document Understanding},
  author={Austin Veselka},
  year={2026},
  eprint={2604.02371},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02371},
}

@misc{long_document_training,
  title={How to Train Your Long-Context Visual Document Model},
  author={Austin Veselka},
  year={2026},
  eprint={2602.15257},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.15257},
}
```

[EU](https://huggingface.co/lightonai)