---
license: apache-2.0
base_model: Qwen/Qwen3-VL-32B-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- qwen3_vl
- vision-language-model
- multimodal
- document-understanding
- long-context
- reasoning
- synthetic-data
- internalized-reasoning
- model-merging
- transformers
---
---
[LightOn](https://lighton.ai)
[LinkedIn](https://www.linkedin.com/company/lighton/)
[X](https://x.com/LightOnIO)
📄 [Paper](https://arxiv.org/abs/2604.02371) | 📝 [OriOn Blog](https://www.lighton.ai/lighton-blogs/introducing-orion) | 🔧 [Pipeline Code](https://github.com/lightonai/distilabel/tree/lc_sft_pipelines) | 📊 [Benchmark (MMLBD-C)](https://huggingface.co/datasets/lightonai/MMLBD-C) | 🪐 [OriOn Collection](https://huggingface.co/collections/lightonai/orion)
# OriOn-Qwen Synthetic Reasoning 1
**SOTA on MMLongBenchDoc (58.3), surpassing a 7x larger model.** This checkpoint extends [OriOn-Qwen](https://huggingface.co/lightonai/OriOn-Qwen) with synthetic reasoning traces that are internalized via low-strength model merging, achieving frontier long-document QA performance with no increase in inference cost.
## TL;DR
We introduce a synthetic reasoning pipeline for long-document VQA: score every page for question relevance, extract evidence, keep the top-K pages sorted by relevance, and use the result as a structured reasoning trace during SFT. Low-strength model merging (α=0.25) then **internalizes** the reasoning: the model does not generate explicit thinking tokens, yet retains the full performance benefit. A control token in the system prompt gates the capability at inference time. The result is a 32B model that beats `Qwen3-VL-235B-A22B-Instruct` on MMLongBenchDoc while producing only ~250 output tokens on average.
---
## Highlights
- **SOTA on MMLongBenchDoc** with 58.3 accuracy, surpassing `Qwen3-VL-235B-A22B-Instruct` (57.0) and `Qwen3-VL-235B-A22B-Thinking` (56.2) with roughly **7x fewer parameters**
- **Internalized reasoning** via low-strength model merging: no explicit thinking tokens are emitted, yet the full performance gain is retained
- **Controllable**: place the control token in the system prompt to activate reasoning (+3.8 on MMLBD with the token vs. without)
- **Drop-in replacement** for `Qwen/Qwen3-VL-32B-Instruct`: same `Qwen3VLForConditionalGeneration` + `AutoProcessor` API
---
## How It Works
This checkpoint builds on [OriOn](https://arxiv.org/abs/2602.15257) and extends it with synthetic reasoning traces ([paper](https://arxiv.org/abs/2604.02371)).
### Synthetic reasoning pipeline
Given a document of N pages and a question Q:
1. **Evidence extraction & scoring**: an extractor VLM (`Qwen3-VL-32B-Instruct`) processes each page independently, producing a relevance score ([0, 10]) and a natural-language evidence snippet.
2. **Top-K selection**: pages scoring below a relevance threshold are dropped; of the remaining pages, the top-K (default 24) are kept and sorted by relevance.
3. **Answer generation** through two parallel branches: a **visual branch** (teacher VLM receives top-ranked page images) and a **text branch** (teacher LLM receives only the extracted evidence). Training examples are drawn equally from both.
The relevance-sorted evidence is placed inside reasoning-trace tags, and the behavior is gated by a control token (present in 95% of training examples).
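The selection step above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual code: the `PageEvidence` record, the function names, the default `threshold=1.0`, and the plain-text trace layout are all assumptions made for the example. Only the score range ([0, 10]), the default K of 24, and the relevance ordering come from the description above.

```python
from dataclasses import dataclass


@dataclass
class PageEvidence:
    page: int      # 1-based page number
    score: float   # relevance score in [0, 10] from the extractor
    evidence: str  # natural-language evidence snippet


def select_top_k(pages, k=24, threshold=1.0):
    """Drop pages scoring below `threshold` (hypothetical default), keep the k best."""
    kept = [p for p in pages if p.score >= threshold]
    # list.sort is stable, so pages with equal scores keep their document order
    kept.sort(key=lambda p: p.score, reverse=True)
    return kept[:k]


def format_trace(selected):
    """Render the selected evidence, most relevant first (hypothetical layout)."""
    return "\n".join(
        f"Page {p.page} (score {p.score:g}): {p.evidence}" for p in selected
    )
```

Because the output is bounded by K and never lists irrelevant pages, the trace length does not grow with the number of pages in the document.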
### Internalization via model merging
The final checkpoint is produced by task arithmetic: `θ_merged = θ_base + α · (θ_SFT − θ_base)`. At α=0.25, the model does not emit thinking tokens and its mean output length is comparable to a non-reasoning baseline, yet it retains the full performance gains. Increasing α to 0.5 shifts the model to explicit reasoning with 12.4x more output tokens.
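The merge formula can be written as a short function. This is a sketch under stated assumptions: `task_arithmetic_merge` is a hypothetical helper, it covers only the single-vector formula given here, and it assumes both checkpoints expose the same parameter names. It works on any values that support `+`, `-`, and scalar `*`, for example floats or same-shaped tensors.

```python
def task_arithmetic_merge(base, sft, alpha=0.25):
    """Return theta_base + alpha * (theta_sft - theta_base) for every parameter.

    `base` and `sft` map parameter names to values supporting arithmetic.
    """
    if base.keys() != sft.keys():
        raise ValueError("base and SFT checkpoints must share parameter names")
    return {name: base[name] + alpha * (sft[name] - base[name]) for name in base}
```

With α=0 the result is the base model and with α=1 it is the SFT model; α=0.25 moves a quarter of the way along the task vector.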
### Why trace design matters
An earlier v1 pipeline visited every page sequentially and marked the irrelevant ones, which taught the model a pathological looping behavior. The v2 redesign (bounded top-K, relevance-ordered, no irrelevant-page markers) eliminates this failure mode and yields substantial gains across all primary metrics.
---
## Related
| Resource | Description |
| ---------------------------------------------------------------------------------- | ------------------------------------------------------------ |
| **[OriOn-Qwen](https://huggingface.co/lightonai/OriOn-Qwen)** | Base OriOn checkpoint (LongPO, no reasoning) |
| **[OriOn-Mistral](https://huggingface.co/lightonai/OriOn-Mistral)** | Mistral variant with +16.8% MMLBD improvement |
| **[MMLBD-C](https://huggingface.co/datasets/lightonai/MMLBD-C)** | Manually corrected MMLongBenchDoc benchmark |
| **[Pipeline Code](https://github.com/lightonai/distilabel/tree/lc_sft_pipelines)** | Synthetic reasoning pipeline (Apache 2.0 fork of distilabel) |
---
## Benchmarks
### Official MMLongBenchDoc leaderboard
| Model | Acc | Params |
| ----------------------------------------------- | -------- | ----------------- |
| **OriOn-Qwen-SR1 (this model)** | **58.3** | 32B |
| Qwen3-VL-235B-A22B-Instruct | 57.0 | 235B (22B active) |
| Qwen3-VL-235B-A22B-Thinking | 56.2 | 235B (22B active) |
| TeleMM-2.0 | 56.1 | – |
| Qwen3-VL-32B-Instruct | 55.4 | 32B |
| GLM-4.6V | 54.9 | 106B (12B active) |
| GPT-4o | 46.3 | – |
### Full benchmark suite (Qwen3-VL family)
Deltas are relative to the `Qwen3-VL-32B-Instruct` base model.
| Model | VA | LCA | MMLBD | MMLBD-C | MMLB 128K | SlideVQA | HELMET | DUDE |
| ----------------------------------------------- | --------------- | --------------- | --------------- | --------------- | ----------- | ----------- | --------------- | ----------- |
| 235B-A22B-Instruct | 98.4 | 98.5 | 54.8 | 56.2 | 78.6 | 84.5 | 67.6 | 59.1 |
| **OriOn-Qwen-SR1 (this model)** | **95.0** (+1.3) | **94.4** (+2.3) | **55.8** (+4.0) | **58.2** (+4.4) | 75.7 (+5.3) | 75.4 (-1.8) | **68.5** (+5.5) | 55.1 (-6.7) |
| LongPO (OriOn-Qwen) | 94.0 (+0.3) | 92.4 (+0.3) | 53.6 (+1.8) | 56.4 (+2.6) | 75.6 (+5.2) | 75.5 (-1.7) | 62.9 (-0.1) | 56.0 (-5.8) |
| 32B-Instruct (base) | 93.7 | 92.1 | 51.8 | 53.8 | 70.4 | 77.2 | 63.0 | 61.8 |
*VA = Visual-LC Average (MMLBD, MMLBD-C, MMLongBench, DUDE, SlideVQA). LCA = VA + HELMET + LongBench v2. See the [paper](https://arxiv.org/abs/2604.02371) for full results including Mistral, control-token ablations and trace-design comparisons.*
---
## Reasoning Behavior
Place the control token at the beginning of the system prompt to activate internalized reasoning. This improves performance with only a slight increase in output tokens.
```
System:
User: What is the average revenue growth across all subsidiaries mentioned in pages 12-45?
```
Without the control token, the model still works but performance degrades (e.g. -3.8 on MMLBD for Qwen). The model does **not** emit thinking tokens at α=0.25; the reasoning is internalized.
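One way to toggle the behavior programmatically is a small helper that prepends the control token to the system prompt. This is a sketch, not part of the released code: `with_control_token` is a hypothetical name, the token is passed in as an argument rather than hard-coded, system content is assumed to be a plain string, and joining the token to an existing system prompt with a newline is an assumption.

```python
def with_control_token(messages, control_token, enabled=True):
    """Return a copy of `messages` whose system prompt starts with `control_token`."""
    out = [dict(m) for m in messages]  # shallow copies; the input list is not modified
    if not enabled:
        return out
    if out and out[0]["role"] == "system":
        existing = out[0]["content"]  # assumed to be a plain string
        # Newline separator is an assumption for non-empty system prompts
        out[0]["content"] = f"{control_token}\n{existing}" if existing else control_token
    else:
        out.insert(0, {"role": "system", "content": control_token})
    return out
```

Passing `enabled=False` returns the messages unchanged, which makes it easy to compare outputs with and without the token on the same question.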
---
## Intended Use
This checkpoint is designed for:
- **Long PDF and slide-deck question answering** (up to 250+ pages in a single pass)
- **Multi-page document reasoning** requiring cross-page synthesis
- **Long-context visual document understanding** in enterprise, legal, scientific and financial domains
This is a research checkpoint that retains most of `Qwen/Qwen3-VL-32B-Instruct`'s general capabilities while significantly improving long-document performance.
---
## Usage with Transformers
This model uses the same API as `Qwen/Qwen3-VL-32B-Instruct`:
```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_id = "lightonai/OriOn-Qwen-SR1"

# device_map="auto" places the weights on the available device(s)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Multi-page document QA with reasoning
messages = [
    # Put the reasoning control token in the system prompt to activate reasoning
    {"role": "system", "content": ""},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "page1.png"},
            {"type": "image", "url": "page2.png"},
            # ... add all document pages
            {"type": "text", "text": "What are the key findings discussed across this document?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)  # move inputs to the device holding the model's first layers

output_ids = model.generate(**inputs, max_new_tokens=4096)
# Keep only the newly generated tokens (drop the prompt)
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
print(processor.decode(generated_ids, skip_special_tokens=True))
```
---
## Usage with vLLM
```bash
vllm serve lightonai/OriOn-Qwen-SR1
```
```python
import base64
import io

import requests
import pypdfium2 as pdfium

ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/OriOn-Qwen-SR1"

# Load and render a multi-page PDF
pdf_data = requests.get("https://arxiv.org/pdf/2412.13663").content
pdf = pdfium.PdfDocument(pdf_data)

# Convert pages to base64-encoded PNG images
page_images = []
for i in range(min(len(pdf), 50)):  # cap at 50 pages for this example
    pil_image = pdf[i].render(scale=2.77).to_pil()
    buffer = io.BytesIO()
    pil_image.save(buffer, format="PNG")
    b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
    page_images.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{b64}"},
    })

payload = {
    "model": MODEL,
    "messages": [
        # Put the reasoning control token in the system prompt to activate reasoning
        {"role": "system", "content": ""},
        {
            "role": "user",
            "content": [
                *page_images,
                {"type": "text", "text": "Summarize the main contributions of this paper."},
            ],
        },
    ],
    "max_tokens": 4096,
    "temperature": 0.2,
}

response = requests.post(ENDPOINT, json=payload)
response.raise_for_status()  # fail early on HTTP errors instead of on a missing key
print(response.json()["choices"][0]["message"]["content"])
```
---
## Model Details
| | |
| ------------------- | ----------------------------------------------------------------------------------------- |
| **Base model** | `Qwen/Qwen3-VL-32B-Instruct` |
| **Architecture** | `Qwen3VLForConditionalGeneration` |
| **Context length** | 262,144 tokens |
| **Tensor type** | `bfloat16` |
| **Processor** | `Qwen3VLProcessor` / `AutoProcessor` |
| **Image processor** | `Qwen2VLImageProcessorFast` |
| **Training** | SFT on 50K synthetic reasoning examples + external SFT data (Luth, Smoltalk2) |
| **Merge strength** | α = 0.25 (task arithmetic with CPT + SFT vectors) |
| **Compute** | ~40K H100 hours (main training), ~100K H100 hours (project total incl. eval and data gen) |
---
## License
Apache License 2.0
---
## Citation
If you use this checkpoint, please cite both papers:
```bibtex
@misc{long_document_internalized_reasoning,
title={Internalized Reasoning for Long-Context Visual Document Understanding},
author={Austin Veselka},
year={2026},
eprint={2604.02371},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2604.02371},
}
@misc{long_document_training,
title={How to Train Your Long-Context Visual Document Model},
author={Austin Veselka},
year={2026},
eprint={2602.15257},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2602.15257},
}
```
[EU](https://huggingface.co/lightonai)