---

license: apache-2.0
base_model: Qwen/Qwen3-VL-32B-Instruct
library_name: transformers
pipeline_tag: image-text-to-text
tags:

- qwen3_vl
- vision-language-model
- multimodal
- document-understanding
- long-context
- reasoning
- synthetic-data
- internalized-reasoning
- model-merging
- transformers

---

<div align="center">
  <img src="Orion-Qwen.png" alt="OriOn-Qwen-SR1 Banner" width="600"/>
</div>

---


<div align="center">

[![Website](https://img.shields.io/badge/LightOn-Website-blue?logo=google-chrome)](https://lighton.ai)
[![LinkedIn](https://img.shields.io/badge/LightOn-LinkedIn-0A66C2?logo=linkedin)](https://www.linkedin.com/company/lighton/)
[![X](https://img.shields.io/badge/@LightOnIO-X-black?logo=x)](https://x.com/LightOnIO)

📄 [Paper](https://arxiv.org/abs/2604.02371) | 📝 [OriOn Blog](https://www.lighton.ai/lighton-blogs/introducing-orion) | 🔧 [Pipeline Code](https://github.com/lightonai/distilabel/tree/lc_sft_pipelines) | 📊 [Benchmark (MMLBD-C)](https://huggingface.co/datasets/lightonai/MMLBD-C) | 🪐 [OriOn Collection](https://huggingface.co/collections/lightonai/orion)

</div>


# OriOn-Qwen Synthetic Reasoning 1

**SOTA on MMLongBenchDoc (58.3), surpassing a 7x larger model.** This checkpoint extends [OriOn-Qwen](https://huggingface.co/lightonai/OriOn-Qwen) with synthetic reasoning traces that are internalized via low-strength model merging, achieving frontier long-document QA performance with no increase in inference cost.

## TL;DR

We introduce a synthetic reasoning pipeline for long-document VQA: score every page for question relevance, extract evidence, keep the top-K pages sorted by relevance, and use the resulting evidence as a structured `<think>` trace during SFT. Low-strength model merging (α=0.25) then **internalizes** the reasoning: the model does not generate explicit thinking tokens, yet retains the full performance benefit. A `<cot>` control token gates the capability at inference time. The result is a 32B model that beats `Qwen3-VL-235B-A22B-Instruct` on MMLongBenchDoc while producing only ~250 mean output tokens.

---

## Highlights

- **SOTA on MMLongBenchDoc** with 58.3 accuracy, surpassing `Qwen3-VL-235B-A22B-Instruct` (57.0) and `Qwen3-VL-235B-A22B-Thinking` (56.2) with **7x fewer parameters**
- **Internalized reasoning** via low-strength model merging: no `<think>` tokens emitted, yet full performance retained
- **Controllable**: place `<cot>` in the system prompt to activate reasoning (+3.8 MMLBD when on vs. off)
- **Drop-in replacement** for `Qwen/Qwen3-VL-32B-Instruct`: same `Qwen3VLForConditionalGeneration` + `AutoProcessor` API

---

## How It Works

This checkpoint builds on [OriOn](https://arxiv.org/abs/2602.15257) and extends it with synthetic reasoning traces ([paper](https://arxiv.org/abs/2604.02371)).

### Synthetic reasoning pipeline

Given a document of N pages and a question Q:

1. **Evidence extraction & scoring**: an extractor VLM (`Qwen3-VL-32B-Instruct`) processes each page independently, producing a relevance score ([0, 10]) and a natural-language evidence snippet.
2. **Top-K selection**: pages scoring below a relevance threshold are dropped; the top-K remaining pages (default K = 24) are kept and sorted by relevance.
3. **Answer generation** through two parallel branches: a **visual branch** (teacher VLM receives top-ranked page images) and a **text branch** (teacher LLM receives only the extracted evidence). Training examples are drawn equally from both.

The relevance-sorted evidence is placed inside `<think>` tags, gated by a `<cot>` control token (present in 95% of training examples).
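The selection step described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the `PageEvidence` container, the `min_score` threshold value and the text layout of the trace are assumptions made for this example. Only the [0, 10] score range, the default K of 24, the relevance ordering and the `<think>` wrapper come from the description above.

```python
from dataclasses import dataclass


@dataclass
class PageEvidence:
    """Per-page extractor output (hypothetical container for this sketch)."""
    page: int       # 1-based page number
    score: float    # relevance score in [0, 10]
    evidence: str   # natural-language evidence snippet


def build_think_trace(pages, top_k=24, min_score=1.0):
    """Drop low-scoring pages, keep the top-K, and order them by relevance.

    `min_score` is an assumed threshold; the real value is not stated here.
    """
    kept = [p for p in pages if p.score >= min_score]
    # Highest relevance first; ties are broken by page order for determinism.
    kept.sort(key=lambda p: (-p.score, p.page))
    kept = kept[:top_k]
    # Assumed trace layout: one line per retained page.
    lines = [f"Page {p.page} (relevance {p.score:g}): {p.evidence}" for p in kept]
    return "<think>\n" + "\n".join(lines) + "\n</think>"


pages = [
    PageEvidence(1, 0.0, "Title page, nothing relevant."),
    PageEvidence(2, 9.0, "Table 3 reports revenue growth per subsidiary."),
    PageEvidence(3, 4.5, "Mentions the fiscal year used in Table 3."),
]
trace = build_think_trace(pages, top_k=2)
print(trace)
```

In this toy input, page 1 falls below the assumed threshold and is dropped, so the trace lists page 2 followed by page 3. Irrelevant pages never appear in the trace, which keeps its length bounded by K regardless of document size.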

### Internalization via model merging

The final checkpoint is produced by task arithmetic: `θ_merged = θ_base + α · (θ_SFT − θ_base)`. At α=0.25, the model does not emit thinking tokens and its mean output length is comparable to a non-reasoning baseline, yet it retains the full performance gains. Increasing α to 0.5 shifts the model to explicit reasoning with 12.4x more output tokens.
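The merge formula above can be illustrated with a small sketch. It assumes a single SFT task vector and uses toy scalar "checkpoints"; the function name `merge_task_arithmetic` is hypothetical and this is not the merging code used to produce the checkpoint. The same expression applies elementwise if the values are tensors from two state dicts with identical keys.

```python
def merge_task_arithmetic(base, sft, alpha=0.25):
    """Return theta_base + alpha * (theta_SFT - theta_base) per parameter.

    `base` and `sft` map parameter names to values that support + and *
    (plain floats here; tensors from a state dict behave the same way).
    """
    if base.keys() != sft.keys():
        raise ValueError("base and SFT checkpoints must share parameter names")
    return {name: base[name] + alpha * (sft[name] - base[name]) for name in base}


# Toy "checkpoints" with two scalar parameters each.
theta_base = {"w": 1.0, "b": 0.0}
theta_sft = {"w": 3.0, "b": -2.0}

merged = merge_task_arithmetic(theta_base, theta_sft, alpha=0.25)
print(merged)  # {'w': 1.5, 'b': -0.5}
```

Setting `alpha=0.0` returns the base weights unchanged and `alpha=1.0` returns the SFT weights, so α interpolates between the two; the low value of 0.25 moves the model only a quarter of the way toward the SFT checkpoint.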

### Why trace design matters

An earlier v1 pipeline visited every page sequentially and explicitly marked irrelevant ones, which taught the model a pathological looping behavior. The v2 redesign (bounded top-K, relevance-ordered, no irrelevant-page markers) eliminates this failure mode and yields substantial gains across all primary metrics.

---

## Related


| Resource                                                                           | Description                                                  |
| ---------------------------------------------------------------------------------- | ------------------------------------------------------------ |
| **[OriOn-Qwen](https://huggingface.co/lightonai/OriOn-Qwen)**                      | Base OriOn checkpoint (LongPO, no reasoning)                 |
| **[OriOn-Mistral](https://huggingface.co/lightonai/OriOn-Mistral)**                | Mistral variant with +16.8% MMLBD improvement                |
| **[MMLBD-C](https://huggingface.co/datasets/lightonai/MMLBD-C)**                   | Manually corrected MMLongBenchDoc benchmark                  |
| **[Pipeline Code](https://github.com/lightonai/distilabel/tree/lc_sft_pipelines)** | Synthetic reasoning pipeline (Apache 2.0 fork of distilabel) |


---

## Benchmarks

### Official MMLongBenchDoc leaderboard


| Model                                           | Acc      | Params            |
| ----------------------------------------------- | -------- | ----------------- |
| **OriOn-Qwen-SR1 (this model)**                 | **58.3** | 32B               |
| Qwen3-VL-235B-A22B-Instruct                     | 57.0     | 235B (22B active) |
| Qwen3-VL-235B-A22B-Thinking                     | 56.2     | 235B (22B active) |
| TeleMM-2.0                                      | 56.1     | –                 |
| Qwen3-VL-32B-Instruct                           | 55.4     | 32B               |
| GLM-4.6V                                        | 54.9     | 106B (12B active) |
| GPT-4o                                          | 46.3     | –                 |


### Full benchmark suite (Qwen3-VL family)

Deltas are relative to the `Qwen3-VL-32B-Instruct` base model.


| Model                                           | VA              | LCA             | MMLBD           | MMLBD-C         | MMLB 128K   | SlideVQA    | HELMET          | DUDE        |
| ----------------------------------------------- | --------------- | --------------- | --------------- | --------------- | ----------- | ----------- | --------------- | ----------- |
| 235B-A22B-Instruct                              | 98.4            | 98.5            | 54.8            | 56.2            | 78.6        | 84.5        | 67.6            | 59.1        |
| **OriOn-Qwen-SR1 (this model)**                 | **95.0** (+1.3) | **94.4** (+2.3) | **55.8** (+4.0) | **58.2** (+4.4) | 75.7 (+5.3) | 75.4 (-1.8) | **68.5** (+5.5) | 55.1 (-6.7) |
| LongPO (OriOn-Qwen)                             | 94.0 (+0.3)     | 92.4 (+0.3)     | 53.6 (+1.8)     | 56.4 (+2.6)     | 75.6 (+5.2) | 75.5 (-1.7) | 62.9 (-0.1)     | 56.0 (-5.8) |
| 32B-Instruct (base)                             | 93.7            | 92.1            | 51.8            | 53.8            | 70.4        | 77.2        | 63.0            | 61.8        |


*VA = Visual-LC Average (MMLBD, MMLBD-C, MMLongBench, DUDE, SlideVQA). LCA = VA + HELMET + LongBench v2. See the [paper](https://arxiv.org/abs/2604.02371) for full results including Mistral, control-token ablations and trace-design comparisons.*

---

## Reasoning Behavior

Place `<cot>` at the beginning of the system prompt to activate internalized reasoning. This improves performance with only a slight increase in output tokens.

```
System: <cot>
User: What is the average revenue growth across all subsidiaries mentioned in pages 12-45?
```

Without `<cot>`, the model still works but performance degrades (e.g. -3.8 MMLBD for Qwen). The model does **not** emit `<think>` tokens at α=0.25; the reasoning is internalized.
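For comparing the two modes, a small helper can toggle the control token. This is a sketch under assumptions: the name `build_messages` is hypothetical, and omitting the system message entirely when reasoning is off is one possible choice rather than a documented requirement. The message layout follows the Transformers chat format used elsewhere in this card.

```python
def build_messages(question, page_urls, use_cot=True):
    """Build a chat request; `<cot>` in the system prompt turns reasoning on."""
    content = [{"type": "image", "url": url} for url in page_urls]
    content.append({"type": "text", "text": question})
    messages = [{"role": "user", "content": content}]
    if use_cot:
        # The control token goes at the very start of the system prompt.
        messages.insert(0, {"role": "system", "content": "<cot>"})
    return messages


with_cot = build_messages("What is the total revenue?", ["page1.png", "page2.png"])
without_cot = build_messages("What is the total revenue?", ["page1.png"], use_cot=False)
print(with_cot[0])        # {'role': 'system', 'content': '<cot>'}
print(len(without_cot))   # 1
```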

---

## Intended Use

This checkpoint is designed for:

- **Long PDF and slide-deck question answering** (up to 250+ pages in a single pass)
- **Multi-page document reasoning** requiring cross-page synthesis
- **Long-context visual document understanding** in enterprise, legal, scientific and financial domains

This is a research checkpoint that retains most of `Qwen/Qwen3-VL-32B-Instruct`'s general capabilities while significantly improving long-document performance.

---

## Usage with Transformers

This model uses the same API as `Qwen/Qwen3-VL-32B-Instruct`:

```python
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_id = "lightonai/OriOn-Qwen-SR1"
device = "cuda" if torch.cuda.is_available() else "cpu"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Multi-page document QA with <cot> reasoning
messages = [
    {"role": "system", "content": "<cot>"},
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "page1.png"},
            {"type": "image", "url": "page2.png"},
            # ... add all document pages
            {"type": "text", "text": "What are the key findings discussed across this document?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(device)

output_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids = output_ids[0, inputs["input_ids"].shape[1]:]
print(processor.decode(generated_ids, skip_special_tokens=True))
```

---

## Usage with vLLM

```bash
vllm serve lightonai/OriOn-Qwen-SR1
```

```python
import base64
import io
import requests
import pypdfium2 as pdfium

ENDPOINT = "http://localhost:8000/v1/chat/completions"
MODEL = "lightonai/OriOn-Qwen-SR1"

# Load and render a multi-page PDF
pdf_data = requests.get("https://arxiv.org/pdf/2412.13663").content
pdf = pdfium.PdfDocument(pdf_data)

# Convert pages to base64 images
page_images = []
for i in range(min(len(pdf), 50)):  # cap at 50 pages for this example
    pil_image = pdf[i].render(scale=2.77).to_pil()
    buffer = io.BytesIO()
    pil_image.save(buffer, format="PNG")
    b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")
    page_images.append({
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{b64}"},
    })

payload = {
    "model": MODEL,
    "messages": [
        {"role": "system", "content": "<cot>"},
        {
            "role": "user",
            "content": [
                *page_images,
                {"type": "text", "text": "Summarize the main contributions of this paper."},
            ],
        },
    ],
    "max_tokens": 4096,
    "temperature": 0.2,
}

response = requests.post(ENDPOINT, json=payload)
response.raise_for_status()  # surface HTTP errors before parsing the body
print(response.json()["choices"][0]["message"]["content"])
```

---

## Model Details


|                     |                                                                                           |
| ------------------- | ----------------------------------------------------------------------------------------- |
| **Base model**      | `Qwen/Qwen3-VL-32B-Instruct`                                                              |
| **Architecture**    | `Qwen3VLForConditionalGeneration`                                                         |
| **Context length**  | 262,144 tokens                                                                            |
| **Tensor type**     | `bfloat16`                                                                                |
| **Processor**       | `Qwen3VLProcessor` / `AutoProcessor`                                                      |
| **Image processor** | `Qwen2VLImageProcessorFast`                                                               |
| **Training**        | SFT on 50K synthetic reasoning examples + external SFT data (Luth, Smoltalk2)             |
| **Merge strength**  | α = 0.25 (task arithmetic with CPT + SFT vectors)                                         |
| **Compute**         | ~40K H100 hours (main training), ~100K H100 hours (project total incl. eval and data gen) |


---

## License

Apache License 2.0

---

## Citation

If you use this checkpoint, please cite both papers:

```bibtex
@misc{long_document_internalized_reasoning,
  title={Internalized Reasoning for Long-Context Visual Document Understanding}, 
  author={Austin Veselka},
  year={2026},
  eprint={2604.02371},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2604.02371}, 
}

@misc{long_document_training,
  title={How to Train Your Long-Context Visual Document Model},
  author={Austin Veselka},
  year={2026},
  eprint={2602.15257},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.15257},
}
```
