rootlocalghost committed on
Commit
44e4b3f
·
verified ·
1 Parent(s): 24390d8

clone README.md

  1. README.md +341 -0
README.md ADDED
@@ -0,0 +1,341 @@
1
+ ---
+ license: apache-2.0
+ license_link: LICENSE
+ language:
+ - multilingual
+ tags:
+ - vision-language
+ - ocr
+ - document-intelligence
+ - qianfan
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ model-index:
+ - name: Qianfan-OCR
+   results:
+   - task:
+       type: document-parsing
+       name: Document Parsing
+     dataset:
+       name: OmniDocBench v1.5
+       type: opendatalab/OmniDocBench
+     metrics:
+     - type: overall
+       value: 93.12
+       name: Overall Score
+   - task:
+       type: ocr
+       name: OCR
+     dataset:
+       name: OlmOCR Bench
+       type: allenai/olmOCR-bench
+     metrics:
+     - type: accuracy
+       value: 79.8
+       name: Overall Score
+   - task:
+       type: ocr
+       name: OCR
+     dataset:
+       name: OCRBench
+       type: echo840/OCRBench
+     metrics:
+     - type: accuracy
+       value: 880
+       name: Score
+ ---
47
+
48
+ <div align="center">
49
+
50
+ <h1>Qianfan-OCR</h1>
51
+
52
+ <h3>A Unified End-to-End Model for Document Intelligence</h3>
53
+
54
+ [**🤖 Demo**](https://huggingface.co/spaces/baidu/Qianfan-OCR-Demo) |
55
+ [**📄 Technical Report**](https://arxiv.org/abs/2603.13398) |
56
+ [**🖥️ Qianfan Platform**](https://cloud.baidu.com/product-s/qianfan_home) |
57
+ [**💻 GitHub**](https://github.com/baidubce/Qianfan-VL) |
58
+ [**🧩 Skill**](https://github.com/baidubce/skills/tree/develop/skills/qianfanocr-document-intelligence)
59
+
60
+ </div>
61
+
62
+ ## Introduction
63
+
64
+ **Qianfan-OCR** is a **4B-parameter end-to-end document intelligence model** developed by the Baidu Qianfan Team. It unifies document parsing, layout analysis, and document understanding within a single vision-language architecture.
65
+
66
+ Unlike traditional multi-stage OCR pipelines that chain separate layout detection, text recognition, and language comprehension modules, Qianfan-OCR performs **direct image-to-Markdown conversion** and supports a broad range of prompt-driven tasks — from structured document parsing and table extraction to chart understanding, document question answering, and key information extraction — all within one model.
67
+
68
+ ### Key Highlights
69
+
70
+ - 🏆 **#1 End-to-End Model on OmniDocBench v1.5** — Achieves **93.12** overall score, surpassing DeepSeek-OCR-v2 (91.09), Gemini-3 Pro (90.33), and all other end-to-end models
71
+ - 🏆 **#1 End-to-End Model on OlmOCR Bench** — Scores **79.8**
72
+ - 🏆 **#1 on Key Information Extraction** — Overall mean score of **87.9** across five public KIE benchmarks, surpassing Gemini-3.1-Pro, Gemini-3-Pro, Seed-2.0, and Qwen3-VL-235B-A22B
73
+ - 🧠 **Layout-as-Thought** — An innovative optional thinking phase that recovers explicit layout analysis within the end-to-end paradigm via `⟨think⟩` tokens
74
+ - 🌍 **192 Languages** — Multilingual OCR support across diverse scripts
75
+ - ⚡ **Efficient Deployment** — Achieves **1.024 PPS** (pages per second) with W8A8 quantization on a single A100 GPU
76
+
77
+ ## Architecture
78
+
79
+ Qianfan-OCR adopts the multimodal bridging architecture from [Qianfan-VL](https://arxiv.org/abs/2509.18189), consisting of three core components:
80
+
81
+ | Component | Details |
82
+ |---|---|
83
+ | **Vision Encoder** | Qianfan-ViT, 24 Transformer layers, AnyResolution design (up to 4K), 256 visual tokens per 448×448 tile, max 4,096 tokens per image |
84
+ | **Language Model** | Qwen3-4B (3.6B non-embedding), 36 layers, 2560 hidden dim, GQA (32 query / 8 KV heads), 32K context (extendable to 131K) |
85
+ | **Cross-Modal Adapter** | 2-layer MLP with GELU activation, projecting from 1024-dim to 2560-dim |
86
+
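The figures in the Vision Encoder row imply a simple token budget: at 256 tokens per 448×448 tile, the 4,096-token cap corresponds to 16 tiles. The sketch below works through that arithmetic. The plain ceil-based grid and the name `visual_token_budget` are assumptions for illustration; the actual AnyResolution tiling may differ, for example by resizing to a preferred aspect ratio first.

```python
import math

TILE_SIZE = 448         # tile edge in pixels (from the table above)
TOKENS_PER_TILE = 256   # visual tokens per tile (from the table above)
MAX_TOKENS = 4096       # per-image cap (from the table above)


def visual_token_budget(width: int, height: int) -> dict:
    """Estimate tiles and visual tokens for an image of the given pixel size.

    Assumption: the image is covered by a ceil(width/448) x ceil(height/448)
    grid, and the tile count is clamped so the token cap is never exceeded.
    """
    cols = math.ceil(width / TILE_SIZE)
    rows = math.ceil(height / TILE_SIZE)
    max_tiles = MAX_TOKENS // TOKENS_PER_TILE  # 4096 // 256 = 16 tiles
    tiles = min(cols * rows, max_tiles)
    return {"tiles": tiles, "tokens": tiles * TOKENS_PER_TILE, "max_tiles": max_tiles}


# A 896x1344 page needs a 2x3 grid under this assumption.
print(visual_token_budget(896, 1344))
```

Under these assumptions, any page larger than a 4×4 grid of tiles hits the cap, so very large scans would be represented at reduced effective resolution.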
87
+ ### Layout-as-Thought
88
+
89
+ A key innovation is **Layout-as-Thought**: an optional thinking phase triggered by `⟨think⟩` tokens, where the model generates structured layout representations (bounding boxes, element types, reading order) before producing final outputs.
90
+
91
+ This mechanism serves two purposes:
92
+ 1. **Functional**: Recovers layout analysis capability within the end-to-end paradigm — users obtain structured layout results directly
93
+ 2. **Enhancement**: Provides targeted accuracy improvements on documents with complex layouts, cluttered elements, or non-standard reading orders
94
+
95
+ > **When to use**: Enable thinking for heterogeneous pages with mixed element types (exam papers, technical reports, newspapers). Disable for homogeneous documents (single-column text, simple forms) for better results and lower latency.
96
+
97
+ ## Benchmark Results
98
+
99
+ ### OmniDocBench v1.5 (Document Parsing)
100
+
101
+ | Model | Type | Overall ↑ | TextEdit ↓ | FormulaCDM ↑ | TableTEDs ↑ | TableTEDss ↑ | R-orderEdit ↓ |
102
+ |---|---|---|---|---|---|---|---|
103
+ | **Qianfan-OCR (Ours)** | End-to-end | **93.12** | **0.041** | **92.43** | **91.02** | **93.85** | **0.049** |
104
+ | DeepSeek-OCR-v2 | End-to-end | 91.09 | 0.048 | 90.31 | 87.75 | 92.06 | 0.057 |
105
+ | Gemini-3 Pro | End-to-end | 90.33 | 0.065 | 89.18 | 88.28 | 90.29 | 0.071 |
106
+ | Qwen3-VL-235B | End-to-end | 89.15 | 0.069 | 88.14 | 86.21 | 90.55 | 0.068 |
107
+ | dots.ocr | End-to-end | 88.41 | 0.048 | 83.22 | 86.78 | 90.62 | 0.053 |
108
+ | PaddleOCR-VL 1.5 | Pipeline | 94.50 | 0.035 | 94.21 | 92.76 | 95.79 | 0.042 |
109
+
+ > Bold marks the best result among end-to-end models. PaddleOCR-VL 1.5, a pipeline system, scores higher on every column and is listed for reference.
+
110
+ ### General OCR Benchmarks
111
+
112
+ | Model | OCRBench | OCRBenchv2 (en/zh) | CCOCR-multilan | CCOCR-overall |
113
+ |---|---|---|---|---|
114
+ | **Qianfan-OCR (Ours)** | **880** | 56.0 / **60.77** | **76.7** | **79.3** |
115
+ | Qwen3-VL-4B | 873 | **60.68** / 59.13 | 74.2 | 76.5 |
116
+ | MonkeyOCR | 655 | 21.78 / 38.91 | 43.8 | 35.2 |
117
+ | DeepSeek-OCR | 459 | 15.98 / 38.31 | 32.5 | 27.6 |
118
+
119
+ ### Document Understanding
120
+
121
+ | Benchmark | Qianfan-OCR | Qwen3-VL-4B | Qwen3-VL-2B |
122
+ |---|---|---|---|
123
+ | DocVQA | 92.8 | **94.9** | 92.7 |
124
+ | CharXiv_DQ | **94.0** | 81.8 | 69.7 |
125
+ | CharXiv_RQ | **85.2** | 48.5 | 41.3 |
126
+ | ChartQA | **88.1** | 83.3 | 78.3 |
127
+ | ChartQAPro | **42.9** | 36.2 | 24.5 |
128
+ | ChartBench | **85.9** | 74.9 | 73.2 |
129
+ | TextVQA | 80.0 | **81.8** | 79.9 |
130
+ | OCRVQA | **66.8** | 64.7 | 59.3 |
131
+
132
+ > 💡 Two-stage OCR+LLM systems score **0.0** on CharXiv (both DQ and RQ), demonstrating that chart structures discarded during text extraction are essential for reasoning.
133
+
134
+ ### Key Information Extraction (KIE)
135
+
136
+ | Model | Overall | OCRBench KIE | OCRBenchv2 KIE (en) | OCRBenchv2 KIE (zh) | CCOCR KIE | Nanonets KIE (F1) |
137
+ |---|---|---|---|---|---|---|
138
+ | **Qianfan-OCR (Ours)** | **87.9** | 95.0 | 82.8 | **82.3** | 92.8 | **86.5** |
139
+ | Qwen3-VL-235B-A22B | 84.2 | 94.0 | 85.6 | 62.9 | **95.1** | 83.8 |
140
+ | Qwen3-VL-4B | 83.5 | 89.0 | 82.1 | 71.3 | 91.6 | 83.3 |
141
+ | Gemini-3.1-Pro | 79.2 | **96.0** | **87.8** | 63.4 | 72.5 | 76.1 |
142
+
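The Overall column can be reproduced from the five per-benchmark columns. The snippet below is a small check for the Qianfan-OCR row, assuming Overall is the unweighted mean of the five scores, which matches the "overall mean score" wording in the highlights but is not spelled out for the table itself.

```python
# Per-benchmark KIE scores for Qianfan-OCR, copied from the table above.
qianfan_kie = {
    "OCRBench KIE": 95.0,
    "OCRBenchv2 KIE (en)": 82.8,
    "OCRBenchv2 KIE (zh)": 82.3,
    "CCOCR KIE": 92.8,
    "Nanonets KIE (F1)": 86.5,
}

# Assumption: every benchmark carries equal weight in the overall score.
overall = sum(qianfan_kie.values()) / len(qianfan_kie)
print(f"{overall:.2f}")
```

The mean comes to about 87.88, which rounds to the reported 87.9.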
143
+ ### Inference Throughput
144
+
145
+ | Model | PPS (pages/sec) |
146
+ |---|---|
147
+ | **Qianfan-OCR (W8A8)** | **1.024** |
148
+ | Qianfan-OCR (W16A16) | 0.503 |
149
+ | MinerU 2.5 | 1.057 |
150
+ | MonkeyOCR-pro-1.2B | 0.673 |
151
+ | Dots OCR | 0.352 |
152
+
153
+ *All throughput figures were measured on a single NVIDIA A100 GPU with vLLM 0.10.2.*
154
+
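To relate pages per second to job duration, the sketch below converts the reported PPS figures into wall-clock estimates. It assumes throughput stays constant for the whole job and ignores startup and I/O time; `seconds_for_pages` is a hypothetical helper, not part of any released tooling.

```python
# PPS values copied from the throughput table above.
PPS = {
    "Qianfan-OCR (W8A8)": 1.024,
    "Qianfan-OCR (W16A16)": 0.503,
}


def seconds_for_pages(pages: int, pps: float) -> float:
    """Estimated processing time, assuming steady-state throughput."""
    return pages / pps


# Relative gain from W8A8 quantization over the W16A16 baseline.
speedup = PPS["Qianfan-OCR (W8A8)"] / PPS["Qianfan-OCR (W16A16)"]

for name, pps in PPS.items():
    minutes = seconds_for_pages(1000, pps) / 60
    print(f"{name}: about {minutes:.1f} min per 1,000 pages")
print(f"W8A8 speedup over W16A16: {speedup:.2f}x")
```

On these numbers, W8A8 roughly doubles throughput relative to W16A16 on the same GPU.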
155
+ ## Supported Tasks
156
+
157
+ Qianfan-OCR supports a comprehensive set of document intelligence tasks through prompt-driven control:
158
+
159
+ | Task Category | Specific Tasks |
160
+ |---|---|
161
+ | **Document Parsing** | Image-to-Markdown conversion, multi-page parsing, structured output (JSON/HTML) |
162
+ | **Layout Analysis** | Bounding box detection, element type classification (25 categories), reading order |
163
+ | **Table Recognition** | Complex table extraction (merged cells, rotated tables), HTML output |
164
+ | **Formula Recognition** | Inline and display math formulas, LaTeX output |
165
+ | **Chart Understanding** | Chart QA, trend analysis, data extraction from various chart types |
166
+ | **Key Information Extraction** | Receipts, invoices, certificates, medical records, ID cards |
167
+ | **Handwriting Recognition** | Chinese and English handwritten text |
168
+ | **Scene Text Recognition** | Street signs, product labels, natural scene text |
169
+ | **Multilingual OCR** | 192 languages including Latin, Cyrillic, Arabic, South/Southeast Asian, CJK scripts |
170
+
171
+ ## Quick Start
172
+
173
+ ### Basic Usage
174
+
175
+ ```python
+ from transformers import AutoModelForImageTextToText, AutoProcessor
+ import torch
+ from PIL import Image
+
+ MODEL_PATH = "baidu/Qianfan-OCR"
+
+ # Load the weights in bfloat16 and place them on the available device(s).
+ model = AutoModelForImageTextToText.from_pretrained(
+     MODEL_PATH,
+     torch_dtype=torch.bfloat16,
+     device_map="auto",
+ ).eval()
+ processor = AutoProcessor.from_pretrained(MODEL_PATH)
+
+ image = Image.open("./examples/document.png").convert("RGB")
+ prompt = "Parse this document to Markdown."
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": prompt},
+         ],
+     },
+ ]
+
+ # Build the chat-formatted prompt and the image tensors in one call.
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ # Greedy decoding. Raise max_new_tokens for dense pages, since 512 tokens can truncate the output.
+ with torch.no_grad():
+     output_ids = model.generate(
+         **inputs,
+         max_new_tokens=512,
+         do_sample=False,
+     )
+
+ # Drop the prompt tokens so that only the newly generated text is decoded.
+ generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
+ response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(response)
+ ```
219
+
220
+ ### With Layout-as-Thought (Thinking Mode)
221
+
222
+ Enable thinking mode by passing `enable_thinking=True` to `apply_chat_template`. The model will first generate structured layout analysis (bounding boxes, element types, reading order), then produce the final output.
223
+
224
+ ```python
+ # Reuses `model`, `processor`, `torch`, and `Image` from Basic Usage.
+ image = Image.open("./examples/complex_document.jpg").convert("RGB")
+ prompt = "Parse this document to Markdown."
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": prompt},
+         ],
+     },
+ ]
+
+ # enable_thinking=True asks the chat template to open the layout-analysis phase.
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+     enable_thinking=True,
+ ).to(model.device)
+
+ # The layout analysis is generated before the final output, so allow a larger token budget.
+ with torch.no_grad():
+     output_ids = model.generate(
+         **inputs,
+         max_new_tokens=16384,
+         do_sample=False,
+     )
+
+ # Drop the prompt tokens so that only the newly generated text is decoded.
+ generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
+ response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(response)
+ ```
257
+
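When thinking mode is on, the response holds the layout analysis followed by the final output, so downstream code usually wants the two parts separately. The helper below is a minimal sketch of that split. It assumes the thinking phase is wrapped in literal `<think>` and `</think>` tags, which this README does not confirm; both tags are parameters so they can be changed to whatever the tokenizer actually emits. If the tags are special tokens, decoding with `skip_special_tokens=True` may strip them, in which case decode with `skip_special_tokens=False` before splitting.

```python
import re


def split_thinking(
    response: str,
    open_tag: str = "<think>",     # assumed tag format, not confirmed by the README
    close_tag: str = "</think>",   # assumed tag format, not confirmed by the README
) -> tuple[str, str]:
    """Return (layout_analysis, final_output) from a thinking-mode response.

    If no thinking block is found, the layout part is empty and the whole
    response is treated as the final output.
    """
    pattern = re.compile(
        re.escape(open_tag) + r"(.*?)" + re.escape(close_tag), re.DOTALL
    )
    match = pattern.search(response)
    if match is None:
        return "", response.strip()
    layout = match.group(1).strip()
    final_output = response[match.end():].strip()
    return layout, final_output


# Hypothetical response text, used only to show the split.
layout, markdown = split_thinking("<think>title; table; footer</think>\n# Report\n...")
print(layout)
print(markdown)
```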
258
+ ### Key Information Extraction
259
+
260
+ ```python
+ # Reuses `model`, `processor`, `torch`, and `Image` from Basic Usage.
+ image = Image.open("./examples/invoice.jpg").convert("RGB")
+ # The prompt is in Chinese. In English: "Extract the following fields from the image:
+ # name, date, total amount. Output in standard JSON format."
+ prompt = "请从图片中提取以下字段信息:姓名、日期、总金额。使用标准JSON格式输出。"
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": image},
+             {"type": "text", "text": prompt},
+         ],
+     },
+ ]
+
+ inputs = processor.apply_chat_template(
+     messages,
+     add_generation_prompt=True,
+     tokenize=True,
+     return_dict=True,
+     return_tensors="pt",
+ ).to(model.device)
+
+ with torch.no_grad():
+     output_ids = model.generate(
+         **inputs,
+         max_new_tokens=16384,
+         do_sample=False,
+     )
+
+ # Drop the prompt tokens so that only the newly generated text is decoded.
+ generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
+ response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(response)
+ ```
292
+
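Because the prompt asks for JSON, the response typically needs to be parsed before use. The helper below is a hedged sketch of that step. `parse_kie_json` is a hypothetical name, and the tolerance for surrounding prose or a Markdown code fence is a defensive assumption rather than documented model behavior.

```python
import json
import re

FENCE = "`" * 3  # Markdown code fence, built this way to avoid nesting fences here


def parse_kie_json(response: str) -> dict:
    """Extract the first JSON object from a model response.

    Handles three cases: bare JSON, JSON inside a Markdown code fence,
    and JSON preceded or followed by short explanatory text.
    """
    text = response.strip()
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model response")
    return json.loads(text[start : end + 1])


# Hypothetical response text, used only to show the parsing.
fields = parse_kie_json('Result: {"name": "A. Example", "total": "12.50"}')
print(fields["total"])
```

Validating the parsed keys against the requested field list is a sensible next step, since the model may omit fields that are absent from the image.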
293
+ ### vLLM Deployment
294
+
295
+ ```bash
296
+ # Serve with vLLM for high-throughput inference
297
+ vllm serve baidu/Qianfan-OCR --trust-remote-code --hf-overrides '{"architectures": ["InternVLChatModel"]}'
298
+ ```
299
+
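Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below builds such a request with the standard library only. It assumes the server listens on vLLM's default `http://localhost:8000`; the helper names, the `max_tokens` value, and the PNG MIME type are illustrative choices, not settings taken from this README.

```python
import base64
import json
import urllib.request


def build_chat_request(
    image_bytes: bytes,
    prompt: str,
    model: str = "baidu/Qianfan-OCR",
    mime: str = "image/png",     # assumed image type; adjust for JPEG input
    max_tokens: int = 4096,      # illustrative budget, not a documented value
) -> dict:
    """Build an OpenAI-style chat payload with the image inlined as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
        "temperature": 0,  # deterministic decoding, mirroring do_sample=False
    }


def send_chat_request(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to the server and return the generated text."""
    request = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage, with the server from the command above running:
#   with open("./examples/document.png", "rb") as f:
#       payload = build_chat_request(f.read(), "Parse this document to Markdown.")
#   print(send_chat_request(payload))
```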
300
+ ## Skill
301
+
302
+ We provide a [Qianfan OCR Document Intelligence](https://github.com/baidubce/skills/tree/develop/skills/qianfanocr-document-intelligence) skill for image and PDF understanding workflows.
303
+
304
+ It works with OpenClaw, Claude Code, Codex, and other assistants that support this skill format.
305
+ This skill packages reusable instructions, scripts, and references so the agent can automatically apply Qianfan-powered document intelligence to tasks such as:
306
+
307
+ - document parsing to Markdown
308
+ - layout analysis
309
+ - element recognition
310
+ - general OCR
311
+ - key information extraction
312
+ - chart understanding
313
+ - document VQA
314
+
315
+ The skill is designed for visual understanding tasks over images and PDFs, and includes the execution flow needed to prepare inputs, choose the right analysis mode, and call the bundled CLI tools.
316
+
317
+ ## Citation
318
+
319
+ ```bibtex
320
+ @misc{dong2026qianfanocrunifiedendtoendmodel,
321
+ title={Qianfan-OCR: A Unified End-to-End Model for Document Intelligence},
322
+ author={Daxiang Dong and Mingming Zheng and Dong Xu and Chunhua Luo and Bairong Zhuang and Yuxuan Li and Ruoyun He and Haoran Wang and Wenyu Zhang and Wenbo Wang and Yicheng Wang and Xue Xiong and Ayong Zheng and Xiaoying Zuo and Ziwei Ou and Jingnan Gu and Quanhao Guo and Jianmin Wu and Dawei Yin and Dou Shen},
323
+ year={2026},
324
+ eprint={2603.13398},
325
+ archivePrefix={arXiv},
326
+ primaryClass={cs.CV},
327
+ url={https://arxiv.org/abs/2603.13398},
328
+ }
329
+ ```
330
+
331
+ ## Acknowledgments
332
+
333
+ We thank the Baidu AI Cloud team for infrastructure support, the Baige and Kunlun teams for AI infrastructure assistance, and all contributors to the Qianfan platform.
334
+
335
+ ## License
336
+
337
+ This project is licensed under the Apache License 2.0. See `LICENSE` for the
338
+ full license text.
339
+
340
+ Some bundled third-party source files are licensed under the MIT License. See
341
+ `NOTICE` for the file list and corresponding attribution details.