Sync README + screenshots + chart; bump license threshold to $5M
- .gitattributes +23 -0
- LICENSE +2 -2
- README.md +411 -81
- corporate.png +3 -0
- corporate_layout.png +3 -0
- corporate_reading.png +3 -0
- corporate_tablerec.png +3 -0
- corporate_text.png +3 -0
- excerpt_text.png +2 -2
- form.png +3 -0
- form_layout.png +3 -0
- form_reading.png +3 -0
- form_tablerec.png +3 -0
- form_text.png +3 -0
- handwritten.png +3 -0
- handwritten_layout.png +3 -0
- handwritten_reading.png +3 -0
- handwritten_tablerec.png +3 -0
- handwritten_text.png +3 -0
- newspaper.png +3 -0
- newspaper_layout.png +3 -0
- newspaper_reading.png +3 -0
- newspaper_text.png +3 -0
- olmocr_size_chart.png +2 -2
- textbook.png +3 -0
- textbook_layout.png +3 -0
- textbook_reading.png +3 -0
- textbook_text.png +3 -0
.gitattributes
CHANGED
|
@@ -38,3 +38,26 @@ excerpt_text.png filter=lfs diff=lfs merge=lfs -text
|
|
| 38 |
excerpt_layout.png filter=lfs diff=lfs merge=lfs -text
|
| 39 |
scanned_tablerec.png filter=lfs diff=lfs merge=lfs -text
|
| 40 |
olmocr_size_chart.png filter=lfs diff=lfs merge=lfs -text
|
| 38 |
excerpt_layout.png filter=lfs diff=lfs merge=lfs -text
|
| 39 |
scanned_tablerec.png filter=lfs diff=lfs merge=lfs -text
|
| 40 |
olmocr_size_chart.png filter=lfs diff=lfs merge=lfs -text
|
| 41 |
+
newspaper.png filter=lfs diff=lfs merge=lfs -text
|
| 42 |
+
newspaper_text.png filter=lfs diff=lfs merge=lfs -text
|
| 43 |
+
newspaper_layout.png filter=lfs diff=lfs merge=lfs -text
|
| 44 |
+
newspaper_reading.png filter=lfs diff=lfs merge=lfs -text
|
| 45 |
+
textbook.png filter=lfs diff=lfs merge=lfs -text
|
| 46 |
+
textbook_text.png filter=lfs diff=lfs merge=lfs -text
|
| 47 |
+
textbook_layout.png filter=lfs diff=lfs merge=lfs -text
|
| 48 |
+
textbook_reading.png filter=lfs diff=lfs merge=lfs -text
|
| 49 |
+
form.png filter=lfs diff=lfs merge=lfs -text
|
| 50 |
+
form_text.png filter=lfs diff=lfs merge=lfs -text
|
| 51 |
+
form_layout.png filter=lfs diff=lfs merge=lfs -text
|
| 52 |
+
form_reading.png filter=lfs diff=lfs merge=lfs -text
|
| 53 |
+
form_tablerec.png filter=lfs diff=lfs merge=lfs -text
|
| 54 |
+
handwritten.png filter=lfs diff=lfs merge=lfs -text
|
| 55 |
+
handwritten_text.png filter=lfs diff=lfs merge=lfs -text
|
| 56 |
+
handwritten_layout.png filter=lfs diff=lfs merge=lfs -text
|
| 57 |
+
handwritten_reading.png filter=lfs diff=lfs merge=lfs -text
|
| 58 |
+
handwritten_tablerec.png filter=lfs diff=lfs merge=lfs -text
|
| 59 |
+
corporate.png filter=lfs diff=lfs merge=lfs -text
|
| 60 |
+
corporate_text.png filter=lfs diff=lfs merge=lfs -text
|
| 61 |
+
corporate_layout.png filter=lfs diff=lfs merge=lfs -text
|
| 62 |
+
corporate_reading.png filter=lfs diff=lfs merge=lfs -text
|
| 63 |
+
corporate_tablerec.png filter=lfs diff=lfs merge=lfs -text
|
LICENSE
CHANGED
|
@@ -53,7 +53,7 @@ As conditions to the Licenses set forth in this Agreement, You agree not to use,
|
|
| 53 |
(a) In any way that violates any applicable national, federal, state, local or international law or regulation; or
|
| 54 |
(b) to directly or indirectly infringe or misappropriate any third party intellectual property rights (including those of Licensor or any Contributor)
|
| 55 |
2. Commercial:
|
| 56 |
-
(a) for any purpose if You (your employer, or the entity you are affiliated with) generated more than
|
| 57 |
-
(b) for any purpose if You (your employer, or the entity you are affiliated with) has raised more than
|
| 58 |
(c) for any purpose if You (your employer, or the entity you are affiliated with) provides or otherwise makes available any product or service that competes with any product or service offered by or made available by Licensor or any of its affiliates.
|
| 59 |
Commercial and broader use licenses may be available from Licensor at the following URL: https://www.datalab.to/
|
|
|
|
| 53 |
(a) In any way that violates any applicable national, federal, state, local or international law or regulation; or
|
| 54 |
(b) to directly or indirectly infringe or misappropriate any third party intellectual property rights (including those of Licensor or any Contributor)
|
| 55 |
2. Commercial:
|
| 56 |
+
(a) for any purpose if You (your employer, or the entity you are affiliated with) generated more than five million US Dollars ($5,000,000) in gross revenue in the prior year, except where Your Use is limited to personal use or research purposes;
|
| 57 |
+
(b) for any purpose if You (your employer, or the entity you are affiliated with) has raised more than five million US dollars ($5,000,000) in total equity or debt funding from any source, except where Your Use is limited to personal use or research purposes; or
|
| 58 |
(c) for any purpose if You (your employer, or the entity you are affiliated with) provides or otherwise makes available any product or service that competes with any product or service offered by or made available by Licensor or any of its affiliates.
|
| 59 |
Commercial and broader use licenses may be available from Licensor at the following URL: https://www.datalab.to/
|
README.md
CHANGED
|
@@ -1,143 +1,473 @@
|
|
| 1 |
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
-
|
| 8 |
-
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
--
|
|
|
|
|
|
|
| 13 |
|
| 14 |
-
|
| 15 |
|
| 16 |
-
|
| 17 |
|
| 18 |
-
|
| 19 |
|
| 20 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 21 |
|
| 22 |
-
#
|
| 23 |
|
| 24 |
-
##
|
| 25 |
|
| 26 |
<img src="olmocr_size_chart.png" width="700"/>
|
| 27 |
|
| 28 |
-
| Model | Params | Score |
|
| 29 |
-
|-----------------------------|----------:|---------:|
|
| 30 |
-
| Infinity-Parser2-Pro | 35.1B | 87.6 |
|
| 31 |
-
| Chandra OCR 2 (Datalab) | 5.3B | 85.9 |
|
| 32 |
-
| dots.mocr | 3.0B | 83.9 |
|
| 33 |
-
| LightOnOCR 2-1B \* | 1.0B | 83.2 |
|
| 34 |
-
| **Surya OCR 2** (Datalab) | **0.69B** | **83.1** |
|
| 35 |
-
| Chandra OCR 1 (Datalab) | 9.0B | 83.1 |
|
| 36 |
-
| olmOCR (anchored) | 8.3B | 77.4 |
|
| 37 |
-
| GOT OCR | 0.6B | 48.3 |
|
| 38 |
|
| 39 |
-
|
|
|
|
|
|
|
| 40 |
|
| 41 |
-
|
|
|
|
|
|
|
| 42 |
|
| 43 |
-
| ArXiv | Base | Hdr/Ftr | TinyTxt | MultCol | OldScan | OldMath | Tables |
|
| 44 |
-
|------:|-----:|--------:|--------:|--------:|--------:|--------:|-------:|
|
| 45 |
-
| 88.7 | 99.9 | 92.1 | 86.4 | 82.6 | 42.8 | 85.8 | 86.6 |
|
| 46 |
|
| 47 |
-
|
| 48 |
|
| 49 |
-
|
| 50 |
-
- Line-level text detection (separate small torch model)
|
| 51 |
-
- Layout analysis (Caption / Section-Header / Table / Equation / etc.) with reading order
|
| 52 |
-
- Table recognition: rows + columns (simple mode) or full `<table>` HTML with spanning cells (full mode)
|
| 53 |
-
- Inline math in `<math>…</math>` tags (KaTeX-compatible LaTeX) — no separate LaTeX OCR pass
|
| 54 |
-
- Two backends: `vllm` for NVIDIA GPUs, `llama.cpp` for Apple Silicon / CPU
|
| 55 |
|
| 56 |
-
|
| 57 |
-
|:---:|:---:|
|
| 58 |
-
| <img src="excerpt.png" width="320"/> | <img src="excerpt_text.png" width="320"/> |
|
| 59 |
-
| <img src="excerpt_layout.png" width="320"/> | <img src="scanned_tablerec.png" width="320"/> |
|
| 60 |
|
| 61 |
-
#
|
|
|
|
|
|
|
| 62 |
|
| 63 |
```shell
|
| 64 |
pip install surya-ocr
|
| 65 |
-
surya_ocr path/to/document.pdf # writes results.json with layout + text per page
|
| 66 |
```
|
| 67 |
|
| 68 |
-
|
| 69 |
|
| 70 |
```shell
|
|
|
|
| 71 |
surya_gui
|
| 72 |
```
|
| 73 |
|
| 74 |
-
|
| 75 |
|
| 76 |
-
|
| 77 |
|
| 78 |
```python
|
| 79 |
from PIL import Image
|
| 80 |
from surya.inference import SuryaInferenceManager
|
| 81 |
from surya.recognition import RecognitionPredictor
|
| 82 |
|
| 83 |
-
manager = SuryaInferenceManager()
|
| 84 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 85 |
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
```
|
| 90 |
|
| 91 |
-
|
| 92 |
|
| 93 |
```python
|
|
|
|
|
|
|
| 94 |
from surya.layout import LayoutPredictor
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
```
|
| 99 |
|
| 100 |
-
Table
|
| 101 |
|
| 102 |
```python
|
|
|
|
|
|
|
| 103 |
from surya.table_rec import TableRecPredictor
|
| 104 |
-
table = TableRecPredictor(manager)
|
| 105 |
|
| 106 |
-
|
| 107 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 108 |
```
|
| 109 |
|
| 110 |
## Throughput
|
| 111 |
|
| 112 |
-
Full-page OCR, 96 DPI input (~2,400 output tokens/page average), measured
|
|
|
|
| 113 |
|
| 114 |
### RTX 5090 (vllm)
|
| 115 |
|
| 116 |
-
`vllm/vllm-openai:v0.20.1`, single RTX 5090 (32 GB).
|
| 117 |
-
|
| 118 |
-
| Concurrency | Pages/s | Tokens/s | p50 (ms) | p95 (ms) | avg tok/page |
|
| 119 |
-
|---:|---:|---:|---:|---:|---:|
|
| 120 |
-
| 32 | 3.67 | 8,870 | 6,744 | 21,741 | 2,420 |
|
| 121 |
-
| 64 | 4.67 | 11,280 | 10,741 | 34,639 | 2,414 |
|
| 122 |
-
| **128** | **5.35** | **12,884** | 18,915 | 42,538 | 2,410 |
|
| 123 |
|
| 124 |
-
|
|
|
|
|
|
|
| 125 |
|
| 126 |
### Apple Silicon (llama.cpp / Metal)
|
| 127 |
|
| 128 |
`llama-server` with Metal backend.
|
| 129 |
|
| 130 |
-
| `--parallel` |
|
| 131 |
-
|---:|---:|---:|---:|---:|---:|---:|
|
| 132 |
-
|
| 133 |
|
| 134 |
-
|
| 135 |
|
| 136 |
-
|
| 137 |
|
| 138 |
-
|
| 139 |
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
|
| 143 |
-
|
| 1 |
+
<h1 align="center">Datalab</h1>
|
| 2 |
+
<p align="center">
|
| 3 |
+
<strong>State of the Art models for Document Intelligence</strong>
|
| 4 |
+
</p>
|
| 5 |
+
<p align="center">
|
| 6 |
+
<a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/Code%20License-Apache--2.0-green.svg" alt="Code License"></a>
|
| 7 |
+
<a href="https://www.datalab.to/pricing"><img src="https://img.shields.io/badge/Model%20License-OpenRAIL--M-blue.svg" alt="Model License"></a>
|
| 8 |
+
<a href="https://discord.gg/KuZwXNGnfH"><img src="https://img.shields.io/badge/Discord-Join%20us-5865F2?logo=discord&logoColor=white" alt="Discord"></a>
|
| 9 |
+
</p>
|
| 10 |
+
<p align="center">
|
| 11 |
+
<a href="https://www.datalab.to"><img src="https://img.shields.io/badge/Homepage-datalab.to-blue" alt="Homepage"></a>
|
| 12 |
+
<a href="https://documentation.datalab.to"><img src="https://img.shields.io/badge/Docs-Read%20the%20docs-blue" alt="Docs"></a>
|
| 13 |
+
<a href="https://www.datalab.to/playground"><img src="https://img.shields.io/badge/Datalab%20Playground-Try%20it-orange" alt="Datalab Playground"></a>
|
| 14 |
+
</p>
|
| 15 |
|
| 16 |
+
<hr/>
|
| 17 |
|
| 18 |
+
# Surya
|
| 19 |
|
| 20 |
+
Surya is an OCR toolkit powered by a 650M param model that does:
|
| 21 |
|
| 22 |
+
- Full-page OCR, scoring 83.3% on [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) (top under 3B params)
|
| 23 |
+
- Multilingual OCR - scores 87.2% on an internal benchmark set of 91 languages (more [here](#multilingual))
|
| 24 |
+
- Line-level text detection
|
| 25 |
+
- Layout analysis (table, image, header, etc.) with reading order
|
| 26 |
+
- Table recognition (rows + columns)
|
| 27 |
|
| 28 |
+
It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmarks)).
|
| 29 |
|
| 30 |
+
## Try Datalab's Managed Platform
|
| 31 |
+
|
| 32 |
+
Our managed platform runs both Surya and variants of our highest-accuracy model, [Chandra](https://github.com/datalab-to/chandra).
|
| 33 |
+
|
| 34 |
+
Get started with **$5 in free credits** — [sign up](https://www.datalab.to/?utm_source=gh-surya) (takes under 30 seconds) or try our free [public playground](https://www.datalab.to/playground?utm_source=gh-surya).
|
| 35 |
+
|
| 36 |
+
Commercial self-hosting of the model weights requires a license; see [Commercial usage](#commercial-usage). For on-prem licensing, [contact us](https://www.datalab.to/contact?utm_source=gh-surya-onprem). For high-volume workloads, we offer a batch processing service that can process 1B+ pages per week.
|
| 37 |
+
|
| 38 |
+
## Model Information
|
| 39 |
|
| 40 |
<img src="olmocr_size_chart.png" width="700"/>
|
| 41 |
|
| 42 |
|
| 43 |
+
| Detection | OCR |
|
| 44 |
+
|:----------------------------------------------------------------:|:-----------------------------------------------------------------------:|
|
| 45 |
+
| <img src="excerpt.png" width="280"/> | <img src="excerpt_text.png" width="280"/> |
|
| 46 |
|
| 47 |
+
| Layout | Table Recognition |
|
| 48 |
+
|:------------------------------------------------------------------:|:-------------------------------------------------------------:|
|
| 49 |
+
| <img src="excerpt_layout.png" width="280"/> | <img src="scanned_tablerec.png" width="280"/> |
|
| 50 |
|
|
|
|
|
|
|
|
|
|
| 51 |
|
| 52 |
+
Surya is named for the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who has universal vision.
|
| 53 |
+
|
| 54 |
+
## Examples
|
| 55 |
+
|
| 56 |
+
Each row links to five annotated views of the same page: text-line detection, OCR, layout, reading order, and (when present) table recognition.
|
| 57 |
+
|
| 58 |
+
| Name | Detection | OCR | Layout | Order | Table Rec |
|
| 59 |
+
|-------------------|:-----------------------------------:|------------------------------------------:|---------------------------------------------:|------------------------------------------------:|------------------------------------------------:|
|
| 60 |
+
| Newspaper | [Image](newspaper.png) | [Image](newspaper_text.png) | [Image](newspaper_layout.png) | [Image](newspaper_reading.png) | |
|
| 61 |
+
| Textbook | [Image](textbook.png) | [Image](textbook_text.png) | [Image](textbook_layout.png) | [Image](textbook_reading.png) | |
|
| 62 |
+
| Tax Form | [Image](form.png) | [Image](form_text.png) | [Image](form_layout.png) | [Image](form_reading.png) | [Image](form_tablerec.png) |
|
| 63 |
+
| Handwritten Notes | [Image](handwritten.png) | [Image](handwritten_text.png) | [Image](handwritten_layout.png) | [Image](handwritten_reading.png) | [Image](handwritten_tablerec.png) |
|
| 64 |
+
| Corporate Doc | [Image](corporate.png) | [Image](corporate_text.png) | [Image](corporate_layout.png) | [Image](corporate_reading.png) | [Image](corporate_tablerec.png) |
|
| 65 |
|
| 66 |
+
# Commercial usage
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
|
| 68 |
+
The Surya code is licensed under Apache 2.0. The model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $5M funding/revenue). For broader commercial licensing of the model weights, visit our pricing page [here](https://www.datalab.to/pricing?utm_source=gh-surya).
|
|
|
|
|
|
|
|
|
|
| 69 |
|
| 70 |
+
# Installation
|
| 71 |
+
|
| 72 |
+
Install with:
|
| 73 |
|
| 74 |
```shell
|
| 75 |
pip install surya-ocr
|
|
|
|
| 76 |
```
|
| 77 |
|
| 78 |
+
## Upgrading from Surya v1
|
| 79 |
+
|
| 80 |
+
If you have v1 code, migrate it to the v2 API as follows:
|
| 81 |
+
|
| 82 |
+
```python
|
| 83 |
+
# v2
|
| 84 |
+
from surya.inference import SuryaInferenceManager
|
| 85 |
+
from surya.recognition import RecognitionPredictor
|
| 86 |
+
|
| 87 |
+
manager = SuryaInferenceManager() # auto-spawns vllm or llama-server
|
| 88 |
+
rec = RecognitionPredictor(manager)
|
| 89 |
+
predictions = rec([image])
|
| 90 |
+
```
|
| 91 |
+
|
| 92 |
+
What's different:
|
| 93 |
+
- `SuryaInferenceManager` replaces `FoundationPredictor`. The same manager instance is shared across `LayoutPredictor`, `RecognitionPredictor`, and `TableRecPredictor`.
|
| 94 |
+
- Output schemas changed: see the per-section JSON tables below. Highlights — `text_lines` → `blocks` (with `html`); layout dropped `top_k`, added `count`; table_rec dropped `is_header` / `colspan` / `rowspan` from cells.
|
| 95 |
+
|
| 96 |
+
# Usage
|
| 97 |
+
|
| 98 |
+
Surya 2 runs layout, OCR, and table recognition through a single VLM served
|
| 99 |
+
by `vllm` (GPU) or `llama.cpp` (CPU / Apple Silicon). The inference manager
|
| 100 |
+
will spawn one for you on first use; you can also point it at an existing
|
| 101 |
+
server via `SURYA_INFERENCE_URL=http://host:port/v1`.
|
| 102 |
+
|
| 103 |
+
- Inspect the settings in `surya/settings.py`. You can override any setting via env var (e.g. `SURYA_INFERENCE_BACKEND=vllm`).
|
| 104 |
+
- Text detection and OCR error detection use separate models.
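
The environment overrides described above can be collected in one place before launching a Surya command. The following is a minimal sketch, not part of Surya: the helper name `surya_env` and the `http://localhost:8000/v1` address are hypothetical, and only the two variable names (`SURYA_INFERENCE_URL`, `SURYA_INFERENCE_BACKEND`) come from the text.

```python
import os


def surya_env(inference_url=None, backend=None, base_env=None):
    """Return a copy of the environment with Surya overrides applied.

    `surya_env` is a hypothetical helper for illustration only.
    """
    # Copy so the caller's environment is never mutated.
    env = dict(os.environ if base_env is None else base_env)
    if inference_url is not None:
        # Point the inference manager at an already-running server.
        env["SURYA_INFERENCE_URL"] = inference_url
    if backend is not None:
        # Force a specific backend, e.g. "vllm".
        env["SURYA_INFERENCE_BACKEND"] = backend
    return env


# An empty base_env keeps the printed result small; the URL is a placeholder.
env = surya_env("http://localhost:8000/v1", backend="vllm", base_env={})
print(env)
```

In practice the default `base_env=None` would be the usual choice, so that `PATH` and other variables are inherited, and the resulting dict could be passed as the `env=` argument of `subprocess.run` when invoking a command such as `surya_ocr`.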
|
| 105 |
+
|
| 106 |
+
## Interactive App
|
| 107 |
+
|
| 108 |
+
I've included a Streamlit app that lets you try Surya interactively on images or PDF files. Run it with:
|
| 109 |
|
| 110 |
```shell
|
| 111 |
+
pip install streamlit pdftext
|
| 112 |
surya_gui
|
| 113 |
```
|
| 114 |
|
| 115 |
+
## OCR (text recognition)
|
| 116 |
|
| 117 |
+
This command will write out a json file with the detected text and bboxes:
|
| 118 |
+
|
| 119 |
+
```shell
|
| 120 |
+
surya_ocr DATA_PATH
|
| 121 |
+
```
|
| 122 |
+
|
| 123 |
+
- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
|
| 124 |
+
- `--images` will save images of the pages and detected blocks (optional)
|
| 125 |
+
- `--output_dir` specifies the directory to save results to instead of the default
|
| 126 |
+
- `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
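
To make the `--page_range` syntax concrete, here is a small sketch of how a spec such as `0,5-10,20` could be expanded into page indices. It is an illustration, not Surya's own parser: the function name `parse_page_range` is hypothetical, and treating a range like `5-10` as inclusive of both ends is an assumption.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page indices."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            # Assumption: both ends of the range are included.
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))  # [0, 5, 6, 7, 8, 9, 10, 20]
```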
|
| 127 |
+
|
| 128 |
+
The `results.json` file contains a dict keyed by input filename (no extension). Each value is a list of page dicts. Each page dict contains:
|
| 129 |
+
|
| 130 |
+
- `blocks` - per-block OCR results in reading order
|
| 131 |
+
- `label` - canonicalized layout label (e.g. `Text`, `SectionHeader`, `Table`, `Equation`, `Picture`, `Form`, `PageHeader`, ...). See `surya/layout/label.py:LAYOUT_PRED_RELABEL` for the full canonical-name set.
|
| 132 |
+
- `raw_label` - original label emitted by the model, before canonicalization
|
| 133 |
+
- `reading_order` - 0-indexed position in layout output
|
| 134 |
+
- `html` - block content as HTML (math wrapped in `<math>...</math>`, tables as `<table>...</table>`, etc.). `""` if the block was skipped
|
| 135 |
+
- `polygon` - 4-corner polygon in `[[x0,y0],[x1,y0],[x1,y1],[x0,y1]]` order
|
| 136 |
+
- `bbox` - axis-aligned `[x0, y0, x1, y1]` derived from the polygon
|
| 137 |
+
- `confidence` - mean per-token probability across the block's decode (0-1)
|
| 138 |
+
- `skipped` - true if the block was a visual label (e.g. Picture) and not OCR'd
|
| 139 |
+
- `error` - true if the block OCR call failed
|
| 140 |
+
- `image_bbox` - `[0, 0, width, height]` for the page image
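
As a rough illustration of consuming this structure, the sketch below walks a results dict and yields the OCR'd blocks in reading order. It assumes the fields from `label` through `error` are per-block and `image_bbox` is per-page, which is how the list above reads; the helper name `iter_text_blocks` and the sample data are made up for the example.

```python
def iter_text_blocks(results, min_confidence=0.0):
    """Yield (filename, page_index, block) for OCR'd blocks in reading order."""
    for filename, pages in results.items():
        for page_index, page in enumerate(pages):
            ordered = sorted(page["blocks"], key=lambda b: b["reading_order"])
            for block in ordered:
                # Skip visual-only blocks and blocks whose OCR call failed.
                if block["skipped"] or block["error"]:
                    continue
                if block["confidence"] < min_confidence:
                    continue
                yield filename, page_index, block


# Synthetic data shaped like the fields listed above; not real Surya output.
sample_results = {
    "document": [
        {
            "image_bbox": [0, 0, 800, 1000],
            "blocks": [
                {"label": "Text", "raw_label": "Text", "reading_order": 1,
                 "html": "<p>Hello world</p>",
                 "polygon": [[10, 60], [700, 60], [700, 120], [10, 120]],
                 "bbox": [10, 60, 700, 120],
                 "confidence": 0.97, "skipped": False, "error": False},
                {"label": "SectionHeader", "raw_label": "SectionHeader", "reading_order": 0,
                 "html": "<h1>Title</h1>",
                 "polygon": [[10, 10], [700, 10], [700, 50], [10, 50]],
                 "bbox": [10, 10, 700, 50],
                 "confidence": 0.99, "skipped": False, "error": False},
                {"label": "Picture", "raw_label": "Picture", "reading_order": 2,
                 "html": "",
                 "polygon": [[10, 130], [700, 130], [700, 400], [10, 400]],
                 "bbox": [10, 130, 700, 400],
                 "confidence": 0.90, "skipped": True, "error": False},
            ],
        }
    ]
}

for filename, page_index, block in iter_text_blocks(sample_results, min_confidence=0.5):
    print(filename, page_index, block["label"], block["html"])
```

With a real run, `sample_results` would be replaced by the dict obtained from `json.load` on the `results.json` file.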
|
| 141 |
+
|
| 142 |
+
**Performance tips**
|
| 143 |
+
|
| 144 |
+
Throughput is governed by the inference backend, not a `RECOGNITION_BATCH_SIZE` env var. With `vllm`, raise `--max-num-seqs` / `--max-num-batched-tokens` (or `SURYA_INFERENCE_PARALLEL` on the client side) to keep more pages in flight. With `llama.cpp`, set `SURYA_INFERENCE_PARALLEL` to match `--parallel` on `llama-server`.
|
| 145 |
+
|
| 146 |
+
### From python
|
| 147 |
|
| 148 |
```python
|
| 149 |
from PIL import Image
|
| 150 |
from surya.inference import SuryaInferenceManager
|
| 151 |
from surya.recognition import RecognitionPredictor
|
| 152 |
|
| 153 |
+
manager = SuryaInferenceManager()
|
| 154 |
+
recognition_predictor = RecognitionPredictor(manager)
|
| 155 |
+
|
| 156 |
+
# Default: full-page OCR. One VLM call per page; returns layout + content as
|
| 157 |
+
# HTML <div data-bbox=... data-label=...> blocks.
|
| 158 |
+
predictions = recognition_predictor([Image.open(IMAGE_PATH)])
|
| 159 |
|
| 160 |
+
# Block mode: pre-run layout, then per-block OCR. Auto-selected when
|
| 161 |
+
# `layout_results` is passed.
|
| 162 |
+
from surya.layout import LayoutPredictor
|
| 163 |
+
layout = LayoutPredictor(manager)
|
| 164 |
+
layouts = layout([Image.open(IMAGE_PATH)])
|
| 165 |
+
predictions = recognition_predictor([Image.open(IMAGE_PATH)], layouts)
|
| 166 |
+
```
|
| 167 |
+
|
| 168 |
+
|
| 169 |
+
## Text line detection
|
| 170 |
+
|
| 171 |
+
This command will write out a json file with the detected bboxes.
|
| 172 |
+
|
| 173 |
+
```shell
|
| 174 |
+
surya_detect DATA_PATH
|
| 175 |
+
```
|
| 176 |
+
|
| 177 |
+
- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
|
| 178 |
+
- `--images` will save images of the pages and detected text lines (optional)
|
| 179 |
+
- `--output_dir` specifies the directory to save results to instead of the default
|
| 180 |
+
- `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
|
| 181 |
+
|
| 182 |
+
The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:
|
| 183 |
+
|
| 184 |
+
- `bboxes` - detected bounding boxes for text
|
| 185 |
+
- `bbox` - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
|
| 186 |
+
- `polygon` - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
|
| 187 |
+
- `confidence` - the confidence of the model in the detected text (0-1)
|
| 188 |
+
- `vertical_lines` - vertical lines detected in the document
|
| 189 |
+
- `bbox` - the axis-aligned line coordinates.
|
| 190 |
+
- `page` - the page number in the file
|
| 191 |
+
- `image_bbox` - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
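
The relationship between `polygon`, `bbox`, and `image_bbox` can be checked with a few lines of code. This is a sketch on synthetic data: the helper names `polygon_to_bbox` and `is_contained` are hypothetical, and the sample page includes only the fields the example uses.

```python
def polygon_to_bbox(polygon):
    """Return the axis-aligned [x1, y1, x2, y2] box enclosing a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return [min(xs), min(ys), max(xs), max(ys)]


def is_contained(bbox, image_bbox):
    """True if bbox lies entirely within image_bbox (both x1, y1, x2, y2)."""
    return (
        image_bbox[0] <= bbox[0]
        and image_bbox[1] <= bbox[1]
        and bbox[2] <= image_bbox[2]
        and bbox[3] <= image_bbox[3]
    )


# Synthetic page dict with only the fields used here; not real Surya output.
page = {
    "bboxes": [
        {"polygon": [[12, 20], [300, 22], [300, 48], [12, 46]],
         "bbox": [12, 20, 300, 48],
         "confidence": 0.95},
    ],
    "image_bbox": [0, 0, 612, 792],
}

for line in page["bboxes"]:
    derived = polygon_to_bbox(line["polygon"])
    print(derived, is_contained(derived, page["image_bbox"]))
```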
|
| 192 |
+
|
| 193 |
+
**Performance tips**
|
| 194 |
+
|
| 195 |
+
Detection is a torch model. `DETECTOR_BATCH_SIZE` defaults to an auto-picked value at runtime; override the env var to control VRAM usage on GPU and raise it on larger cards.
|
| 196 |
+
|
| 197 |
+
### From python
|
| 198 |
+
|
| 199 |
+
```python
|
| 200 |
+
from PIL import Image
|
| 201 |
+
from surya.detection import DetectionPredictor
|
| 202 |
+
|
| 203 |
+
det_predictor = DetectionPredictor()
|
| 204 |
+
predictions = det_predictor([Image.open(IMAGE_PATH)])
|
| 205 |
+
```
|
| 206 |
+
|
| 207 |
+
## Layout and reading order
|
| 208 |
+
|
| 209 |
+
This command will write out a json file with the detected layout and reading order.
|
| 210 |
+
|
| 211 |
+
```shell
|
| 212 |
+
surya_layout DATA_PATH
|
| 213 |
```
|
| 214 |
|
| 215 |
+
- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
|
| 216 |
+
- `--images` will save images of the pages and detected text lines (optional)
|
| 217 |
+
- `--output_dir` specifies the directory to save results to instead of the default
|
| 218 |
+
- `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
|
| 219 |
+
|
| 220 |
+
The `results.json` file contains a dict keyed by input filename (no extension). Each value is a list of page dicts. Each page dict contains:
|
| 221 |
+
|
| 222 |
+
- `bboxes` - layout boxes in reading order
|
| 223 |
+
- `polygon` - 4-corner polygon `[[x0,y0],[x1,y0],[x1,y1],[x0,y1]]`
|
| 224 |
+
- `bbox` - axis-aligned `[x0, y0, x1, y1]` derived from the polygon
|
| 225 |
+
- `label` - canonicalized label. One of `Caption`, `Footnote`, `Equation`, `ListGroup`, `PageHeader`, `PageFooter`, `Picture`, `SectionHeader`, `Table`, `Text`, `Figure`, `Code`, `Form`, `TableOfContents`, `ChemicalBlock`, `Diagram`, `Bibliography`, `BlankPage`
|
| 226 |
+
- `raw_label` - original label emitted by the model
|
| 227 |
+
- `position` - 0-indexed reading order
|
| 228 |
+
- `count` - model's token estimate for OCR'ing this block (rounded to multiples of 50; used to size the per-block decode budget)
|
| 229 |
+
- `confidence` - mean per-token probability across the layout decode (0-1)
|
| 230 |
+
- `image_bbox` - `[0, 0, width, height]`
|
| 231 |
+
- `raw` - raw JSON the layout model emitted, for debugging
|
| 232 |
+
- `error` - true if the layout call failed
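
One way to use the layout fields is to summarise a page: the labels in reading order, how often each label occurs, and the total of the per-block `count` estimates. The sketch below assumes `label`, `position`, and `count` are per-box fields, as the list above suggests; `summarize_layout` and the sample page are illustrative only.

```python
from collections import Counter


def summarize_layout(page):
    """Summarise one layout page dict: order, label histogram, token estimate."""
    boxes = sorted(page["bboxes"], key=lambda b: b["position"])
    return {
        "labels_in_order": [b["label"] for b in boxes],
        "label_counts": dict(Counter(b["label"] for b in boxes)),
        # Sum of the model's per-block token estimates for this page.
        "estimated_tokens": sum(b["count"] for b in boxes),
    }


# Synthetic page with only the fields used here; not real Surya output.
page = {
    "bboxes": [
        {"label": "Text", "position": 1, "count": 250},
        {"label": "SectionHeader", "position": 0, "count": 50},
        {"label": "Table", "position": 2, "count": 400},
    ],
    "image_bbox": [0, 0, 800, 1000],
}

print(summarize_layout(page))
```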
|
| 233 |
+
|
| 234 |
+
**Performance tips**
|
| 235 |
+
|
| 236 |
+
Layout runs through the shared inference backend, so throughput tuning is the same as for OCR; see the performance tips in the OCR section above.
|
| 237 |
+
|
| 238 |
+
### From python
|
| 239 |
|
| 240 |
```python
|
| 241 |
+
from PIL import Image
|
| 242 |
+
from surya.inference import SuryaInferenceManager
|
| 243 |
from surya.layout import LayoutPredictor
|
| 244 |
+
|
| 245 |
+
layout_predictor = LayoutPredictor(SuryaInferenceManager())
|
| 246 |
+
layout_predictions = layout_predictor([Image.open(IMAGE_PATH)])
|
| 247 |
```
|
| 248 |
|
| 249 |
+
## Table Recognition
|
| 250 |
+
|
| 251 |
+
This command will write out a json file with the detected table cells and row/column ids, along with row/column bounding boxes. If you want to get cell positions and text, along with nice formatting, check out the [marker](https://github.com/datalab-to/marker) repo. You can use the `TableConverter` to detect and extract tables in images and PDFs. It supports output in json (with bboxes), markdown, and html.
|
| 252 |
+
|
| 253 |
+
```shell
|
| 254 |
+
surya_table DATA_PATH
|
| 255 |
+
```
|
| 256 |
+
|
| 257 |
+
- `DATA_PATH` can be an image, pdf, or folder of images/pdfs
|
| 258 |
+
- `--images` will save annotated row + column overlays alongside the json (optional)
|
| 259 |
+
- `--output_dir` specifies the directory to save results to instead of the default
|
| 260 |
+
- `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
|
| 261 |
+
- `--skip_table_detection` tells table recognition not to detect tables first. Use this if your image is already cropped to a table.

The `results.json` file contains a dict keyed by input filename (no extension). Each value is a list of per-table dicts. Each table dict contains:

- `rows` - detected table rows in reading order
  - `polygon` / `bbox` - row geometry (same convention as everywhere else)
  - `row_id` - 0-indexed row id
- `cols` - detected table columns
  - `polygon` / `bbox` - column geometry
  - `col_id` - 0-indexed column id
- `cells` - geometric row × column intersections (simple mode)
  - `polygon` / `bbox` - cell geometry
  - `row_id`, `col_id`, `cell_id`
- `html` - full `<table>...</table>` HTML (only populated when `predict_full` is used; handles spanning cells / header rows). `null` in simple mode.
- `mode` - `"simple"` or `"full"`
- `image_bbox` - the table crop bbox
- `error` - true if the table_rec call failed
- `raw` - raw model output, for debugging
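
The structure above can be walked with the standard library alone. The sketch below is illustrative: `summarize_tables` is a hypothetical helper, the field names are taken from the list above, and the `results.json` path is an assumption (the real location depends on `--output_dir` and the input name).

```python
import json
from pathlib import Path


def summarize_tables(results: dict) -> list[dict]:
    """Return one summary row per detected table.

    Field names follow the list above; tables with `error` set are skipped.
    """
    summaries = []
    for filename, tables in results.items():
        for index, table in enumerate(tables):
            if table.get("error"):
                continue  # the table_rec call failed for this table
            summaries.append(
                {
                    "file": filename,
                    "table": index,
                    "mode": table.get("mode"),
                    "rows": len(table.get("rows", [])),
                    "cols": len(table.get("cols", [])),
                    "cells": len(table.get("cells", [])),
                    "has_html": table.get("html") is not None,
                }
            )
    return summaries


# Assumed location; adjust to wherever surya_table wrote its output.
results_path = Path("results.json")
if results_path.exists():
    for row in summarize_tables(json.loads(results_path.read_text())):
        print(row)
```

In simple mode `has_html` should come back `False`, since `html` is `null` unless `predict_full` was used.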

**Performance tips**

Table recognition routes through the shared VLM. Throughput tuning is the same as OCR.

### From python

```python
from PIL import Image
from surya.inference import SuryaInferenceManager
from surya.table_rec import TableRecPredictor

table_rec_predictor = TableRecPredictor(SuryaInferenceManager())
image = Image.open(IMAGE_PATH)

# Default: rows + columns only, cells derived from intersections.
table_predictions = table_rec_predictor([image])

# Or full HTML output (better for spanning cells / headers):
# table_predictions = table_rec_predictor.predict_full([image])
```

## Math / equations

Surya 2 handles math inline as part of full-page OCR. Recognized equations
come back inside `<math>...</math>` tags in the same HTML output as the
surrounding prose, in KaTeX-compatible LaTeX, so no separate LaTeX OCR pass is needed.
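
If you need the equations on their own, the `<math>` tags can be pulled out of the HTML after the fact. The following is a minimal sketch using a regular expression; `extract_math` is a hypothetical helper, and it assumes the tags are not nested and that the LaTeX inside them is not HTML-escaped.

```python
import re

# Non-greedy match so that each <math>...</math> pair is captured separately.
# Assumes the tags are not nested; attributes on the opening tag are tolerated.
MATH_TAG = re.compile(r"<math\b[^>]*>(.*?)</math>", re.DOTALL)


def extract_math(html: str) -> list[str]:
    """Return the LaTeX source of every <math> element in an OCR HTML string."""
    return [match.strip() for match in MATH_TAG.findall(html)]


sample = "<p>Energy is <math>E = mc^2</math> and <math>a^2 + b^2 = c^2</math>.</p>"
print(extract_math(sample))
# ['E = mc^2', 'a^2 + b^2 = c^2']
```

If the output does turn out to escape characters such as `<` inside equations, passing each result through `html.unescape` from the standard library would undo that.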

# Inference Backends

Layout, OCR, and table_rec all share one VLM, served either by `vllm` (GPU) or `llama.cpp` (CPU / Apple Silicon). The `SuryaInferenceManager` spawns a server automatically; you can also point it at an already-running server:

```bash
# Attach to an existing vllm server
export SURYA_INFERENCE_BACKEND=vllm
export SURYA_INFERENCE_URL=http://localhost:8000/v1
```

| Setting | Default | Notes |
|---|---|---|
| `SURYA_INFERENCE_BACKEND` | auto (vllm if NVIDIA, else llamacpp) | `vllm` \| `llamacpp` \| unset (auto) |
| `SURYA_INFERENCE_URL` | (auto-spawn) | Attach to a running OpenAI-compatible server |
| `SURYA_INFERENCE_PARALLEL` | 8 | Client-side concurrency to the backend |
| `SURYA_GUIDED_LAYOUT` | true | JSON-schema-constrained layout decode |
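
As a way to read the table, the sketch below collects the four variables with their documented defaults. It is an illustration only: `InferenceSettings` and `read_settings` are hypothetical names, this is not how surya itself parses its settings, and the case-insensitive `true`/`false` handling is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class InferenceSettings:
    backend: Optional[str]  # "vllm", "llamacpp", or None for auto-detection
    url: Optional[str]      # None means "auto-spawn a server"
    parallel: int           # client-side concurrency to the backend
    guided_layout: bool     # JSON-schema-constrained layout decode


def read_settings(env: Mapping[str, str] = os.environ) -> InferenceSettings:
    """Read the documented variables, falling back to the documented defaults."""
    backend = env.get("SURYA_INFERENCE_BACKEND") or None
    if backend not in (None, "vllm", "llamacpp"):
        raise ValueError(f"unsupported backend: {backend!r}")
    return InferenceSettings(
        backend=backend,
        url=env.get("SURYA_INFERENCE_URL") or None,
        parallel=int(env.get("SURYA_INFERENCE_PARALLEL", "8")),
        # Assumption: the flag is written as "true"/"false", case-insensitively.
        guided_layout=env.get("SURYA_GUIDED_LAYOUT", "true").strip().lower() == "true",
    )


# Equivalent to the two `export` lines above.
print(read_settings({
    "SURYA_INFERENCE_BACKEND": "vllm",
    "SURYA_INFERENCE_URL": "http://localhost:8000/v1",
}))
```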

# Limitations

- Surya is specialized for document OCR. Good performance on photos or natural scenes is not a goal.
- Layout, OCR, and table_rec all need a running inference backend (vllm or llama.cpp). Detection runs purely on torch and works without one.

## Troubleshooting

If OCR isn't working properly:

- Try increasing the resolution of the image so the text is bigger. If the resolution is already very high, try decreasing it to no more than `2048px` wide.
- Preprocessing the image (binarizing, deskewing, etc.) can help with very old or blurry images.
- You can adjust `DETECTOR_BLANK_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. `DETECTOR_BLANK_THRESHOLD` controls the space between lines: any prediction below this number is considered blank space. `DETECTOR_TEXT_THRESHOLD` controls how text is joined: any prediction above this number is considered text. `DETECTOR_TEXT_THRESHOLD` should always be higher than `DETECTOR_BLANK_THRESHOLD`, and both should be in the 0-1 range. The heatmap in the detector's debug output can tell you how to adjust them: if you see faint things that look like boxes, lower the thresholds; if you see bboxes being joined together, raise them.
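
For the resolution tip, the target size is simple arithmetic. The sketch below computes dimensions that cap the width at 2048 pixels while preserving the aspect ratio; `fit_width` is a hypothetical helper, not part of surya.

```python
def fit_width(width: int, height: int, max_width: int = 2048) -> tuple[int, int]:
    """Scale (width, height) down so width <= max_width, keeping aspect ratio."""
    if width <= max_width:
        return width, height  # already small enough; leave untouched
    scale = max_width / width
    return max_width, max(1, round(height * scale))


print(fit_width(4096, 6000))
# (2048, 3000)
print(fit_width(1200, 1600))
# (1200, 1600)
```

With Pillow, the result can be passed straight to `resize`, for example `image = image.resize(fit_width(*image.size))`.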

# Manual install

If you want to develop surya, you can install it manually with [uv](https://docs.astral.sh/uv/):

```bash
git clone https://github.com/datalab-to/surya.git
cd surya
uv sync --group dev            # installs runtime + dev deps
uv run surya_ocr ...           # or `source .venv/bin/activate` to enter the venv
```

# Benchmarks

Surya 2 is a single VLM that handles layout analysis, OCR (full-page or
per-block), and table recognition in one model. We evaluate end-to-end on
[olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench), the
standard quality benchmark for document parsers.

## olmOCR-bench

Surya OCR 2 has the highest score of any model under 1B parameters in the table below, and it is Pareto-optimal among models of 3B and below: none of them is both smaller and more accurate.

| Model                       |    Params |    Score |
|-----------------------------|----------:|---------:|
| Infinity-Parser2-Pro        |     35.1B |     87.6 |
| Chandra OCR 2 (Datalab)     |      5.3B |     85.9 |
| dots.mocr                   |      3.0B |     83.9 |
| **Surya OCR 2** (Datalab)   | **0.65B** | **83.3** |
| LightOnOCR 2-1B \*          |      1.0B |     83.2 |
| Chandra OCR 1 (Datalab)     |      9.0B |     83.1 |
| olmOCR (anchored)           |      8.3B |     77.4 |
| GOT OCR                     |      0.6B |     48.3 |

\* **LightOnOCR 2-1B** uses a different benchmark methodology than the other entries (see their [release notes](https://huggingface.co/lightonai/LightOnOCR-2-1B)); the score is included for context but is not directly comparable.

Comparison scores from the [olmOCR-bench dataset card](https://huggingface.co/datasets/allenai/olmOCR-bench).
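
The Pareto claim can be checked directly from the table. The sketch below treats a model as dominated when another model is no larger and scores no lower, with at least one of the two strictly better; the numbers are copied from the table, and `pareto_front` is a hypothetical helper written for this check.

```python
# (model, params in billions, olmOCR-bench score), copied from the table above.
MODELS = [
    ("Infinity-Parser2-Pro", 35.1, 87.6),
    ("Chandra OCR 2", 5.3, 85.9),
    ("dots.mocr", 3.0, 83.9),
    ("Surya OCR 2", 0.65, 83.3),
    ("LightOnOCR 2-1B", 1.0, 83.2),  # different methodology; see the footnote
    ("Chandra OCR 1", 9.0, 83.1),
    ("olmOCR (anchored)", 8.3, 77.4),
    ("GOT OCR", 0.6, 48.3),
]


def pareto_front(models):
    """Keep models that no other model beats on both size and score."""
    front = []
    for name, params, score in models:
        dominated = any(
            other_params <= params
            and other_score >= score
            and (other_params < params or other_score > score)
            for _, other_params, other_score in models
        )
        if not dominated:
            front.append(name)
    return front


print(pareto_front(MODELS))
# ['Infinity-Parser2-Pro', 'Chandra OCR 2', 'dots.mocr', 'Surya OCR 2', 'GOT OCR']
```

Every larger model on the front buys its extra accuracy with more parameters, and the only smaller one (GOT OCR) scores far lower.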

Surya 2, per-source pass rate (%) on the `default` preset (8,413 tests total):

| ArXiv | Base | Headers/Footers | Long tiny text | Multi-column | Old scans | Old scans math | Tables |
|------:|-----:|----------------:|---------------:|-------------:|----------:|---------------:|-------:|
|  88.3 | 99.7 |            92.5 |           93.7 |         82.4 |      41.8 |           81.4 |   86.6 |
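
As a sanity check, the eight per-source rates can be averaged and compared with the headline score. The text above does not say how the headline number is aggregated; the arithmetic below only shows that it is consistent with an unweighted mean over sources.

```python
# Per-source pass rates (%) in the column order of the table above.
PASS_RATES = [88.3, 99.7, 92.5, 93.7, 82.4, 41.8, 81.4, 86.6]

mean = sum(PASS_RATES) / len(PASS_RATES)
print(f"{mean:.1f}")
# 83.3
```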

## Multilingual

We also evaluate Surya 2 against a 91-language internal benchmark covering
text accuracy, layout, tables, math, and reading order in documents drawn
from each language.

**Overall pass rate: 87.2% across 91 languages.** 38 of the
91 languages score ≥ 90%; 76 score ≥ 80%.

Top 15 widely-spoken languages:

| Code | Language    | Score |
|------|-------------|------:|
| `ar` | Arabic      | 72.7% |
| `bn` | Bengali     | 82.7% |
| `zh` | Chinese     | 82.5% |
| `en` | English     | 92.3% |
| `fr` | French      | 89.3% |
| `de` | German      | 89.7% |
| `hi` | Hindi       | 82.2% |
| `it` | Italian     | 93.0% |
| `ja` | Japanese    | 86.2% |
| `ko` | Korean      | 86.7% |
| `fa` | Persian     | 82.3% |
| `pt` | Portuguese  | 86.1% |
| `ru` | Russian     | 88.8% |
| `es` | Spanish     | 90.7% |
| `vi` | Vietnamese  | 73.2% |

See [static/docs/multilingual.md](https://github.com/datalab-to/surya/blob/main/static/docs/multilingual.md) for the full 91-language table.

## Throughput

Full-page OCR, 96 DPI input (~2,400 output tokens/page average), measured
client-side against a running inference server.

### RTX 5090 (vllm)

`vllm/vllm-openai:v0.20.1`, single RTX 5090 (32 GB).

| Concurrency | Pages/s | Tokens/s | p50 (ms) | p95 (ms) | avg tok/page |
|------------:|--------:|---------:|---------:|---------:|-------------:|
|         128 |    5.35 |   12,884 |   18,915 |   42,538 |        2,410 |

### Apple Silicon (llama.cpp / Metal)

`llama-server` with Metal backend.

| `--parallel` | Pages/s | Tokens/s | p50 (ms) | p95 (ms) | avg tok/page | Power |
|-------------:|--------:|---------:|---------:|---------:|-------------:|------:|
|            8 |   0.108 |      254 |   59,313 |  129,173 |        2,360 | ~30 W |
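
The columns in both tables are related by simple arithmetic, which is useful when sizing a batch job. The sketch below derives tokens/s, pages per hour, and (where power is reported) energy per page from the measured rows; `derived_metrics` is a hypothetical helper, and the power figure is the approximate ~30 W from the table.

```python
from typing import Optional


def derived_metrics(
    pages_per_s: float,
    avg_tokens_per_page: float,
    power_w: Optional[float] = None,
) -> dict:
    """Derive secondary figures from one measured throughput row."""
    metrics = {
        "tokens_per_s": pages_per_s * avg_tokens_per_page,
        "pages_per_hour": pages_per_s * 3600,
    }
    if power_w is not None:
        # watts / (pages per second) = joules per page
        metrics["joules_per_page"] = power_w / pages_per_s
    return metrics


rtx_5090 = derived_metrics(5.35, 2410)
apple = derived_metrics(0.108, 2360, power_w=30)

print(f"RTX 5090: {rtx_5090['tokens_per_s']:.0f} tok/s, {rtx_5090['pages_per_hour']:.0f} pages/hour")
print(f"Apple Silicon: {apple['tokens_per_s']:.0f} tok/s, {apple['joules_per_page']:.0f} J/page")
```

Pages/s times average tokens per page lands within about half a percent of the reported Tokens/s in both rows (roughly 12,890 vs 12,884 and 255 vs 254), and the RTX 5090 row works out to about 19,000 pages per hour.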

## Reproducing

We score Surya 2 on olmOCR-bench by serving the model with `vllm` (or
`llama.cpp`) and running the olmOCR-bench harness from
[allenai/olmocr](https://github.com/allenai/olmocr), with some adjustments applied to account for our output HTML format.

# Training

Layout, OCR, and table recognition all share a single vision-language model
(Qwen3.5-style architecture, ~650M params). It's trained on diverse document
images to emit either a layout JSON or a full-page HTML output, depending on
the prompt. Text-line detection is a separate small torch model: a modified
EfficientViT segformer trained from scratch on document line annotations.

If you want help finetuning Surya on your own data, or to use our managed
training stack, reach us at hi@datalab.to.

# Thanks

This work would not have been possible without amazing open source AI work:

- [Qwen3-VL](https://huggingface.co/Qwen) from Alibaba
- [vllm](https://github.com/vllm-project/vllm) and [llama.cpp](https://github.com/ggerganov/llama.cpp) for inference
- [Segformer](https://arxiv.org/pdf/2105.15203.pdf) from NVIDIA
- [EfficientViT](https://github.com/mit-han-lab/efficientvit) from MIT
- [timm](https://github.com/huggingface/pytorch-image-models) from Ross Wightman
- [transformers](https://github.com/huggingface/transformers) from huggingface
- [CRAFT](https://github.com/clovaai/CRAFT-pytorch), a great scene text detection model

Thank you to everyone who makes open source AI possible.

# Citation

If you use surya (or the associated models) in your work or research, please consider citing us using the following BibTeX entry:

```bibtex
@misc{paruchuri2025surya,
  author       = {Vikas Paruchuri and Datalab Team},
  title        = {Surya: A lightweight document OCR and analysis toolkit},
  year         = {2025},
  howpublished = {\url{https://github.com/datalab-to/surya}},
  note         = {GitHub repository},
}
```