vikp committed on
Commit
41f0836
·
verified ·
1 Parent(s): 053b963

Sync README + screenshots + chart; bump license threshold to $5M

.gitattributes CHANGED
@@ -38,3 +38,26 @@ excerpt_text.png filter=lfs diff=lfs merge=lfs -text
38
  excerpt_layout.png filter=lfs diff=lfs merge=lfs -text
39
  scanned_tablerec.png filter=lfs diff=lfs merge=lfs -text
40
  olmocr_size_chart.png filter=lfs diff=lfs merge=lfs -text
38
  excerpt_layout.png filter=lfs diff=lfs merge=lfs -text
39
  scanned_tablerec.png filter=lfs diff=lfs merge=lfs -text
40
  olmocr_size_chart.png filter=lfs diff=lfs merge=lfs -text
41
+ newspaper.png filter=lfs diff=lfs merge=lfs -text
42
+ newspaper_text.png filter=lfs diff=lfs merge=lfs -text
43
+ newspaper_layout.png filter=lfs diff=lfs merge=lfs -text
44
+ newspaper_reading.png filter=lfs diff=lfs merge=lfs -text
45
+ textbook.png filter=lfs diff=lfs merge=lfs -text
46
+ textbook_text.png filter=lfs diff=lfs merge=lfs -text
47
+ textbook_layout.png filter=lfs diff=lfs merge=lfs -text
48
+ textbook_reading.png filter=lfs diff=lfs merge=lfs -text
49
+ form.png filter=lfs diff=lfs merge=lfs -text
50
+ form_text.png filter=lfs diff=lfs merge=lfs -text
51
+ form_layout.png filter=lfs diff=lfs merge=lfs -text
52
+ form_reading.png filter=lfs diff=lfs merge=lfs -text
53
+ form_tablerec.png filter=lfs diff=lfs merge=lfs -text
54
+ handwritten.png filter=lfs diff=lfs merge=lfs -text
55
+ handwritten_text.png filter=lfs diff=lfs merge=lfs -text
56
+ handwritten_layout.png filter=lfs diff=lfs merge=lfs -text
57
+ handwritten_reading.png filter=lfs diff=lfs merge=lfs -text
58
+ handwritten_tablerec.png filter=lfs diff=lfs merge=lfs -text
59
+ corporate.png filter=lfs diff=lfs merge=lfs -text
60
+ corporate_text.png filter=lfs diff=lfs merge=lfs -text
61
+ corporate_layout.png filter=lfs diff=lfs merge=lfs -text
62
+ corporate_reading.png filter=lfs diff=lfs merge=lfs -text
63
+ corporate_tablerec.png filter=lfs diff=lfs merge=lfs -text
LICENSE CHANGED
@@ -53,7 +53,7 @@ As conditions to the Licenses set forth in this Agreement, You agree not to use,
53
  (a) In any way that violates any applicable national, federal, state, local or international law or regulation; or
54
  (b) to directly or indirectly infringe or misappropriate any third party intellectual property rights (including those of Licensor or any Contributor)
55
  2. Commercial:
56
- (a) for any purpose if You (your employer, or the entity you are affiliated with) generated more than two million US Dollars ($2,000,000) in gross revenue in the prior year, except where Your Use is limited to personal use or research purposes;
57
- (b) for any purpose if You (your employer, or the entity you are affiliated with) has raised more than two million US dollars ($2,000,000) in total equity or debt funding from any source, except where Your Use is limited to personal use or research purposes; or
58
  (c) for any purpose if You (your employer, or the entity you are affiliated with) provides or otherwise makes available any product or service that competes with any product or service offered by or made available by Licensor or any of its affiliates.
59
  Commercial and broader use licenses may be available from Licensor at the following URL: https://www.datalab.to/
 
53
  (a) In any way that violates any applicable national, federal, state, local or international law or regulation; or
54
  (b) to directly or indirectly infringe or misappropriate any third party intellectual property rights (including those of Licensor or any Contributor)
55
  2. Commercial:
56
+ (a) for any purpose if You (your employer, or the entity you are affiliated with) generated more than five million US Dollars ($5,000,000) in gross revenue in the prior year, except where Your Use is limited to personal use or research purposes;
57
+ (b) for any purpose if You (your employer, or the entity you are affiliated with) has raised more than five million US dollars ($5,000,000) in total equity or debt funding from any source, except where Your Use is limited to personal use or research purposes; or
58
  (c) for any purpose if You (your employer, or the entity you are affiliated with) provides or otherwise makes available any product or service that competes with any product or service offered by or made available by Licensor or any of its affiliates.
59
  Commercial and broader use licenses may be available from Licensor at the following URL: https://www.datalab.to/
README.md CHANGED
@@ -1,143 +1,473 @@
1
- ---
2
- library_name: transformers
3
- license: openrail
4
- license_link: LICENSE
5
- pipeline_tag: image-text-to-text
6
- tags:
7
- - ocr
8
- - pdf
9
- - layout
10
- - table
11
- - document-intelligence
12
- ---
 
 
13
 
14
- # Surya OCR 2
15
 
16
- Surya 2 is a 690M-parameter open-source document OCR model from [Datalab](https://www.datalab.to) that does full-page OCR with layout, line-level text detection, layout analysis with reading order, and table recognition — all from a single VLM.
17
 
18
- It ranks near the top of [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench), competitive with models 5–50× larger.
19
 
20
- Try Surya in the [free playground](https://www.datalab.to/playground), or use the [hosted API](https://www.datalab.to/) for higher throughput and uptime.
 
 
 
 
21
 
22
- ## Benchmarks
23
 
24
- ### olmOCR-bench
 
 
 
 
 
 
 
 
25
 
26
  <img src="olmocr_size_chart.png" width="700"/>
27
 
28
- | Model | Params | Score |
29
- |-----------------------------|----------:|---------:|
30
- | Infinity-Parser2-Pro | 35.1B | 87.6 |
31
- | Chandra OCR 2 (Datalab) | 5.3B | 85.9 |
32
- | dots.mocr | 3.0B | 83.9 |
33
- | LightOnOCR 2-1B \* | 1.0B | 83.2 |
34
- | **Surya OCR 2** (Datalab) | **0.69B** | **83.1** |
35
- | Chandra OCR 1 (Datalab) | 9.0B | 83.1 |
36
- | olmOCR (anchored) | 8.3B | 77.4 |
37
- | GOT OCR | 0.6B | 48.3 |
38
 
39
- \* LightOnOCR 2-1B uses a different evaluation methodology than the other entries (see their [release notes](https://huggingface.co/lightonai/LightOnOCR-2-1B)); included for context but not directly comparable.
 
 
40
 
41
- Per-source pass rate on the olmOCR-bench `default` preset (8,413 tests total):
 
 
42
 
43
- | ArXiv | Base | Hdr/Ftr | TinyTxt | MultCol | OldScan | OldMath | Tables |
44
- |------:|-----:|--------:|--------:|--------:|--------:|--------:|-------:|
45
- | 88.7 | 99.9 | 92.1 | 86.4 | 82.6 | 42.8 | 85.8 | 86.6 |
46
 
47
- ## Features
 
 
 
 
 
 
 
 
 
 
 
 
48
 
49
- - Full-page OCR with layout in a single VLM call per page
50
- - Line-level text detection (separate small torch model)
51
- - Layout analysis (Caption / Section-Header / Table / Equation / etc.) with reading order
52
- - Table recognition: rows + columns (simple mode) or full `<table>` HTML with spanning cells (full mode)
53
- - Inline math in `<math>…</math>` tags (KaTeX-compatible LaTeX) — no separate LaTeX OCR pass
54
- - Two backends: `vllm` for NVIDIA GPUs, `llama.cpp` for Apple Silicon / CPU
55
 
56
- | | |
57
- |:---:|:---:|
58
- | <img src="excerpt.png" width="320"/> | <img src="excerpt_text.png" width="320"/> |
59
- | <img src="excerpt_layout.png" width="320"/> | <img src="scanned_tablerec.png" width="320"/> |
60
 
61
- ## Quickstart
 
 
62
 
63
  ```shell
64
  pip install surya-ocr
65
- surya_ocr path/to/document.pdf # writes results.json with layout + text per page
66
  ```
67
 
68
- Or run the interactive Streamlit demo:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
69
 
70
  ```shell
 
71
  surya_gui
72
  ```
73
 
74
- The first inference call auto-spawns the backend (`vllm` Docker on NVIDIA hosts, `llama-server` on Apple Silicon / CPU) and pulls the model weights from this repo. Subsequent calls reuse the running server.
75
 
76
- ## Usage
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
77
 
78
  ```python
79
  from PIL import Image
80
  from surya.inference import SuryaInferenceManager
81
  from surya.recognition import RecognitionPredictor
82
 
83
- manager = SuryaInferenceManager() # auto-spawns vllm or llama-server
84
- rec = RecognitionPredictor(manager)
 
 
 
 
85
 
86
- results = rec([Image.open("page.png")]) # full-page OCR
87
- for blk in results[0].blocks:
88
- print(blk.label, blk.html[:80])
89
  ```
90
 
91
- Block-mode OCR (one VLM call per layout block) is auto-selected when you pass a `LayoutResult`:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
92
 
93
  ```python
 
 
94
  from surya.layout import LayoutPredictor
95
- layout = LayoutPredictor(manager)
96
- layouts = layout([Image.open("page.png")])
97
- results = rec([Image.open("page.png")], layouts)
98
  ```
99
 
100
- Table recognition:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
101
 
102
  ```python
 
 
103
  from surya.table_rec import TableRecPredictor
104
- table = TableRecPredictor(manager)
105
 
106
- table_predictions = table([Image.open("page.png")]) # rows + columns only
107
- table_predictions = table.predict_full([Image.open("page.png")]) # full <table> HTML
 
 
 
 
 
108
  ```
109
 
110
  ## Throughput
111
 
112
- Full-page OCR, 96 DPI input (~2,400 output tokens/page average), measured client-side against a running inference server.
 
113
 
114
  ### RTX 5090 (vllm)
115
 
116
- `vllm/vllm-openai:v0.20.1`, single RTX 5090 (32 GB). Sustained power ~478 W (80% of 600 W TDP) across all concurrencies.
117
-
118
- | Concurrency | Pages/s | Tokens/s | p50 (ms) | p95 (ms) | avg tok/page |
119
- |---:|---:|---:|---:|---:|---:|
120
- | 32 | 3.67 | 8,870 | 6,744 | 21,741 | 2,420 |
121
- | 64 | 4.67 | 11,280 | 10,741 | 34,639 | 2,414 |
122
- | **128** | **5.35** | **12,884** | 18,915 | 42,538 | 2,410 |
123
 
124
- Throughput climbs from conc=32 to conc=128, but latency grows faster than capacity. Pick conc=64 for the latency/throughput knee, conc=128 for max throughput.
 
 
125
 
126
  ### Apple Silicon (llama.cpp / Metal)
127
 
128
  `llama-server` with Metal backend.
129
 
130
- | `--parallel` | Pages/s | Tokens/s | p50 (ms) | p95 (ms) | avg tok/page | Power |
131
- |---:|---:|---:|---:|---:|---:|---:|
132
- | **8** | **0.108** | **254** | 59,313 | 129,173 | 2,360 | ~30 W |
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
133
 
134
- ## Commercial Usage
135
 
136
- Code is Apache 2.0. Model weights use a modified OpenRAIL-M license — free for research, personal use, and startups under $2M funding/revenue. Cannot be used competitively with the Datalab API. For broader commercial licensing, see [pricing](https://www.datalab.to/pricing?utm_source=hf-surya).
137
 
138
- ## Credits
139
 
140
- - [Hugging Face Transformers](https://github.com/huggingface/transformers)
141
- - [vLLM](https://github.com/vllm-project/vllm) and [llama.cpp](https://github.com/ggerganov/llama.cpp)
142
- - [Qwen 3.5](https://github.com/QwenLM/Qwen3) (architecture basis)
143
- - [olmOCR](https://github.com/allenai/olmocr) (benchmark + harness)
 
 
 
 
 
1
+ <h1 align="center">Datalab</h1>
2
+ <p align="center">
3
+ <strong>State of the Art models for Document Intelligence</strong>
4
+ </p>
5
+ <p align="center">
6
+ <a href="https://www.apache.org/licenses/LICENSE-2.0"><img src="https://img.shields.io/badge/Code%20License-Apache--2.0-green.svg" alt="Code License"></a>
7
+ <a href="https://www.datalab.to/pricing"><img src="https://img.shields.io/badge/Model%20License-OpenRAIL--M-blue.svg" alt="Model License"></a>
8
+ <a href="https://discord.gg/KuZwXNGnfH"><img src="https://img.shields.io/badge/Discord-Join%20us-5865F2?logo=discord&logoColor=white" alt="Discord"></a>
9
+ </p>
10
+ <p align="center">
11
+ <a href="https://www.datalab.to"><img src="https://img.shields.io/badge/Homepage-datalab.to-blue" alt="Homepage"></a>
12
+ <a href="https://documentation.datalab.to"><img src="https://img.shields.io/badge/Docs-Read%20the%20docs-blue" alt="Docs"></a>
13
+ <a href="https://www.datalab.to/playground"><img src="https://img.shields.io/badge/Datalab Playground-Try%20it-orange" alt="Datalab Playground"></a>
14
+ </p>
15
 
16
+ <hr/>
17
 
18
+ # Surya
19
 
20
+ Surya is an OCR toolkit powered by a 650M param model that does:
21
 
22
+ - Full-page OCR, scoring 83.3% on [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) (top under 3B params)
23
+ - Multilingual OCR - scores 87.2% on an internal benchmark set of 91 languages (more [here](#multilingual))
24
+ - Line-level text detection
25
+ - Layout analysis (table, image, header, etc.) with reading order
26
+ - Table recognition (rows + columns)
27
 
28
+ It works on a range of documents (see [usage](#usage) and [benchmarks](#benchmarks)).
29
 
30
+ ## Try Datalab's Managed Platform
31
+
32
+ Our managed platform runs both Surya and variants of our highest-accuracy model, [Chandra](https://github.com/datalab-to/chandra).
33
+
34
+ Get started with **$5 in free credits** — [sign up](https://www.datalab.to/?utm_source=gh-surya) (takes under 30 seconds) or try our free [public playground](https://www.datalab.to/playground?utm_source=gh-surya).
35
+
36
+ Commercial self-hosting of the model weights requires a license; see [Commercial usage](#commercial-usage). For on-prem licensing, [contact us](https://www.datalab.to/contact?utm_source=gh-surya-onprem). If you have high-volume workloads, we offer a batch processing service that can process 1B+ pages per week.
37
+
38
+ ## Model Information
39
 
40
  <img src="olmocr_size_chart.png" width="700"/>
41
 
 
 
 
 
 
 
 
 
 
 
42
 
43
+ | Detection | OCR |
44
+ |:----------------------------------------------------------------:|:-----------------------------------------------------------------------:|
45
+ | <img src="excerpt.png" width="280"/> | <img src="excerpt_text.png" width="280"/> |
46
 
47
+ | Layout | Table Recognition |
48
+ |:------------------------------------------------------------------:|:-------------------------------------------------------------:|
49
+ | <img src="excerpt_layout.png" width="280"/> | <img src="scanned_tablerec.png" width="280"/> |
50
 
 
 
 
51
 
52
+ Surya is named for the [Hindu sun god](https://en.wikipedia.org/wiki/Surya), who has universal vision.
53
+
54
+ ## Examples
55
+
56
+ Each row links to up to five annotated views of the same page: text-line detection, OCR, layout, reading order, and (when present) table recognition.
57
+
58
+ | Name | Detection | OCR | Layout | Order | Table Rec |
59
+ |-------------------|:-----------------------------------:|------------------------------------------:|---------------------------------------------:|------------------------------------------------:|------------------------------------------------:|
60
+ | Newspaper | [Image](newspaper.png) | [Image](newspaper_text.png) | [Image](newspaper_layout.png) | [Image](newspaper_reading.png) | |
61
+ | Textbook | [Image](textbook.png) | [Image](textbook_text.png) | [Image](textbook_layout.png) | [Image](textbook_reading.png) | |
62
+ | Tax Form | [Image](form.png) | [Image](form_text.png) | [Image](form_layout.png) | [Image](form_reading.png) | [Image](form_tablerec.png) |
63
+ | Handwritten Notes | [Image](handwritten.png) | [Image](handwritten_text.png) | [Image](handwritten_layout.png) | [Image](handwritten_reading.png) | [Image](handwritten_tablerec.png) |
64
+ | Corporate Doc | [Image](corporate.png) | [Image](corporate_text.png) | [Image](corporate_layout.png) | [Image](corporate_reading.png) | [Image](corporate_tablerec.png) |
65
 
66
+ # Commercial usage
 
 
 
 
 
67
 
68
+ The Surya code is licensed under Apache 2.0. The model weights use a modified AI Pubs Open Rail-M license (free for research, personal use, and startups under $5M funding/revenue). For broader commercial licensing of the model weights, visit our pricing page [here](https://www.datalab.to/pricing?utm_source=gh-surya).
 
 
 
69
 
70
+ # Installation
71
+
72
+ Install with:
73
 
74
  ```shell
75
  pip install surya-ocr
 
76
  ```
77
 
78
+ ## Upgrading from Surya v1
79
+
80
+ If you have v1 code, you can migrate to this:
81
+
82
+ ```python
83
+ # v2
84
+ from surya.inference import SuryaInferenceManager
85
+ from surya.recognition import RecognitionPredictor
86
+
87
+ manager = SuryaInferenceManager() # auto-spawns vllm or llama-server
88
+ rec = RecognitionPredictor(manager)
89
+ predictions = rec([image])
90
+ ```
91
+
92
+ What's different:
93
+ - `SuryaInferenceManager` replaces `FoundationPredictor`. Same manager instance is shared across `LayoutPredictor`, `RecognitionPredictor`, `TableRecPredictor`.
94
+ - Output schemas changed: see the per-section JSON tables below. Highlights — `text_lines` → `blocks` (with `html`); layout dropped `top_k`, added `count`; table_rec dropped `is_header` / `colspan` / `rowspan` from cells.
95
+
96
+ # Usage
97
+
98
+ Surya 2 runs layout, OCR, and table recognition through a single VLM served
99
+ by `vllm` (GPU) or `llama.cpp` (CPU / Apple Silicon). The inference manager
100
+ will spawn one for you on first use; you can also point it at an existing
101
+ server via `SURYA_INFERENCE_URL=http://host:port/v1`.
102
+
103
+ - Inspect the settings in `surya/settings.py`. You can override any setting via env var (e.g. `SURYA_INFERENCE_BACKEND=vllm`).
104
+ - Text detection and OCR are separate models.
105
+
106
+ ## Interactive App
107
+
108
+ I've included a streamlit app that lets you interactively try Surya on images or PDF files. Run it with:
109
 
110
  ```shell
111
+ pip install streamlit pdftext
112
  surya_gui
113
  ```
114
 
115
+ ## OCR (text recognition)
116
 
117
+ This command will write out a json file with the detected text and bboxes:
118
+
119
+ ```shell
120
+ surya_ocr DATA_PATH
121
+ ```
122
+
123
+ - `DATA_PATH` can be an image, pdf, or folder of images/pdfs
124
+ - `--images` will save images of the pages and detected blocks (optional)
125
+ - `--output_dir` specifies the directory to save results to instead of the default
126
+ - `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
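The `--page_range` syntax can be illustrated with a small parser. This is a minimal sketch, not Surya's own implementation: `parse_page_range` is a hypothetical helper, and it assumes that ranges such as `5-10` include both endpoints.

```python
def parse_page_range(spec: str) -> list[int]:
    """Expand a spec such as "0,5-10,20" into a sorted list of page indices."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            # Assumption: the end of a range is inclusive.
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_page_range("0,5-10,20"))
```

Under the inclusive-range assumption, the example from the option description expands to pages 0, 5 through 10, and 20.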
127
+
128
+ The `results.json` file contains a dict keyed by input filename (no extension). Each value is a list of page dicts. Each page dict contains:
129
+
130
+ - `blocks` - per-block OCR results in reading order
131
+ - `label` - canonicalized layout label (e.g. `Text`, `SectionHeader`, `Table`, `Equation`, `Picture`, `Form`, `PageHeader`, ...). See `surya/layout/label.py:LAYOUT_PRED_RELABEL` for the full canonical-name set.
132
+ - `raw_label` - original label emitted by the model, before canonicalization
133
+ - `reading_order` - 0-indexed position in layout output
134
+ - `html` - block content as HTML (math wrapped in `<math>...</math>`, tables as `<table>...</table>`, etc.). `""` if the block was skipped
135
+ - `polygon` - 4-corner polygon in `[[x0,y0],[x1,y0],[x1,y1],[x0,y1]]` order
136
+ - `bbox` - axis-aligned `[x0, y0, x1, y1]` derived from the polygon
137
+ - `confidence` - mean per-token probability across the block's decode (0-1)
138
+ - `skipped` - true if the block was a visual label (e.g. Picture) and not OCR'd
139
+ - `error` - true if the block OCR call failed
140
+ - `image_bbox` - `[0, 0, width, height]` for the page image
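As a rough illustration of the schema above, the sketch below walks a `results.json`-shaped dict and keeps only blocks that were actually OCR'd with reasonable confidence. The sample data is hand-written to match the documented fields (it is not real model output), and `usable_blocks` and the `0.7` cutoff are hypothetical choices for this example.

```python
import json

# Hand-written sample shaped like the documented schema; not real model output.
sample = json.loads("""
{
  "invoice": [
    {
      "blocks": [
        {"label": "SectionHeader", "reading_order": 0, "html": "<h1>Invoice</h1>",
         "confidence": 0.98, "skipped": false, "error": false},
        {"label": "Picture", "reading_order": 1, "html": "",
         "confidence": 0.91, "skipped": true, "error": false},
        {"label": "Text", "reading_order": 2, "html": "<p>Total due: 40</p>",
         "confidence": 0.62, "skipped": false, "error": false}
      ],
      "image_bbox": [0, 0, 816, 1056]
    }
  ]
}
""")


def usable_blocks(results, min_confidence=0.7):
    """Yield (filename, page_index, block) for blocks worth keeping."""
    for filename, pages in results.items():
        for page_index, page in enumerate(pages):
            # Sort defensively, even though blocks are documented as ordered.
            for block in sorted(page["blocks"], key=lambda b: b["reading_order"]):
                if block["skipped"] or block["error"]:
                    continue  # visual-only blocks and failed calls carry no text
                if block["confidence"] < min_confidence:
                    continue  # hypothetical cutoff; tune for your documents
                yield filename, page_index, block


kept = list(usable_blocks(sample))
for filename, page_index, block in kept:
    print(filename, page_index, block["label"], block["html"])
```

For a real run, the same function could be applied to the dict loaded from the `results.json` written by `surya_ocr`.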
141
+
142
+ **Performance tips**
143
+
144
+ Throughput is governed by the inference backend, not a `RECOGNITION_BATCH_SIZE` env var. With `vllm`, raise `--max-num-seqs` / `--max-num-batched-tokens` (or `SURYA_INFERENCE_PARALLEL` on the client side) to keep more pages in flight. With `llama.cpp`, set `SURYA_INFERENCE_PARALLEL` to match `--parallel` on `llama-server`.
145
+
146
+ ### From python
147
 
148
  ```python
149
  from PIL import Image
150
  from surya.inference import SuryaInferenceManager
151
  from surya.recognition import RecognitionPredictor
152
 
153
+ manager = SuryaInferenceManager()
154
+ recognition_predictor = RecognitionPredictor(manager)
155
+
156
+ # Default: full-page OCR. One VLM call per page; returns layout + content as
157
+ # HTML <div data-bbox=... data-label=...> blocks.
158
+ predictions = recognition_predictor([Image.open(IMAGE_PATH)])
159
 
160
+ # Block mode: pre-run layout, then per-block OCR. Auto-selected when
161
+ # `layout_results` is passed.
162
+ from surya.layout import LayoutPredictor
163
+ layout = LayoutPredictor(manager)
164
+ layouts = layout([Image.open(IMAGE_PATH)])
165
+ predictions = recognition_predictor([Image.open(IMAGE_PATH)], layouts)
166
+ ```
167
+
168
+
169
+ ## Text line detection
170
+
171
+ This command will write out a json file with the detected bboxes.
172
+
173
+ ```shell
174
+ surya_detect DATA_PATH
175
+ ```
176
+
177
+ - `DATA_PATH` can be an image, pdf, or folder of images/pdfs
178
+ - `--images` will save images of the pages and detected text lines (optional)
179
+ - `--output_dir` specifies the directory to save results to instead of the default
180
+ - `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
181
+
182
+ The `results.json` file will contain a json dictionary where the keys are the input filenames without extensions. Each value will be a list of dictionaries, one per page of the input document. Each page dictionary contains:
183
+
184
+ - `bboxes` - detected bounding boxes for text
185
+ - `bbox` - the axis-aligned rectangle for the text line in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner.
186
+ - `polygon` - the polygon for the text line in (x1, y1), (x2, y2), (x3, y3), (x4, y4) format. The points are in clockwise order from the top left.
187
+ - `confidence` - the confidence of the model in the detected text (0-1)
188
+ - `vertical_lines` - vertical lines detected in the document
189
+ - `bbox` - the axis-aligned line coordinates.
190
+ - `page` - the page number in the file
191
+ - `image_bbox` - the bbox for the image in (x1, y1, x2, y2) format. (x1, y1) is the top left corner, and (x2, y2) is the bottom right corner. All line bboxes will be contained within this bbox.
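The detection output can be post-processed with plain Python. The following is a minimal sketch over a hand-written page dict that follows the fields listed above; `confident_lines`, `line_height`, and the `0.5` cutoff are hypothetical names and values, not part of Surya.

```python
# Hand-written page dict following the documented detection schema.
page = {
    "bboxes": [
        {"bbox": [40, 200, 500, 230], "confidence": 0.95},
        {"bbox": [40, 100, 480, 128], "confidence": 0.97},
        {"bbox": [60, 300, 90, 310], "confidence": 0.35},
    ],
    "image_bbox": [0, 0, 612, 792],
}


def confident_lines(page, min_confidence=0.5):
    """Drop low-confidence lines and sort the rest top-to-bottom, then left-to-right."""
    lines = [b for b in page["bboxes"] if b["confidence"] >= min_confidence]
    return sorted(lines, key=lambda b: (b["bbox"][1], b["bbox"][0]))


def line_height(line):
    """Height of a line from its (x1, y1, x2, y2) bbox."""
    x1, y1, x2, y2 = line["bbox"]
    return y2 - y1


lines = confident_lines(page)
heights = [line_height(line) for line in lines]
print(heights)
```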
192
+
193
+ **Performance tips**
194
+
195
+ Detection is a torch model. By default, `DETECTOR_BATCH_SIZE` is picked automatically at runtime; set the env var to a lower value to limit VRAM usage on GPU, or raise it on larger cards.
196
+
197
+ ### From python
198
+
199
+ ```python
200
+ from PIL import Image
201
+ from surya.detection import DetectionPredictor
202
+
203
+ det_predictor = DetectionPredictor()
204
+ predictions = det_predictor([Image.open(IMAGE_PATH)])
205
+ ```
206
+
207
+ ## Layout and reading order
208
+
209
+ This command will write out a json file with the detected layout and reading order.
210
+
211
+ ```shell
212
+ surya_layout DATA_PATH
213
  ```
214
 
215
+ - `DATA_PATH` can be an image, pdf, or folder of images/pdfs
216
+ - `--images` will save images of the pages and detected text lines (optional)
217
+ - `--output_dir` specifies the directory to save results to instead of the default
218
+ - `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
219
+
220
+ The `results.json` file contains a dict keyed by input filename (no extension). Each value is a list of page dicts. Each page dict contains:
221
+
222
+ - `bboxes` - layout boxes in reading order
223
+ - `polygon` - 4-corner polygon `[[x0,y0],[x1,y0],[x1,y1],[x0,y1]]`
224
+ - `bbox` - axis-aligned `[x0, y0, x1, y1]` derived from the polygon
225
+ - `label` - canonicalized label. One of `Caption`, `Footnote`, `Equation`, `ListGroup`, `PageHeader`, `PageFooter`, `Picture`, `SectionHeader`, `Table`, `Text`, `Figure`, `Code`, `Form`, `TableOfContents`, `ChemicalBlock`, `Diagram`, `Bibliography`, `BlankPage`
226
+ - `raw_label` - original label emitted by the model
227
+ - `position` - 0-indexed reading order
228
+ - `count` - model's token estimate for OCR'ing this block (rounded to multiples of 50; used to size the per-block decode budget)
229
+ - `confidence` - mean per-token probability across the layout decode (0-1)
230
+ - `image_bbox` - `[0, 0, width, height]`
231
+ - `raw` - raw JSON the layout model emitted, for debugging
232
+ - `error` - true if the layout call failed
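To make the layout fields concrete, here is a small sketch that derives an axis-aligned bbox from a 4-corner polygon and pulls out table regions in reading order. The page dict is hand-written to match the schema above, and `polygon_to_bbox` is a hypothetical helper shown for illustration only.

```python
def polygon_to_bbox(polygon):
    """Axis-aligned [x0, y0, x1, y1] enclosing a list of [x, y] corners."""
    xs = [point[0] for point in polygon]
    ys = [point[1] for point in polygon]
    return [min(xs), min(ys), max(xs), max(ys)]


# Hand-written page dict following the documented layout schema.
page = {
    "bboxes": [
        {"polygon": [[50, 400], [550, 400], [550, 700], [50, 700]],
         "label": "Table", "position": 2, "count": 350},
        {"polygon": [[50, 40], [550, 40], [550, 90], [50, 90]],
         "label": "SectionHeader", "position": 0, "count": 50},
        {"polygon": [[50, 100], [550, 100], [550, 380], [50, 380]],
         "label": "Text", "position": 1, "count": 200},
    ],
    "image_bbox": [0, 0, 600, 800],
}

# `position` is the 0-indexed reading order, so sorting on it restores page flow.
ordered = sorted(page["bboxes"], key=lambda box: box["position"])
labels = [box["label"] for box in ordered]
table_boxes = [polygon_to_bbox(box["polygon"]) for box in ordered if box["label"] == "Table"]

print(labels)
print(table_boxes)
```

The resulting table boxes could then be used to crop the page image before table recognition.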
233
+
234
+ **Performance tips**
235
+
236
+ Layout runs through the shared inference backend. Throughput tuning is the same as OCR — see Performance tips above.
237
+
238
+ ### From python
239
 
240
  ```python
241
+ from PIL import Image
242
+ from surya.inference import SuryaInferenceManager
243
  from surya.layout import LayoutPredictor
244
+
245
+ layout_predictor = LayoutPredictor(SuryaInferenceManager())
246
+ layout_predictions = layout_predictor([Image.open(IMAGE_PATH)])
247
  ```
248
 
249
+ ## Table Recognition
250
+
251
+ This command will write out a json file with the detected table cells and row/column ids, along with row/column bounding boxes. If you want to get cell positions and text, along with nice formatting, check out the [marker](https://github.com/datalab-to/marker) repo. You can use the `TableConverter` to detect and extract tables in images and PDFs. It supports output in json (with bboxes), markdown, and html.
252
+
253
+ ```shell
254
+ surya_table DATA_PATH
255
+ ```
256
+
257
+ - `DATA_PATH` can be an image, pdf, or folder of images/pdfs
258
+ - `--images` will save annotated row + column overlays alongside the json (optional)
259
+ - `--output_dir` specifies the directory to save results to instead of the default
260
+ - `--page_range` specifies the page range to process in the PDF, specified as a single number, a comma separated list, a range, or comma separated ranges - example: `0,5-10,20`.
261
+ - `--skip_table_detection` tells table recognition not to detect tables first. Use this if your image is already cropped to a table.
262
+
263
+ The `results.json` file contains a dict keyed by input filename (no extension). Each value is a list of per-table dicts. Each table dict contains:
264
+
265
+ - `rows` - detected table rows in reading order
266
+ - `polygon` / `bbox` - row geometry (same convention as everywhere else)
267
+ - `row_id` - 0-indexed row id
268
+ - `cols` - detected table columns
269
+ - `polygon` / `bbox` - column geometry
270
+ - `col_id` - 0-indexed column id
271
+ - `cells` - geometric row × column intersections (simple mode)
272
+ - `polygon` / `bbox` - cell geometry
273
+ - `row_id`, `col_id`, `cell_id`
274
+ - `html` - full `<table>...</table>` HTML (only populated when `predict_full` is used; handles spanning cells / header rows). `null` in simple mode.
275
+ - `mode` - `"simple"` or `"full"`
276
+ - `image_bbox` - the table crop bbox
277
+ - `error` - true if the table_rec call failed
278
+ - `raw` - raw model output, for debugging
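In simple mode the `cells` list can be arranged into a row-by-column grid using `row_id` and `col_id`. The sketch below assumes a hand-written table dict that follows the schema above; `cells_to_grid` is a hypothetical helper, not a Surya function.

```python
# Hand-written table dict following the documented simple-mode schema.
table = {
    "mode": "simple",
    "html": None,
    "cells": [
        {"row_id": 0, "col_id": 0, "cell_id": 0, "bbox": [0, 0, 100, 20]},
        {"row_id": 0, "col_id": 1, "cell_id": 1, "bbox": [100, 0, 200, 20]},
        {"row_id": 1, "col_id": 0, "cell_id": 2, "bbox": [0, 20, 100, 40]},
        {"row_id": 1, "col_id": 1, "cell_id": 3, "bbox": [100, 20, 200, 40]},
    ],
}


def cells_to_grid(table):
    """Return grid[row][col] -> cell bbox, with None where no cell was emitted."""
    cells = table["cells"]
    if not cells:
        return []
    n_rows = max(cell["row_id"] for cell in cells) + 1
    n_cols = max(cell["col_id"] for cell in cells) + 1
    grid = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        grid[cell["row_id"]][cell["col_id"]] = cell["bbox"]
    return grid


grid = cells_to_grid(table)
print(len(grid), "rows x", len(grid[0]), "cols")
```

Each grid entry is a crop region, so cell text could be obtained by running OCR on those crops.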
279
+
280
+ **Performance tips**
281
+
282
+ Table recognition routes through the shared VLM. Throughput tuning is the same as OCR.
283
+
284
+ ### From python
285
 
286
  ```python
287
+ from PIL import Image
288
+ from surya.inference import SuryaInferenceManager
289
  from surya.table_rec import TableRecPredictor
 
290
 
291
+ table_rec_predictor = TableRecPredictor(SuryaInferenceManager())
292
+
293
+ # Default: rows + columns only, cells derived from intersections.
294
+ table_predictions = table_rec_predictor([Image.open(IMAGE_PATH)])
295
+
296
+ # Or full HTML output (better for spanning cells / headers):
297
+ # table_predictions = table_rec_predictor.predict_full([image])
298
  ```
299
 
300
+ ## Math / equations
301
+
302
+ Surya 2 handles math inline as part of full-page OCR — recognized equations
303
+ come back inside `<math>...</math>` tags in the same HTML output as
304
+ surrounding prose, in KaTeX-compatible LaTeX. No separate LaTeX OCR pass.
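Because equations arrive inside `<math>...</math>` tags, they can be pulled out of a block's `html` with a regular expression. This is a minimal sketch on a hand-written HTML string; `extract_math` is a hypothetical helper, and tolerating attributes on the tag is a defensive assumption rather than documented behavior.

```python
import re

# Non-greedy match so that several formulas in one block stay separate.
MATH_PATTERN = re.compile(r"<math\b[^>]*>(.*?)</math>", re.DOTALL)


def extract_math(html: str) -> list[str]:
    """Return the LaTeX source of every <math> element, in document order."""
    return [latex.strip() for latex in MATH_PATTERN.findall(html)]


# Hand-written example block; not real model output.
html = r"<p>Energy is <math>E = mc^2</math> and area is <math>\int_0^1 x\,dx</math>.</p>"
formulas = extract_math(html)
print(formulas)
```

A regular expression is enough for flat tags like these; nested markup would call for a real HTML parser.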
305
+
306
+ # Inference Backends
307
+
308
+ Layout / OCR / table_rec all share one VLM, served either by `vllm` (GPU) or `llama.cpp` (CPU / Apple Silicon). The `SuryaInferenceManager` will spawn one automatically; you can also point at a pre-running server:
309
+
310
+ ```bash
311
+ # Attach to an existing vllm
312
+ export SURYA_INFERENCE_BACKEND=vllm
313
+ export SURYA_INFERENCE_URL=http://localhost:8000/v1
314
+ ```
315
+
316
+ | Setting | Default | Notes |
317
+ |-----------------------------------|-----------------------------------|--------------------------------------------------------|
318
+ | `SURYA_INFERENCE_BACKEND` | auto (vllm if NVIDIA, else llamacpp) | `vllm` \| `llamacpp` \| unset (auto) |
319
+ | `SURYA_INFERENCE_URL` | (auto-spawn) | Attach to a running OpenAI-compatible server |
320
+ | `SURYA_INFERENCE_PARALLEL` | 8 | Client-side concurrency to the backend |
321
+ | `SURYA_GUIDED_LAYOUT` | true | JSON-schema-constrained layout decode |
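The table above can be read as a set of environment variables with defaults. The sketch below shows one way to resolve them in plain Python. `load_inference_settings` is a hypothetical helper, and the boolean parsing rule is an assumption; Surya's own logic lives in `surya/settings.py` and may differ.

```python
import os


def load_inference_settings(env=None):
    """Resolve the documented settings from an env mapping, applying table defaults."""
    env = os.environ if env is None else env
    backend = env.get("SURYA_INFERENCE_BACKEND") or None  # None means auto-detect
    if backend not in (None, "vllm", "llamacpp"):
        raise ValueError(f"unknown backend: {backend!r}")
    guided = env.get("SURYA_GUIDED_LAYOUT", "true")
    return {
        "backend": backend,
        "url": env.get("SURYA_INFERENCE_URL") or None,  # None means auto-spawn
        "parallel": int(env.get("SURYA_INFERENCE_PARALLEL", "8")),
        # Assumption: common truthy spellings count as enabled.
        "guided_layout": guided.strip().lower() in ("1", "true", "yes"),
    }


# An explicit dict keeps the example deterministic; pass nothing to read os.environ.
print(load_inference_settings({"SURYA_INFERENCE_URL": "http://localhost:8000/v1"}))
```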
322
+
323
+ # Limitations
324
+
325
+ - This is specialized for document OCR. Performance on photos or natural scenes is not the goal.
326
+ - Layout / OCR / table_rec all need a running inference backend (vllm or llama.cpp). Detection runs purely on torch and works without it.
327
+
328
+ ## Troubleshooting
329
+
330
+ If OCR isn't working properly:
331
+
332
+ - Try increasing resolution of the image so the text is bigger. If the resolution is already very high, try decreasing it to no more than a `2048px` width.
333
+ - Preprocessing the image (binarizing, deskewing, etc) can help with very old/blurry images.
334
+ - You can adjust `DETECTOR_BLANK_THRESHOLD` and `DETECTOR_TEXT_THRESHOLD` if you don't get good results. `DETECTOR_BLANK_THRESHOLD` controls the space between lines - any prediction below this number will be considered blank space. `DETECTOR_TEXT_THRESHOLD` controls how text is joined - any number above this is considered text. `DETECTOR_TEXT_THRESHOLD` should always be higher than `DETECTOR_BLANK_THRESHOLD`, and both should be in the 0-1 range. Looking at the heatmap from the debug output of the detector can tell you how to adjust these (if you see faint things that look like boxes, lower the thresholds, and if you see bboxes being joined together, raise the thresholds).
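The resolution advice in the first bullet can be sketched as a size calculation that caps width at 2048 px while preserving aspect ratio. `fit_width` is a hypothetical helper; it only computes the target size, and the resize itself is left to your imaging library.

```python
def fit_width(width: int, height: int, max_width: int = 2048) -> tuple[int, int]:
    """Return a (width, height) no wider than max_width, keeping aspect ratio."""
    if width <= max_width:
        return width, height  # already small enough; never upscale here
    scale = max_width / width
    return max_width, max(1, round(height * scale))


print(fit_width(4096, 3000))
print(fit_width(1200, 1600))

# With Pillow, the result can be passed straight to Image.resize, e.g.:
#   image = image.resize(fit_width(*image.size))
```

Upscaling small images, the other half of that bullet, is deliberately left out here because the right factor depends on the text size in the source image.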
335
+
336
+ # Manual install
337
+
338
+ If you want to develop surya, you can install it manually with [uv](https://docs.astral.sh/uv/):
339
+
340
+ ```bash
341
+ git clone https://github.com/datalab-to/surya.git
342
+ cd surya
343
+ uv sync --group dev # installs runtime + dev deps
344
+ uv run surya_ocr ... # or activate the venv first: source .venv/bin/activate
345
+ ```
346
+
347
+ # Benchmarks
348
+
349
+ Surya 2 is a single VLM that handles layout analysis, OCR (full-page or
350
+ per-block), and table recognition in one model. We evaluate end-to-end on
351
+ [olmOCR-bench](https://huggingface.co/datasets/allenai/olmOCR-bench) — the
352
+ standard quality benchmark for document parsers.
353
+
354
+ ## olmOCR-bench
355
+
356
+ Best-in-class accuracy under 1B parameters; Pareto-optimal against every model at 3B parameters and below.
357
+
358
+ | Model | Params | Score |
359
+ |-----------------------------|----------:|---------:|
360
+ | Infinity-Parser2-Pro | 35.1B | 87.6 |
361
+ | Chandra OCR 2 (Datalab) | 5.3B | 85.9 |
362
+ | dots.mocr | 3.0B | 83.9 |
363
+ | **Surya OCR 2** (Datalab) | **0.65B** | **83.3** |
364
+ | LightOnOCR 2-1B \* | 1.0B | 83.2 |
365
+ | Chandra OCR 1 (Datalab) | 9.0B | 83.1 |
366
+ | olmOCR (anchored) | 8.3B | 77.4 |
367
+ | GOT OCR | 0.6B | 48.3 |
368
+
369
+ \* **LightOnOCR 2-1B** uses a different benchmark methodology than the other entries (see their [release notes](https://huggingface.co/lightonai/LightOnOCR-2-1B)); the score is included for context but is not directly comparable.
370
+
371
+ Comparison scores from the [olmOCR-bench dataset card](https://huggingface.co/datasets/allenai/olmOCR-bench).
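
The Pareto claim can be checked directly from the table. The sketch below is an illustration, not part of the benchmark harness: the `Entry` type and the `dominates` and `pareto_front` helpers are names made up for this example, and the data is copied from the rows at 3B and below. The LightOnOCR 2-1B score is included even though its methodology differs, so its placement here carries the same caveat as in the table.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    name: str
    params_b: float  # parameter count in billions
    score: float     # olmOCR-bench score from the table above


# Rows from the table with 3B parameters or fewer.
SMALL_MODELS = [
    Entry("dots.mocr", 3.0, 83.9),
    Entry("Surya OCR 2", 0.65, 83.3),
    Entry("LightOnOCR 2-1B", 1.0, 83.2),
    Entry("GOT OCR", 0.6, 48.3),
]


def dominates(a: Entry, b: Entry) -> bool:
    """a dominates b if it is no larger and no worse, and strictly better on one axis."""
    no_worse = a.params_b <= b.params_b and a.score >= b.score
    strictly_better = a.params_b < b.params_b or a.score > b.score
    return no_worse and strictly_better


def pareto_front(entries):
    """Keep the entries that no other entry dominates."""
    return [e for e in entries if not any(dominates(o, e) for o in entries)]


front = pareto_front(SMALL_MODELS)
for entry in sorted(front, key=lambda e: e.params_b):
    print(f"{entry.name}: {entry.params_b}B params, score {entry.score}")
```

An entry is on the front when no other model is both at most as large and at least as accurate, which is the sense of "Pareto-optimal" assumed here.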
372
+
373
+ Surya 2, per-source pass rate on the `default` preset (8,413 tests total):
374
+
375
+ | ArXiv | Base | Hdr/Ftr | TinyTxt | MultCol | OldScan | OldMath | Tables |
376
+ |------:|-----:|--------:|--------:|--------:|--------:|--------:|-------:|
377
+ | 88.3 | 99.7 | 92.5 | 93.7 | 82.4 | 41.8 | 81.4 | 86.6 |
378
+
379
+ ## Multilingual
380
+
381
+ We also evaluate Surya 2 against a 91-language internal benchmark covering
382
+ text accuracy, layout, tables, math, and reading order in documents drawn
383
+ from each language.
384
+
385
+ **Overall pass rate: 87.2% across 91 languages.** 38 of the
386
+ 91 languages score ≥ 90%; 76 score ≥ 80%.
387
+
388
+ Scores for 15 widely spoken languages (sorted alphabetically by language):
389
+
390
+ | Code | Language | Score |
391
+ |------|-------------|------:|
392
+ | `ar` | Arabic | 72.7% |
393
+ | `bn` | Bengali | 82.7% |
394
+ | `zh` | Chinese | 82.5% |
395
+ | `en` | English | 92.3% |
396
+ | `fr` | French | 89.3% |
397
+ | `de` | German | 89.7% |
398
+ | `hi` | Hindi | 82.2% |
399
+ | `it` | Italian | 93.0% |
400
+ | `ja` | Japanese | 86.2% |
401
+ | `ko` | Korean | 86.7% |
402
+ | `fa` | Persian | 82.3% |
403
+ | `pt` | Portuguese | 86.1% |
404
+ | `ru` | Russian | 88.8% |
405
+ | `es` | Spanish | 90.7% |
406
+ | `vi` | Vietnamese | 73.2% |
407
+
408
+ See [multilingual.md](https://github.com/datalab-to/surya/blob/main/static/docs/multilingual.md) for the full 91-language table.
409
+
410
  ## Throughput
411
 
412
+ Full-page OCR, 96 DPI input (~2,400 output tokens/page average), measured
413
+ client-side against a running inference server.
414
 
415
  ### RTX 5090 (vllm)
416
 
417
+ `vllm/vllm-openai:v0.20.1`, single RTX 5090 (32 GB).
418
 
419
+ | Concurrency | Pages/s | Tokens/s | p50 (ms) | p95 (ms) | avg tok/page |
420
+ |------------:|--------:|----------:|---:|---:|---:|
421
+ | 128 | 5.35 | 12,884 | 18,915 | 42,538 | 2,410 |
422
 
423
  ### Apple Silicon (llama.cpp / Metal)
424
 
425
  `llama-server` with Metal backend.
426
 
427
+ | `--parallel` | Pages/s | Tokens/s | p50 (ms) | p95 (ms) | avg tok/page | Power |
428
+ |-------------:|---------:|---------:|---:|---:|---:|---:|
429
+ | 8 | 0.108 | 254 | 59,313 | 129,173 | 2,360 | ~30 W |
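
As a sanity check on how the throughput columns relate, tokens/s should be roughly pages/s multiplied by the average tokens per page. The short sketch below redoes that arithmetic for the two rows above; the `ROWS` layout and helper names are made up for this example, and the per-page time it prints is an amortized figure (1 / pages per second), not a request latency.

```python
# (setup, pages per second, reported tokens per second, average tokens per page)
ROWS = [
    ("RTX 5090 (vllm), concurrency 128", 5.35, 12884, 2410),
    ("Apple Silicon (llama.cpp), --parallel 8", 0.108, 254, 2360),
]


def estimated_tokens_per_s(pages_per_s: float, avg_tokens_per_page: float) -> float:
    """Tokens/s implied by the page rate and the average page length."""
    return pages_per_s * avg_tokens_per_page


def relative_gap(estimate: float, reported: float) -> float:
    """Relative difference between the estimate and the reported value."""
    return abs(estimate - reported) / reported


for setup, pages_per_s, reported, avg_tokens in ROWS:
    estimate = estimated_tokens_per_s(pages_per_s, avg_tokens)
    gap = relative_gap(estimate, reported)
    amortized_s = 1.0 / pages_per_s  # wall-clock seconds per page across all slots
    print(
        f"{setup}: estimated {estimate:,.0f} tok/s vs reported {reported:,} "
        f"({gap:.2%} apart), {amortized_s:.2f} s/page amortized"
    )
```

Both rows agree with the reported tokens/s to within about 1%, with the small gaps consistent with rounding in the published figures.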
430
+
431
+ ## Reproducing
432
+
433
+ We score Surya 2 on olmOCR-bench by serving the model with `vllm` (or
434
+ `llama.cpp`) and running the olmOCR-bench harness from
435
+ [allenai/olmocr](https://github.com/allenai/olmocr), with some adjustments applied to account for our output HTML format.
436
+
437
+ # Training
438
+
439
+ Layout, OCR, and table recognition all share a single vision-language model
440
+ (Qwen3.5-style architecture, ~650M params). It's trained on diverse document
441
+ images to emit either a layout JSON or a full-page HTML output, depending on
442
+ prompt. Text-line detection is a separate small torch model — a modified
443
+ EfficientViT segformer trained from scratch on document line annotations.
444
+
445
+ If you want help finetuning Surya on your own data, or to use our managed
446
+ training stack, reach us at hi@datalab.to.
447
+
448
+ # Thanks
449
+
450
+ This work would not have been possible without amazing open source AI work:
451
+
452
+ - [Qwen3-VL](https://huggingface.co/Qwen) from Alibaba
453
+ - [vllm](https://github.com/vllm-project/vllm) and [llama.cpp](https://github.com/ggerganov/llama.cpp) for inference
454
+ - [Segformer](https://arxiv.org/pdf/2105.15203.pdf) from NVIDIA
455
+ - [EfficientViT](https://github.com/mit-han-lab/efficientvit) from MIT
456
+ - [timm](https://github.com/huggingface/pytorch-image-models) from Ross Wightman
457
+ - [transformers](https://github.com/huggingface/transformers) from Hugging Face
458
+ - [CRAFT](https://github.com/clovaai/CRAFT-pytorch), a great scene text detection model
459
 
460
+ Thank you to everyone who makes open source AI possible.
461
 
462
+ # Citation
463
 
464
+ If you use surya (or the associated models) in your work or research, please consider citing us using the following BibTeX entry:
465
 
466
+ ```bibtex
467
+ @misc{paruchuri2025surya,
468
+ author = {Vikas Paruchuri and Datalab Team},
469
+ title = {Surya: A lightweight document OCR and analysis toolkit},
470
+ year = {2025},
471
+ howpublished = {\url{https://github.com/datalab-to/surya}},
472
+ note = {GitHub repository},
473
+ }
+ ```
corporate.png ADDED

Git LFS Details

  • SHA256: 03e5004c5ee8b24c09d81c6b735ff057038d58e3e87d09a5aac68ce1fcd09249
  • Pointer size: 131 Bytes
  • Size of remote file: 168 kB
corporate_layout.png ADDED

Git LFS Details

  • SHA256: c472d16817b87a322f40b31f3689209d729d56e82418558b0185060dfe3d5a2b
  • Pointer size: 131 Bytes
  • Size of remote file: 167 kB
corporate_reading.png ADDED

Git LFS Details

  • SHA256: fa9471e99827a94aa4d6889e05880ff920cca195d8ddd66192e02242fcdedf30
  • Pointer size: 131 Bytes
  • Size of remote file: 168 kB
corporate_tablerec.png ADDED

Git LFS Details

  • SHA256: c1861b87e15f50bad581b46e82b9a38f85b74900d1fdb9088fe780698b57a493
  • Pointer size: 131 Bytes
  • Size of remote file: 163 kB
corporate_text.png ADDED

Git LFS Details

  • SHA256: 893c5924fe66740b77a73afeb0d41f4139814f6cbaed2850e4743793b380b99c
  • Pointer size: 131 Bytes
  • Size of remote file: 166 kB
excerpt_text.png CHANGED

Git LFS Details

  • SHA256: f9beaf7a5c18856da5cfb9d0d614db08d0db1442ede416258a51a1573555eeae
  • Pointer size: 131 Bytes
  • Size of remote file: 337 kB

Git LFS Details

  • SHA256: baddec3a08b69949d737f6ec3322bb70a37cf9acb6681c91ff41a9c2f1f23965
  • Pointer size: 131 Bytes
  • Size of remote file: 543 kB
form.png ADDED

Git LFS Details

  • SHA256: 60fb6bfde4d790cf5c97c11bc8697746e3a49fd6004325d8c25aa3ce1600edc5
  • Pointer size: 131 Bytes
  • Size of remote file: 513 kB
form_layout.png ADDED

Git LFS Details

  • SHA256: 631c09dc21bd05a1a81d981fd33ad6f5d0830d3e750bfae189a355fe4509dff9
  • Pointer size: 131 Bytes
  • Size of remote file: 508 kB
form_reading.png ADDED

Git LFS Details

  • SHA256: e88bac7fb15d375be5d4f115b3558007398952b3bd4cd03ebd84a567be6867a9
  • Pointer size: 131 Bytes
  • Size of remote file: 519 kB
form_tablerec.png ADDED

Git LFS Details

  • SHA256: d799c78d495c19cf6e68e73b48459a16331eb80bc569540e17471f29ab4a7a4f
  • Pointer size: 131 Bytes
  • Size of remote file: 511 kB
form_text.png ADDED

Git LFS Details

  • SHA256: 7c40b723392652b65ab18be779bcb9b6396e65f718b9bdea1071e26e543c67b1
  • Pointer size: 131 Bytes
  • Size of remote file: 330 kB
handwritten.png ADDED

Git LFS Details

  • SHA256: 3c7623a26db006f332d7db8e5c0e21861311352622502698abe8b689c8ab3421
  • Pointer size: 131 Bytes
  • Size of remote file: 178 kB
handwritten_layout.png ADDED

Git LFS Details

  • SHA256: ad0e4ae387b843bda64b51e132b1e5ea8003b5ad49aeddafe9681a0968cd51d7
  • Pointer size: 131 Bytes
  • Size of remote file: 185 kB
handwritten_reading.png ADDED

Git LFS Details

  • SHA256: 5cc6662224cacbb0a26c7890aedffeffa7d12a4592115737c7104fa661b058ac
  • Pointer size: 131 Bytes
  • Size of remote file: 185 kB
handwritten_tablerec.png ADDED

Git LFS Details

  • SHA256: 5e3f7820dc76b4480afe350fa4b7263b87e15615109cae40b3c591d0b5be5785
  • Pointer size: 131 Bytes
  • Size of remote file: 171 kB
handwritten_text.png ADDED

Git LFS Details

  • SHA256: 9686cfab491a74085e70957093bd381b6202a1311269ebd18e210982cd2cf4ab
  • Pointer size: 131 Bytes
  • Size of remote file: 298 kB
newspaper.png ADDED

Git LFS Details

  • SHA256: 1a07a43797b78ffa8db5b4d11e7c2668d58d38a94cfa81a6cdd8d50750b62b9e
  • Pointer size: 132 Bytes
  • Size of remote file: 5.65 MB
newspaper_layout.png ADDED

Git LFS Details

  • SHA256: 139c8fd411526b85d2ff600a1912874ae5f23f6d77c1ec3d9319b73a64dd2e7e
  • Pointer size: 132 Bytes
  • Size of remote file: 5.6 MB
newspaper_reading.png ADDED

Git LFS Details

  • SHA256: c18c2eb0c39de94267f11bd97091b6542f130dd05a93142a5e5f007b9a435253
  • Pointer size: 132 Bytes
  • Size of remote file: 5.65 MB
newspaper_text.png ADDED

Git LFS Details

  • SHA256: 364de91c602c902faf13c33628192efbeb1a28fc1b853265563f7946fddbb271
  • Pointer size: 132 Bytes
  • Size of remote file: 1.88 MB
olmocr_size_chart.png CHANGED

Git LFS Details

  • SHA256: 6a53cfc3b014526de1c054448059ed6b5516b0efbcc59ee1b0eb52743e1c4c8b
  • Pointer size: 131 Bytes
  • Size of remote file: 106 kB

Git LFS Details

  • SHA256: e0833e1547eb9977383d8d18c9a31375a339aacd0a4ad2ceb8597ba37b48126e
  • Pointer size: 130 Bytes
  • Size of remote file: 81.4 kB
textbook.png ADDED

Git LFS Details

  • SHA256: 0070c7f61aae00201f9764bf4a01d6c8ab301718f9f9c62522b9eb24ae0890f7
  • Pointer size: 131 Bytes
  • Size of remote file: 193 kB
textbook_layout.png ADDED

Git LFS Details

  • SHA256: 2711fcd2183f9397306817c083a9377134647ab11521293732031e0e780571da
  • Pointer size: 131 Bytes
  • Size of remote file: 195 kB
textbook_reading.png ADDED

Git LFS Details

  • SHA256: 81d14c7a6d77872fc21a2743f03b7ce4124f6301c97aa7e6de425b73396d2259
  • Pointer size: 131 Bytes
  • Size of remote file: 201 kB
textbook_text.png ADDED

Git LFS Details

  • SHA256: 7950d981fdf94b499d27db1bfad4e62aaf139ba7be44f88bbe1d411729c10d8f
  • Pointer size: 131 Bytes
  • Size of remote file: 226 kB