---
license: mit
language:
  - en
  - ja
  - multilingual
tags:
  - core-ai
  - coreai
  - on-device
  - ocr
  - document-ai
  - vision-language
  - apple
pipeline_tag: image-to-text
base_model: baidu/Unlimited-OCR
library_name: coreai
---

# Unlimited-OCR → Core AI (on-device document OCR)

**On-device document → structured-markdown OCR, end-to-end on Apple Core AI.** A port of
[`baidu/Unlimited-OCR`](https://huggingface.co/baidu/Unlimited-OCR) (3B-A0.5B MoE, MIT): drop a
document image, get back **markdown** — tables as HTML (`<table><tr><td>…`), formulas as **LaTeX**,
reading order, and `<|det|>` layout boxes. Japanese + English + multilingual.

Runs on the **stock `coreai.runtime`** with **no engine patch** — the decoder is driven directly
on `inputs_embeds`, so this is a pure-export port (not the static-input-buffer VLM path).

## What's exciting (why you'd use it)

- **Private OCR**: invoices, receipts, contracts, papers, forms never leave the device.
- **Structured, not just text**: tables → HTML, equations → LaTeX, layout → boxes. RAG-ready ingestion.
- **Flat latency**: a static-shape decode graph (data-driven KV write + fixed-buffer R-SWA mask)
  keeps every tensor shape constant, so the runtime compiles once and decode stays **flat at
  ~12.7 ms/token (~79 tok/s on M4 Max)** — no growing-cache recompilation stalls.
- **SOTA quality**: the source model tops OmniDocBench v1.6 (93.92); this port matches the fp32
  reference (decoder: 0 token flips at the sampled steps; vision encoder: cosine similarity 1.000000).
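
The static-shape idea behind the flat-latency claim can be illustrated in plain NumPy. This is a generic sketch, not the exported graph: `MAX_LEN`, `WINDOW`, `HEAD_DIM`, `kv_write`, and `window_mask` are hypothetical names and sizes, and the exact R-SWA masking rule used by the decoder is not reproduced here. It only shows how a position-selected write plus a mask over a fixed buffer keep every shape constant from step to step.

```python
import numpy as np

MAX_LEN = 16   # hypothetical fixed buffer length
WINDOW = 4     # hypothetical sliding-window size
HEAD_DIM = 8   # hypothetical head dimension


def kv_write(cache: np.ndarray, new_kv: np.ndarray, pos: int) -> np.ndarray:
    """Write one step into a fixed-size buffer via a one-hot position selector.

    The buffer shape never changes; only the selector's values depend on `pos`.
    """
    onehot = (np.arange(cache.shape[0]) == pos)[:, None]   # [MAX_LEN, 1]
    return np.where(onehot, new_kv[None, :], cache)        # [MAX_LEN, HEAD_DIM]


def window_mask(pos: int, max_len: int = MAX_LEN, window: int = WINDOW) -> np.ndarray:
    """Additive mask over the whole buffer: 0 where visible, -inf elsewhere."""
    idx = np.arange(max_len)
    visible = (idx <= pos) & (idx > pos - window)
    return np.where(visible, 0.0, -np.inf).astype(np.float32)


# Simulate six decode steps; the cache and mask shapes are identical at every step.
cache = np.zeros((MAX_LEN, HEAD_DIM), dtype=np.float32)
for step in range(6):
    cache = kv_write(cache, np.full(HEAD_DIM, step + 1, dtype=np.float32), step)
mask = window_mask(pos=5)
```

Because neither `cache` nor `mask` ever changes shape, a runtime that specializes on tensor shapes has nothing to recompile as the sequence grows.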

## Bundles

| path | what | dtype | size |
|---|---|---|---|
| `vision/unlimited_ocr_vision.aimodel` | DeepEncoder (SAM-ViT + CLIP-ViT cascade) → 100 visual tokens | fp16 | 762 MB |
| `decoder/unlimited_ocr_decoder.aimodel` | DeepseekV2 R-SWA MoE decoder, functions **`prefill`** + **`decode`** sharing one weight set + KV state | sym8 | 3.2 GB |
| `assets/embed_tokens.f16` | token embedding table `[129280,1280]` (host row-gather) | fp16 | 316 MB |
| `assets/{image_newline,view_seperator}.f16`, `assets/prompt_input_ids.i32`, `assets/recipe.json` | arrangement constants + the assembly recipe | — | tiny |
| `tokenizer/` | fast tokenizer (`tokenizer.json` + configs) | — | — |
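
The "host row-gather" on the embedding table might look like the sketch below. It assumes `assets/embed_tokens.f16` is a headerless, row-major, little-endian fp16 dump of shape `[129280, 1280]`; the function names `load_embed_table` and `gather_rows` are hypothetical, and `assets/recipe.json` remains the authoritative description of the layout.

```python
import numpy as np

VOCAB, HIDDEN = 129280, 1280  # shape stated in the table above


def load_embed_table(path: str, vocab: int = VOCAB, hidden: int = HIDDEN) -> np.ndarray:
    """Memory-map the raw fp16 table so it is not read into RAM all at once.

    Assumes a headerless row-major dump; adjust if the file carries a header.
    """
    return np.memmap(path, dtype=np.float16, mode="r", shape=(vocab, hidden))


def gather_rows(table: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """Look up one embedding row per token id: [T] -> [T, hidden]."""
    return np.asarray(table[token_ids])


# Size check: rows x cols x 2 bytes per fp16 value.
table_bytes = VOCAB * HIDDEN * 2
table_mib = table_bytes / 2**20
```

The size check gives about 315.6 MiB, which lines up with the 316 MB listed for the file and is consistent with the headerless assumption.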

## Pipeline (Base mode, 640px)

```
image → preprocess (pad to 640², normalize mean=std=0.5)
      → vision .aimodel                         → visual tokens [1,100,1280]
      → arrange (10×10 + image_newline per row + view_seperator) → [111,1280]
      → scatter into embed_tokens(prompt_ids)   → prefix [1,115,1280]
      → decoder: prefill(prefix) + greedy decode (no_repeat_ngram=35) → tokens
      → detokenize (keep special tokens)        → markdown
```
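
The arrange and scatter steps above can be sketched in NumPy to make the `111` and `115` token counts concrete. Several things here are assumptions: the placeholder id `IMAGE_TOKEN_ID`, the function names `arrange` and `build_prefix`, and the row-major ordering of the 10×10 grid. The real placeholder id and prompt ids come from `assets/prompt_input_ids.i32` and `assets/recipe.json`.

```python
import numpy as np

HIDDEN = 1280
GRID = 10              # 10 x 10 visual tokens in Base mode
IMAGE_TOKEN_ID = -1    # hypothetical placeholder id; take the real one from the recipe


def arrange(visual: np.ndarray, image_newline: np.ndarray,
            view_seperator: np.ndarray) -> np.ndarray:
    """[100, H] visual tokens -> [111, H].

    Each grid row of 10 tokens gets one trailing newline embedding (10 x 11 = 110),
    then a single view separator closes the sequence (110 + 1 = 111).
    The spelling `view_seperator` follows the asset file name.
    """
    h = visual.shape[-1]
    rows = visual.reshape(GRID, GRID, h)                         # [10, 10, H]
    newline = np.broadcast_to(image_newline, (GRID, 1, h))       # [10, 1, H]
    flat = np.concatenate([rows, newline], axis=1).reshape(-1, h)  # [110, H]
    return np.concatenate([flat, view_seperator[None, :]], axis=0)  # [111, H]


def build_prefix(text_embeds: np.ndarray, prompt_ids: np.ndarray,
                 image_embeds: np.ndarray) -> np.ndarray:
    """Overwrite placeholder positions of the prompt embeddings with image embeddings."""
    slots = prompt_ids == IMAGE_TOKEN_ID
    if int(slots.sum()) != image_embeds.shape[0]:
        raise ValueError("placeholder count does not match image token count")
    prefix = text_embeds.copy()
    prefix[slots] = image_embeds          # boolean-mask scatter, in sequence order
    return prefix[None, ...]              # [1, T, H]
```

With 111 placeholder positions and 4 remaining prompt tokens, `build_prefix` returns the `[1,115,1280]` prefix that `prefill` consumes.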

The exact, verified recipe is in `assets/recipe.json`. Reference implementations (Python end-to-end
+ a macOS app, **CoreAIOCR**, driving the stock runtime) are in the
[Core AI Model Zoo](https://github.com/john-rocky/coreai-model-zoo): `conversion/unlimited_ocr/` and
`apps/CoreAIOCR/`.
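
For readers who want to see what the `no_repeat_ngram=35` greedy step amounts to, here is a minimal sketch of the standard no-repeat n-gram rule: a token is banned if emitting it would reproduce an n-gram that already occurs in the sequence. The helper names `banned_tokens` and `greedy_pick` are hypothetical, and the reference implementation may add details (such as a look-back window) that are not shown here.

```python
import numpy as np


def banned_tokens(tokens: list[int], n: int) -> set[int]:
    """Tokens that would complete an n-gram already present in `tokens`."""
    if n <= 0 or len(tokens) < n:
        return set()
    # The last n-1 tokens form the start of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - (n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


def greedy_pick(logits: np.ndarray, tokens: list[int], n: int = 35) -> int:
    """Greedy argmax over the logits after masking out banned tokens."""
    masked = logits.astype(np.float32, copy=True)
    ban = list(banned_tokens(tokens, n))
    if ban:
        masked[ban] = -np.inf
    return int(np.argmax(masked))
```

A window as long as 35 tokens leaves ordinary repeated phrases and table markup alone and only intervenes when the decoder starts looping over a long span.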

## Notes

- **Appropriate input**: clean single-page documents (invoice / paper / report / table / formula),
  roughly square or portrait, with text still legible when fit to 640². Very dense small-text scans
  (such as newspapers) need the tiled `crop_mode` vision export, which is not included here (this
  release is Base mode only).
- Prompt is fixed to `document parsing` (layout + structured extraction).
- License: **MIT** (inherited from `baidu/Unlimited-OCR`).
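
The "fit to 640²" preprocessing mentioned in the first note (pad to 640², normalize with mean = std = 0.5) could look roughly like this. The sketch assumes the image has already been resized so that both sides are at most 640, that padding is centered with mid-gray (`FILL = 127`), and that the vision bundle expects a channels-first `[1, 3, 640, 640]` float tensor; all three are assumptions to check against `assets/recipe.json`. The name `pad_and_normalize` is hypothetical.

```python
import numpy as np

SIZE = 640
FILL = 127  # assumed pad value: mid-gray, which lands near 0 after normalization


def pad_and_normalize(rgb: np.ndarray) -> np.ndarray:
    """uint8 [H, W, 3] with H, W <= 640 -> float32 [1, 3, 640, 640] in about [-1, 1]."""
    h, w, _ = rgb.shape
    if h > SIZE or w > SIZE:
        raise ValueError("resize the image to fit within 640 x 640 first")
    canvas = np.full((SIZE, SIZE, 3), FILL, dtype=np.uint8)
    top, left = (SIZE - h) // 2, (SIZE - w) // 2
    canvas[top:top + h, left:left + w] = rgb      # centered paste (assumed)
    x = canvas.astype(np.float32) / 255.0         # scale to [0, 1]
    x = (x - 0.5) / 0.5                           # mean = std = 0.5
    return x.transpose(2, 0, 1)[None, ...]        # HWC -> NCHW
```

Since a page is scaled down to fit a 640-pixel square before padding, small print on a dense scan can shrink below legibility, which is why the note above steers such inputs toward a tiled export.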

*Community port, not affiliated with Apple or Baidu.*