---
license: mit
language:
- en
- ja
- multilingual
tags:
- core-ai
- coreai
- on-device
- ocr
- document-ai
- vision-language
- apple
pipeline_tag: image-to-text
base_model: baidu/Unlimited-OCR
library_name: coreai
---

# Unlimited-OCR → Core AI (on-device document OCR)

**On-device document → structured-markdown OCR, end-to-end on Apple Core AI.** A port of [`baidu/Unlimited-OCR`](https://huggingface.co/baidu/Unlimited-OCR) (3B-A0.5B MoE, MIT): drop a document image, get back **markdown**, with tables as HTML (`…`), formulas as **LaTeX**, reading order, and `<|det|>` layout boxes. Japanese + English + multilingual.

Runs on the **stock `coreai.runtime`** with **no engine patch**: the decoder is driven directly on `inputs_embeds`, so this is a pure-export port (not the static-input-buffer VLM path).

## What's exciting (why you'd use it)

- **Private OCR**: invoices, receipts, contracts, papers, forms never leave the device.
- **Structured, not just text**: tables → HTML, equations → LaTeX, layout → boxes. RAG-ready ingestion.
- **Flat latency**: a static-shape decode graph (data-driven KV write + fixed-buffer R-SWA mask) keeps every tensor shape constant, so the runtime compiles once and decode stays **flat at ~12.7 ms/token (~79 tok/s on M4 Max)**, with no growing-cache recompilation stalls.
- **SOTA quality**: the source model tops OmniDocBench v1.6 (93.92); this port is byte-faithful to the fp32 reference (decoder: 0 flips at the sampled steps; vision encoder: cosine similarity 1.000000).

## Bundles

| path | what | dtype | size |
|---|---|---|---|
| `vision/unlimited_ocr_vision.aimodel` | DeepEncoder (SAM-ViT + CLIP-ViT cascade) → 100 visual tokens | fp16 | 762 MB |
| `decoder/unlimited_ocr_decoder.aimodel` | DeepseekV2 R-SWA MoE decoder, functions **`prefill`** + **`decode`** sharing one weight set + KV state | sym8 | 3.2 GB |
| `assets/embed_tokens.f16` | token embedding table `[129280,1280]` (host row-gather) | fp16 | 316 MB |
| `assets/{image_newline,view_seperator}.f16`, `assets/prompt_input_ids.i32`, `assets/recipe.json` | arrangement constants + the assembly recipe | - | tiny |
| `tokenizer/` | fast tokenizer (`tokenizer.json` + configs) | - | - |

## Pipeline (Base mode, 640px)

```
image → preprocess (pad to 640², normalize mean=std=0.5)
      → vision .aimodel → visual tokens [1,100,1280]
      → arrange (10×10 + image_newline per row + view_seperator) → [111,1280]
      → scatter into embed_tokens(prompt_ids) → prefix [1,115,1280]
      → decoder: prefill(prefix) + greedy decode (no_repeat_ngram=35) → tokens
      → detokenize (keep special tokens) → markdown
```

The exact, verified recipe is in `assets/recipe.json`. Reference implementations (Python end-to-end + a macOS app, **CoreAIOCR**, driving the stock runtime) are in the [Core AI Model Zoo](https://github.com/john-rocky/coreai-model-zoo): `conversion/unlimited_ocr/` and `apps/CoreAIOCR/`.

## Notes

- **Appropriate input**: clean single-page documents (invoice / paper / report / table / formula), roughly square or portrait, with text still legible when fit to 640². Very dense small-text scans (newspaper) need the tiled `crop_mode` vision export (not included here; Base mode only).
- Prompt is fixed to `document parsing` (layout + structured extraction).
- License: **MIT** (inherited from `baidu/Unlimited-OCR`).

*Community port, not affiliated with Apple or baidu.*
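
### Sketch: preprocessing geometry

The preprocess step in the pipeline (pad to 640², normalize with mean = std = 0.5) can be illustrated with plain arithmetic. This is a minimal sketch, not the port's implementation: the function names are hypothetical, and centering the resized page inside the square is an assumption. The pad colour and resampling filter are not stated here, so take them from `assets/recipe.json`.

```python
def fit_into_square(width, height, side=640):
    """Return (new_w, new_h, left, top) for an aspect-preserving fit.

    Assumption: the resized page is centered inside the side x side canvas.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = side / max(width, height)
    new_w = max(1, min(side, round(width * scale)))
    new_h = max(1, min(side, round(height * scale)))
    left = (side - new_w) // 2
    top = (side - new_h) // 2
    return new_w, new_h, left, top


def normalize(value_8bit, mean=0.5, std=0.5):
    """Map an 8-bit channel value to the range produced by mean=std=0.5."""
    return (value_8bit / 255.0 - mean) / std


# A portrait scan of 1240 x 1754 pixels is limited by its height.
box = fit_into_square(1240, 1754)
print(box)
print(normalize(0), normalize(255))
```

With mean = std = 0.5, black maps to -1.0 and white to 1.0, so the padded area takes whatever value the chosen pad colour normalizes to.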
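
### Sketch: arranging and scattering visual tokens

The arrange and scatter steps explain where the `[111,1280]` and `[1,115,1280]` shapes come from: 100 grid tokens, one `image_newline` per row (10), and one `view_seperator`. The sketch below is a shape-level illustration under stated assumptions: the newline is assumed to close each row and the separator to come last, and `IMAGE_TOKEN_ID` plus the demo prompt layout are hypothetical stand-ins for the real `assets/prompt_input_ids.i32`. It uses 2-wide vectors instead of 1280-wide ones to stay small.

```python
GRID = 10  # Base mode: 10 x 10 visual tokens


def arrange(visual_rows, image_newline, view_separator, grid=GRID):
    """Interleave grid rows with a newline vector, then add one separator."""
    if len(visual_rows) != grid * grid:
        raise ValueError(f"expected {grid * grid} tokens, got {len(visual_rows)}")
    arranged = []
    for r in range(grid):
        arranged.extend(visual_rows[r * grid:(r + 1) * grid])
        arranged.append(image_newline)   # assumed: newline closes each row
    arranged.append(view_separator)      # assumed: separator comes last
    return arranged


def scatter(text_embeds, prompt_ids, image_rows, image_token_id):
    """Overwrite each placeholder position with the next image row."""
    slots = [i for i, tid in enumerate(prompt_ids) if tid == image_token_id]
    if len(slots) != len(image_rows):
        raise ValueError(f"{len(slots)} placeholders for {len(image_rows)} rows")
    prefix = list(text_embeds)
    for slot, row in zip(slots, image_rows):
        prefix[slot] = row
    return prefix


IMAGE_TOKEN_ID = -1  # hypothetical sentinel, not the real placeholder id
visual = [[float(i), 0.0] for i in range(GRID * GRID)]
newline = [-1.0, -1.0]
separator = [-2.0, -2.0]

arranged = arrange(visual, newline, separator)
# Hypothetical prompt layout: one leading token, 111 placeholders, three trailing.
prompt_ids = [0] + [IMAGE_TOKEN_ID] * len(arranged) + [1, 2, 3]
text_embeds = [[0.5, 0.5] for _ in prompt_ids]
prefix = scatter(text_embeds, prompt_ids, arranged, IMAGE_TOKEN_ID)
print(len(arranged), len(prefix))  # 111 115
```

Checking the placeholder count before writing keeps a mismatched prompt from silently producing a short or misaligned prefix.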
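
### Sketch: host row-gather from the embedding table

The bundle table lists `assets/embed_tokens.f16` as a `[129280,1280]` fp16 table gathered on the host. A minimal sketch of such a gather follows, assuming the file is a headerless, row-major, little-endian array of IEEE half floats; that layout is an assumption, and `gather_rows` is a hypothetical helper. The demo reads from an in-memory buffer rather than the real file.

```python
import io
import struct

VOCAB = 129280       # rows, from the bundle table
HIDDEN = 1280        # columns, from the bundle table
BYTES_PER_VALUE = 2  # fp16


def gather_rows(table, token_ids, hidden=HIDDEN):
    """Read one fp16 row per token id from a seekable binary file object."""
    row_bytes = hidden * BYTES_PER_VALUE
    fmt = f"<{hidden}e"  # assumed layout: little-endian half floats
    rows = []
    for tid in token_ids:
        if tid < 0:
            raise IndexError(f"token id {tid} is negative")
        table.seek(tid * row_bytes)
        chunk = table.read(row_bytes)
        if len(chunk) != row_bytes:
            raise IndexError(f"token id {tid} is outside the table")
        rows.append(list(struct.unpack(fmt, chunk)))
    return rows


# Size check: 129280 * 1280 * 2 bytes, about 316 MiB.
table_bytes = VOCAB * HIDDEN * BYTES_PER_VALUE
print(table_bytes, round(table_bytes / 2**20))

# Demo on a tiny 4 x 3 in-memory table holding the values 0..11.
demo = io.BytesIO(struct.pack("<12e", *[float(i) for i in range(12)]))
rows = gather_rows(demo, [2, 0], hidden=3)
print(rows)
```

Reading rows on demand keeps host memory use proportional to the prompt length instead of loading the whole table.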
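
### Sketch: greedy decoding with n-gram blocking

The decode step uses greedy selection with `no_repeat_ngram=35`. The sketch below shows the general technique: before taking the argmax, ban every token that would complete an n-gram already present in the output. It checks the full history, which is an assumption; the reference implementation may restrict the check to a window or differ in other details. `banned_tokens` and `greedy_pick` are hypothetical names.

```python
def banned_tokens(history, n):
    """Tokens that would complete an n-gram already seen in history."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(history) < n:
        return set()  # no complete n-gram exists yet
    tail = history[len(history) - (n - 1):]  # last n-1 tokens
    banned = set()
    for i in range(len(history) - n + 1):
        if history[i:i + n - 1] == tail:
            banned.add(history[i + n - 1])
    return banned


def greedy_pick(logits, history, n=35):
    """Argmax over the tokens that n-gram blocking still allows."""
    banned = banned_tokens(history, n)
    allowed = [t for t in range(len(logits)) if t not in banned]
    if not allowed:
        raise ValueError("every token is banned")
    return max(allowed, key=lambda t: logits[t])


# Toy vocabulary of 8 tokens: token 7 scores highest, token 3 second.
logits = [0.0] * 8
logits[7] = 1.0
logits[3] = 0.9
history = [5, 7, 5]

blocked = banned_tokens(history, 2)              # {7}: "5 7" already occurred
pick_small_n = greedy_pick(logits, history, 2)   # falls back to token 3
pick_large_n = greedy_pick(logits, history, 35)  # history too short, picks 7
print(blocked, pick_small_n, pick_large_n)
```

A large n such as 35 leaves ordinary repetition in tables and formulas untouched and only interrupts long verbatim loops.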