Unlimited-OCR — GGUF

GGUF quantizations of baidu/Unlimited-OCR, a 3B vision-language OCR model that extends DeepSeek-OCR to one-shot, long-horizon document parsing. This repo contains a range of K-quants and i-quants of the language model, plus the vision projector (mmproj) needed for image input.

⚠️ Requires a DeepSeek-OCR–aware llama.cpp build (PR #17400). Unlimited-OCR uses the DeepSeek-OCR architecture (a SAM+CLIP DeepEncoder vision tower with a DeepSeek-V2 MoE text decoder). Support is not yet merged into upstream main — stock llama.cpp will not load these files. Build the PR branch (instructions below).

Files

Every run needs two files: one language model GGUF (pick a quant) plus the shared vision projector. The projector is fp16 and identical for all quants.

File                        Quant    Bits  Size      Notes
Unlimited-OCR-BF16.gguf     BF16     16    5.47 GiB  Full-precision conversion. The base every quant is made from; reference quality.
Unlimited-OCR-Q8_0.gguf     Q8_0     8     2.91 GiB  Near-lossless. Best quality short of BF16; recommended if you have the disk/RAM.
Unlimited-OCR-Q6_K.gguf     Q6_K     6     2.43 GiB  Very high quality, essentially indistinguishable from Q8_0 for OCR.
Unlimited-OCR-Q5_K_M.gguf   Q5_K_M   5     2.07 GiB  High quality. Great balance when you can spare a bit more than Q4.
Unlimited-OCR-Q5_K_S.gguf   Q5_K_S   5     1.95 GiB  High quality, slightly smaller than Q5_K_M.
Unlimited-OCR-Q4_K_M.gguf   Q4_K_M   4     1.82 GiB  Recommended default: best overall size/quality trade-off.
Unlimited-OCR-Q4_K_S.gguf   Q4_K_S   4     1.68 GiB  Slightly smaller than Q4_K_M with a small quality cost.
Unlimited-OCR-Q3_K_M.gguf   Q3_K_M   3     1.45 GiB  Compact. Usable when memory is tight; some quality loss.
Unlimited-OCR-IQ4_XS.gguf   IQ4_XS   4     1.53 GiB  i-quant: smaller than Q4_K_S at similar quality (built with imatrix).
Unlimited-OCR-IQ4_NL.gguf   IQ4_NL   4     1.59 GiB  i-quant (non-linear): 4-bit tuned for ARM/edge; good on Jetson/Apple.
Unlimited-OCR-IQ3_M.gguf    IQ3_M    3     1.35 GiB  i-quant: solid 3-bit quality for the size (imatrix).
Unlimited-OCR-IQ3_XXS.gguf  IQ3_XXS  3     1.24 GiB  i-quant: very small 3-bit; noticeable quality loss but runnable.
Unlimited-OCR-IQ2_M.gguf    IQ2_M    2     1.15 GiB  i-quant: smallest here; experimental, lowest quality. For tight memory only.

Vision projector (required for all of the above):

File                           Type  Size
mmproj-Unlimited-OCR-F16.gguf  F16   774.27 MiB

Sizes are the on-disk GGUF sizes. The vision encoder is kept at F16 (not quantized) — it is small and quantizing it hurts OCR accuracy. i-quants were built with an importance matrix (imatrix) computed from a general-text calibration set.
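Because every run needs a language model file plus the projector, the download total is the sum of the two sizes. The following is a minimal sketch of that arithmetic, using the on-disk sizes listed in this README. The function names are hypothetical, and picking the largest file that fits is only a rough proxy for quality. It also covers files on disk only, not runtime memory such as the KV cache.

```python
from typing import Optional

# On-disk sizes in GiB, copied from the file list in this README.
QUANT_SIZES_GIB = {
    "BF16": 5.47, "Q8_0": 2.91, "Q6_K": 2.43, "Q5_K_M": 2.07, "Q5_K_S": 1.95,
    "Q4_K_M": 1.82, "Q4_K_S": 1.68, "IQ4_NL": 1.59, "IQ4_XS": 1.53,
    "Q3_K_M": 1.45, "IQ3_M": 1.35, "IQ3_XXS": 1.24, "IQ2_M": 1.15,
}
MMPROJ_GIB = 774.27 / 1024  # the projector is listed in MiB; convert to GiB


def total_download_gib(quant: str) -> float:
    """Size of one language model quant plus the shared vision projector."""
    return QUANT_SIZES_GIB[quant] + MMPROJ_GIB


def largest_quant_within(budget_gib: float) -> Optional[str]:
    """Largest quant whose two-file total fits the budget, or None if none fit."""
    fitting = [q for q in QUANT_SIZES_GIB if total_download_gib(q) <= budget_gib]
    return max(fitting, key=QUANT_SIZES_GIB.get, default=None)


if __name__ == "__main__":
    print(f"Q4_K_M + mmproj: {total_download_gib('Q4_K_M'):.2f} GiB")
    print("Largest pair within 2.6 GiB:", largest_quant_within(2.6))
```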

Build llama.cpp with DeepSeek-OCR support

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/17400/head:pr17400 && git checkout pr17400
cmake -B build -DCMAKE_BUILD_TYPE=Release        # add -DGGML_CUDA=ON for NVIDIA
cmake --build build -j --target llama-mtmd-cli llama-server

Quick start

Download one quant + the projector (you always need both):

huggingface-cli download sahilchachra/Unlimited-OCR-GGUF \
  --include "Unlimited-OCR-Q4_K_M.gguf" "mmproj-Unlimited-OCR-F16.gguf" --local-dir ./uocr

Run it on an image:

./build/bin/llama-mtmd-cli \
  -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --image document.png \
  -p "<|grounding|>Convert the document to markdown." \
  --chat-template deepseek-ocr --temp 0

--chat-template deepseek-ocr and --mmproj are required. With --image, the image is injected automatically — you do not need to type a literal <image> token in -p. Use --temp 0 for OCR (deterministic). Add -n 4096 (or more) for long/dense documents.
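For scripting, the same invocation can be assembled as an argument list and handed to a subprocess. This is a sketch only: build_ocr_command is a hypothetical helper, the default binary path assumes the build layout used in this README, and the flags are exactly the ones shown in the command above.

```python
import shlex
from typing import List


def build_ocr_command(
    model: str,
    mmproj: str,
    image: str,
    prompt: str,
    n_predict: int = 4096,
    binary: str = "./build/bin/llama-mtmd-cli",  # assumed build location
) -> List[str]:
    """Return the llama-mtmd-cli argument list for one deterministic OCR run."""
    return [
        binary,
        "-m", model,
        "--mmproj", mmproj,
        "--image", image,
        "-p", prompt,
        "--chat-template", "deepseek-ocr",
        "--temp", "0",
        "-n", str(n_predict),
    ]


if __name__ == "__main__":
    cmd = build_ocr_command(
        "./uocr/Unlimited-OCR-Q4_K_M.gguf",
        "./uocr/mmproj-Unlimited-OCR-F16.gguf",
        "document.png",
        "<|grounding|>Convert the document to markdown.",
    )
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd, capture_output=True, text=True) avoids shell quoting problems with the <|...|> tokens in the prompt.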


Prompting guide

Unlimited-OCR uses the DeepSeek-OCR prompt vocabulary. The prompt is just an instruction; prefix it with <|grounding|> whenever you also want bounding boxes for what was read.

Task                                            Prompt (-p)
Document → Markdown (layout-aware, with boxes)  <|grounding|>Convert the document to markdown.
Plain text OCR (just the text, no layout)       Free OCR.
OCR with bounding boxes                         <|grounding|>…
Native Unlimited-OCR parse                      document parsing.
Parse a figure / chart / diagram                Parse the figure.
Describe the image (general VQA)                Describe this image in detail.
Find specific text (referring grounding)        <|grounding|>Locate <|ref|>Invoice Number<|/ref|> in the image.
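The two prompt patterns this README spells out in full, the <|grounding|> prefix and the <|ref|> locate form, are easy to generate programmatically. The helper names below are hypothetical and the sketch only covers those two patterns.

```python
GROUNDING = "<|grounding|>"


def with_grounding(instruction: str) -> str:
    """Prefix an instruction with <|grounding|> unless it already has it."""
    if instruction.startswith(GROUNDING):
        return instruction
    return GROUNDING + instruction


def locate_prompt(text: str) -> str:
    """Build a referring-grounding prompt for one target string."""
    return f"{GROUNDING}Locate <|ref|>{text}<|/ref|> in the image."


if __name__ == "__main__":
    print(with_grounding("Convert the document to markdown."))
    print(locate_prompt("Invoice Number"))
```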

Worked examples

1) Document → clean Markdown (tables, headings, reading order):

./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf --chat-template deepseek-ocr \
  --image invoice.png --temp 0 -n 4096 \
  -p "<|grounding|>Convert the document to markdown."

2) Just the raw text, no layout / no boxes:

./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf --chat-template deepseek-ocr \
  --image receipt.jpg --temp 0 -p "Free OCR."

3) Locate a specific string and get its box:

./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf --chat-template deepseek-ocr \
  --image form.png --temp 0 \
  -p "<|grounding|>Locate <|ref|>Invoice Number<|/ref|> in the image."

Understanding the output (grounding tokens)

With <|grounding|>, the model interleaves the recognized text with detection boxes:

<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623
<|det|>text  [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra
<|det|>text  [37, 483, 329, 543]<|/det|>Total Due: $44.00

Each [x1, y1, x2, y2] is the bounding box (top-left → bottom-right) of that span, in the coordinate space of the model's input image. Drop the <|det|>...<|/det|> tags if you only want the text, or parse them to overlay boxes / build a layout. Without <|grounding|> you get plain text (or Markdown) with no box tags.
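A minimal sketch of both options (parsing the boxes, or dropping the tags) follows. It assumes the output looks exactly like the sample above: one <|det|>label [x1, y1, x2, y2]<|/det|> tag followed by the text up to the end of the line. Real output may vary, so treat the regular expression and the names Span, parse_grounded, and strip_det_tags as illustrative.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    label: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2), top-left to bottom-right
    text: str


# Matches: <|det|>label [x1, y1, x2, y2]<|/det|>text-to-end-of-line
DET_RE = re.compile(
    r"<\|det\|>\s*(\w+)\s*"
    r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]\s*"
    r"<\|/det\|>([^\n]*)"
)


def parse_grounded(output: str) -> List[Span]:
    """Extract (label, box, text) for every det-tagged span in the output."""
    spans = []
    for m in DET_RE.finditer(output):
        box = tuple(int(m.group(i)) for i in range(2, 6))
        spans.append(Span(m.group(1), box, m.group(6).strip()))
    return spans


def strip_det_tags(output: str) -> str:
    """Remove the <|det|>...<|/det|> tags and keep only the recognized text."""
    return re.sub(r"<\|det\|>.*?<\|/det\|>", "", output)


if __name__ == "__main__":
    sample = (
        "<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623\n"
        "<|det|>text  [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra\n"
    )
    for span in parse_grounded(sample):
        print(span.label, span.box, span.text)
    print(strip_det_tags(sample))
```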

Tip — long documents: Unlimited-OCR targets one-shot long-horizon parsing. For multi-page scans, run page-by-page and concatenate. If output ever repeats/loops on a dense page, add a mild repetition penalty, e.g. --repeat-penalty 1.05, and keep --temp 0.
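The page-by-page tip can be sketched as a small driver that retries a page with the repetition penalty only when the output looks stuck. Everything here is an assumption for illustration: run_page is a caller-supplied function (for example a wrapper around llama-mtmd-cli that adds --repeat-penalty when a value is given), and the loop check is a simple heuristic, not something the model or llama.cpp provides.

```python
from typing import Callable, Iterable, List, Optional


def looks_looped(text: str, min_repeats: int = 5) -> bool:
    """Heuristic: True if one non-empty line appears min_repeats times in a row."""
    run, prev = 0, None
    for line in text.splitlines():
        line = line.strip()
        if line and line == prev:
            run += 1
            if run >= min_repeats:
                return True
        else:
            run = 1 if line else 0
            prev = line if line else None
    return False


def ocr_pages(
    pages: Iterable[str],
    run_page: Callable[[str, Optional[float]], str],
    retry_penalty: float = 1.05,
) -> str:
    """OCR each page in order and join the results with blank lines.

    run_page(page, None) runs without a repetition penalty; a page whose
    output looks looped is re-run once with retry_penalty.
    """
    results: List[str] = []
    for page in pages:
        text = run_page(page, None)
        if looks_looped(text):
            text = run_page(page, retry_penalty)
        results.append(text.strip())
    return "\n\n".join(results)
```

Keeping the runner injectable means the same driver works with the CLI or with the HTTP server.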


Serving (OpenAI-compatible API)

./build/bin/llama-server \
  -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --chat-template deepseek-ocr -c 8192 --host 0.0.0.0 --port 8080

Call it with an image (base64 data URL):

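# -w0 disables line wrapping (GNU coreutils); on macOS use: IMG=$(base64 -i document.png | tr -d '\n')
IMG=$(base64 -w0 document.png)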
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "temperature": 0,
  "messages": [{ "role": "user", "content": [
    { "type": "text", "text": "<|grounding|>Convert the document to markdown." },
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,'"$IMG"'" } }
  ]}]
}'

Python (OpenAI SDK) is identical — point base_url at http://localhost:8080/v1, send a text part with the prompt above and an image_url part with the data URL.
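As a dependency-free illustration of the same request, the sketch below builds the payload and posts it with the standard library instead of the OpenAI SDK. The function names are hypothetical, and the response parsing assumes the usual OpenAI-style choices[0].message.content shape.

```python
import base64
import json
import urllib.request


def build_chat_request(image_bytes: bytes, prompt: str, mime: str = "image/png") -> dict:
    """Build the chat-completions body: one text part plus one base64 image part."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "temperature": 0,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}},
            ],
        }],
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8080/v1") -> str:
    """POST the payload to the server and return the generated text."""
    req = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    # Assumes an OpenAI-style response body.
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    with open("document.png", "rb") as f:
        payload = build_chat_request(f.read(), "<|grounding|>Convert the document to markdown.")
    print(post_chat(payload))
```

The same messages list can be passed unchanged to the OpenAI SDK's chat.completions.create once base_url points at the server.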

About the model

  • Architecture: DeepseekOCRForCausalLM. DeepEncoder vision tower (SAM-ViT-B + CLIP-L/14, 1024×1024 input, 16× downsample) → linear projector → DeepSeek-V2 MoE text decoder (12 layers, hidden size 1280, 64 routed + 2 shared experts, 6 experts/token).
  • Task: multilingual OCR / document parsing — single image, multi-page, and PDF (one-shot long-horizon parsing). The original supports gundam (crop) and base resolution modes.
  • License: MIT (inherited from the base model).
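To get a feel for what the encoder figures imply, here is a rough token-count calculation. It rests on two assumptions that this README does not state: a 16×16 pixel patch size for the SAM-ViT-B stage, and that "16× downsample" means the number of patch tokens is reduced by a factor of 16. Under those assumptions a 1024×1024 input yields 256 vision tokens.

```python
def vision_tokens(image_side: int = 1024, patch_size: int = 16, compression: int = 16) -> int:
    """Estimate vision tokens for a square input.

    patch_size and the meaning of compression are assumptions, not values
    taken from this README.
    """
    patches_per_side = image_side // patch_size   # 1024 // 16 = 64
    patch_tokens = patches_per_side ** 2          # 64 * 64 = 4096
    return patch_tokens // compression            # 4096 // 16 = 256


if __name__ == "__main__":
    print(vision_tokens())
```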

How these were made

  1. Converted baidu/Unlimited-OCR to GGUF with the PR #17400 convert_hf_to_gguf.py. The converter targets DeepSeek-OCR, so the config's top-level architectures was set to DeepseekOCRForCausalLM and language_config.architectures to DeepseekV2ForCausalLM (the tensor layout is otherwise identical to DeepSeek-OCR's).
  2. Exported the text decoder (BF16) and the vision tower (--mmproj, F16) separately.
  3. Built an importance matrix from a general-text corpus and produced the K-/i-quants with llama-quantize.
  4. Verified: the BF16 GGUF + mmproj correctly OCR a test document (text + grounding boxes) via llama-mtmd-cli before quantizing.
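The config change in step 1 can be sketched as a small patch to config.json. This assumes both architectures fields are lists of class names, as is usual in Hugging Face configs; the helper names and the file path are hypothetical, and this is not necessarily how the edit was made for this repo.

```python
import copy
import json
from pathlib import Path


def patch_architectures(config: dict) -> dict:
    """Return a copy of the config with the architecture names the converter expects."""
    patched = copy.deepcopy(config)
    # Assumption: both fields are lists, following the usual HF config convention.
    patched["architectures"] = ["DeepseekOCRForCausalLM"]
    patched.setdefault("language_config", {})["architectures"] = ["DeepseekV2ForCausalLM"]
    return patched


def patch_config_file(path: str) -> None:
    """Rewrite a config.json in place with the patched architecture fields."""
    p = Path(path)
    config = json.loads(p.read_text(encoding="utf-8"))
    p.write_text(json.dumps(patch_architectures(config), indent=2) + "\n", encoding="utf-8")


if __name__ == "__main__":
    patch_config_file("./Unlimited-OCR/config.json")  # hypothetical local checkout path
```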

Limitations

  • Needs the PR #17400 llama.cpp build until DeepSeek-OCR support lands in main.
  • Very low-bit i-quants (IQ3_XXS, IQ2_M) trade real accuracy for size — prefer Q4_K_M or higher for production OCR.
  • The vision encoder runs in fp16 regardless of the chosen text quant.
