Instructions to use sahilchachra/Unlimited-OCR-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use sahilchachra/Unlimited-OCR-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="sahilchachra/Unlimited-OCR-GGUF",
	filename="Unlimited-OCR-BF16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

llama.cpp

How to use sahilchachra/Unlimited-OCR-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf sahilchachra/Unlimited-OCR-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf sahilchachra/Unlimited-OCR-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf sahilchachra/Unlimited-OCR-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf sahilchachra/Unlimited-OCR-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf sahilchachra/Unlimited-OCR-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf sahilchachra/Unlimited-OCR-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf sahilchachra/Unlimited-OCR-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf sahilchachra/Unlimited-OCR-GGUF:Q4_K_M

Use Docker

docker model run hf.co/sahilchachra/Unlimited-OCR-GGUF:Q4_K_M

vLLM

How to use sahilchachra/Unlimited-OCR-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "sahilchachra/Unlimited-OCR-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "sahilchachra/Unlimited-OCR-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use sahilchachra/Unlimited-OCR-GGUF with Ollama:
```
ollama run hf.co/sahilchachra/Unlimited-OCR-GGUF:Q4_K_M
```

Unsloth Studio

How to use sahilchachra/Unlimited-OCR-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sahilchachra/Unlimited-OCR-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for sahilchachra/Unlimited-OCR-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for sahilchachra/Unlimited-OCR-GGUF to start chatting

Docker Model Runner
How to use sahilchachra/Unlimited-OCR-GGUF with Docker Model Runner:
```
docker model run hf.co/sahilchachra/Unlimited-OCR-GGUF:Q4_K_M
```

Lemonade

How to use sahilchachra/Unlimited-OCR-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull sahilchachra/Unlimited-OCR-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Unlimited-OCR-GGUF-Q4_K_M

List all available models

lemonade list

Unlimited-OCR — GGUF

GGUF quantizations of baidu/Unlimited-OCR, a 3B vision-language OCR model that extends DeepSeek-OCR to one-shot, long-horizon document parsing. This repo contains a full range of K-quants and i-quants of the language model, plus the vision projector (mmproj) needed for image input.

⚠️ Requires a DeepSeek-OCR–aware llama.cpp build (PR #17400). Unlimited-OCR uses the DeepSeek-OCR architecture (a SAM+CLIP DeepEncoder vision tower with a DeepSeek-V2 MoE text decoder). Support is not yet merged into upstream main — stock llama.cpp will not load these files. Build the PR branch (instructions below).

Files

Every run needs two files: one language model GGUF (pick a quant) plus the shared vision projector. The projector is fp16 and identical for all quants.

File	Quant	Bits	Size	Notes
`Unlimited-OCR-BF16.gguf`	BF16	16	5.47 GiB	Full-precision conversion. The base every quant is made from; reference quality.
`Unlimited-OCR-Q8_0.gguf`	Q8_0	8	2.91 GiB	Near-lossless. Best quality short of BF16; recommended if you have the disk/RAM.
`Unlimited-OCR-Q6_K.gguf`	Q6_K	6	2.43 GiB	Very high quality, essentially indistinguishable from Q8_0 for OCR.
`Unlimited-OCR-Q5_K_M.gguf`	Q5_K_M	5	2.07 GiB	High quality. Great balance when you can spare a bit more than Q4.
`Unlimited-OCR-Q5_K_S.gguf`	Q5_K_S	5	1.95 GiB	High quality, slightly smaller than Q5_K_M.
`Unlimited-OCR-Q4_K_M.gguf`	Q4_K_M	4	1.82 GiB	Recommended default — best overall size/quality trade-off.
`Unlimited-OCR-Q4_K_S.gguf`	Q4_K_S	4	1.68 GiB	Slightly smaller than Q4_K_M with a small quality cost.
`Unlimited-OCR-Q3_K_M.gguf`	Q3_K_M	3	1.45 GiB	Compact. Usable when memory is tight; some quality loss.
`Unlimited-OCR-IQ4_XS.gguf`	IQ4_XS	4	1.53 GiB	i-quant: smaller than Q4_K_S at similar quality (built with imatrix).
`Unlimited-OCR-IQ4_NL.gguf`	IQ4_NL	4	1.59 GiB	i-quant (non-linear): 4-bit tuned for ARM/edge; good on Jetson/Apple.
`Unlimited-OCR-IQ3_M.gguf`	IQ3_M	3	1.35 GiB	i-quant: solid 3-bit quality for the size (imatrix).
`Unlimited-OCR-IQ3_XXS.gguf`	IQ3_XXS	3	1.24 GiB	i-quant: very small 3-bit; noticeable quality loss but runnable.
`Unlimited-OCR-IQ2_M.gguf`	IQ2_M	2	1.15 GiB	i-quant: smallest here; experimental, lowest quality — for tight memory only.

Vision projector (required for all of the above):

File	Type	Size
`mmproj-Unlimited-OCR-F16.gguf`	F16	774.27 MiB

Sizes are the on-disk GGUF sizes. The vision encoder is kept at F16 (not quantized) — it is small and quantizing it hurts OCR accuracy. i-quants were built with an importance matrix (imatrix) computed from a general-text calibration set.
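As a rough aid for picking a file, the sketch below chooses the largest quant whose language model plus projector fits a memory budget. The sizes are copied from the tables above; the `headroom_gib` allowance for KV cache and activations is a hypothetical guess, not a measured figure, and file size is only a proxy for quality.

```python
# Sizes copied from the tables above (GiB for language models, MiB for the projector).
QUANT_SIZES_GIB = {
    "BF16": 5.47, "Q8_0": 2.91, "Q6_K": 2.43, "Q5_K_M": 2.07, "Q5_K_S": 1.95,
    "Q4_K_M": 1.82, "Q4_K_S": 1.68, "IQ4_NL": 1.59, "IQ4_XS": 1.53,
    "Q3_K_M": 1.45, "IQ3_M": 1.35, "IQ3_XXS": 1.24, "IQ2_M": 1.15,
}
MMPROJ_GIB = 774.27 / 1024  # the projector is shared by every quant


def largest_quant_that_fits(budget_gib, headroom_gib=1.0):
    """Return (quant, total_file_gib) for the largest quant that fits, else None.

    headroom_gib is an assumed allowance for KV cache and activations.
    """
    by_size = sorted(QUANT_SIZES_GIB.items(), key=lambda kv: kv[1], reverse=True)
    for quant, size in by_size:
        total = size + MMPROJ_GIB
        if total + headroom_gib <= budget_gib:
            return quant, round(total, 2)
    return None


if __name__ == "__main__":
    print(largest_quant_that_fits(4.0))
```

Treat the result as a starting point: real memory use also depends on the context size (`-c`) and on the backend.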

Build llama.cpp with DeepSeek-OCR support

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
git fetch origin pull/17400/head:pr17400 && git checkout pr17400
cmake -B build -DCMAKE_BUILD_TYPE=Release        # add -DGGML_CUDA=ON for NVIDIA
cmake --build build -j --target llama-mtmd-cli llama-server

Quick start

Download one quant + the projector (you always need both):

huggingface-cli download sahilchachra/Unlimited-OCR-GGUF \
  --include "Unlimited-OCR-Q4_K_M.gguf" "mmproj-Unlimited-OCR-F16.gguf" --local-dir ./uocr
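The same download can be scripted from Python. This is a minimal sketch that assumes the `huggingface_hub` package is installed; `files_for_quant` and `download` are illustrative helper names, not part of any library.

```python
from typing import List

REPO_ID = "sahilchachra/Unlimited-OCR-GGUF"
MMPROJ_FILE = "mmproj-Unlimited-OCR-F16.gguf"


def files_for_quant(quant: str) -> List[str]:
    """Both files a run needs: the chosen language-model quant and the projector."""
    return [f"Unlimited-OCR-{quant}.gguf", MMPROJ_FILE]


def download(quant: str = "Q4_K_M", local_dir: str = "./uocr") -> List[str]:
    """Fetch both files and return their local paths (requires network access)."""
    from huggingface_hub import hf_hub_download  # pip install huggingface_hub

    return [
        hf_hub_download(repo_id=REPO_ID, filename=name, local_dir=local_dir)
        for name in files_for_quant(quant)
    ]
```

Calling `download("Q4_K_M")` should leave the same two files in `./uocr` as the CLI command above.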

Run it on an image:

./build/bin/llama-mtmd-cli \
  -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --image document.png \
  -p "<|grounding|>Convert the document to markdown." \
  --chat-template deepseek-ocr --temp 0

--chat-template deepseek-ocr and --mmproj are required. With --image, the image is injected automatically — you do not need to type a literal <image> token in -p. Use --temp 0 for OCR (deterministic). Add -n 4096 (or more) for long/dense documents.
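To call the CLI from a script, the command above can be assembled as an argument list. The sketch below uses only the flags shown in this section; the default paths assume the `./uocr` download directory and the `./build/bin` build output used above, and `mtmd_command` / `run_ocr` are illustrative names.

```python
import subprocess


def mtmd_command(image, prompt,
                 model="./uocr/Unlimited-OCR-Q4_K_M.gguf",
                 mmproj="./uocr/mmproj-Unlimited-OCR-F16.gguf",
                 binary="./build/bin/llama-mtmd-cli",
                 max_tokens=4096):
    """Build the llama-mtmd-cli argument list with the required flags."""
    return [
        binary,
        "-m", model,
        "--mmproj", mmproj,                 # required: vision projector
        "--image", image,
        "-p", prompt,
        "--chat-template", "deepseek-ocr",  # required
        "--temp", "0",                      # deterministic decoding for OCR
        "-n", str(max_tokens),
    ]


def run_ocr(image, prompt="<|grounding|>Convert the document to markdown."):
    """Run the CLI and return its standard output; raises if the process fails."""
    result = subprocess.run(
        mtmd_command(image, prompt), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting problems with the `<|...|>` tokens in the prompt.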

Prompting guide

Unlimited-OCR uses the DeepSeek-OCR prompt vocabulary. The prompt is just an instruction; prefix it with <|grounding|> whenever you also want bounding boxes for what was read.

Task	Prompt (`-p`)
Document → Markdown (layout-aware, with boxes)	`<|grounding|>Convert the document to markdown.`
Plain text OCR (just the text, no layout)	`Free OCR.`
OCR with bounding boxes	`<|grounding|>OCR this image.`
Native Unlimited-OCR parse	`document parsing.`
Parse a figure / chart / diagram	`Parse the figure.`
Describe the image (general VQA)	`Describe this image in detail.`
Find specific text (referring grounding)	`<|grounding|>Locate <|ref|>Invoice Number<|/ref|> in the image.`
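Prompts can also be assembled programmatically. The helpers below are a small sketch built from the `<|grounding|>` prefix rule and the `<|ref|>` pattern used in this README; `build_prompt` and `locate_prompt` are hypothetical names.

```python
GROUNDING = "<|grounding|>"


def build_prompt(instruction: str, boxes: bool = False) -> str:
    """Prefix the instruction with <|grounding|> when bounding boxes are wanted."""
    return (GROUNDING if boxes else "") + instruction


def locate_prompt(text: str) -> str:
    """Referring-grounding prompt: ask for the box of a specific string."""
    return build_prompt(f"Locate <|ref|>{text}<|/ref|> in the image.", boxes=True)


if __name__ == "__main__":
    print(build_prompt("Free OCR."))
    print(build_prompt("Convert the document to markdown.", boxes=True))
    print(locate_prompt("Invoice Number"))
```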

Worked examples

1) Document → clean Markdown (tables, headings, reading order):

./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf --chat-template deepseek-ocr \
  --image invoice.png --temp 0 -n 4096 \
  -p "<|grounding|>Convert the document to markdown."

2) Just the raw text, no layout / no boxes:

./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf --chat-template deepseek-ocr \
  --image receipt.jpg --temp 0 -p "Free OCR."

3) Locate a specific string and get its box:

./build/bin/llama-mtmd-cli -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf --chat-template deepseek-ocr \
  --image form.png --temp 0 \
  -p "<|grounding|>Locate <|ref|>Invoice Number<|/ref|> in the image."

Understanding the output (grounding tokens)

With <|grounding|>, the model interleaves the recognized text with detection boxes:

<|det|>title [37, 64, 464, 132]<|/det|>INVOICE #2026-0623
<|det|>text  [37, 194, 350, 247]<|/det|>Bill To: Sahil Chachra
<|det|>text  [37, 483, 329, 543]<|/det|>Total Due: $44.00

Each [x1, y1, x2, y2] is the bounding box (top-left → bottom-right) of that span, in the coordinate space of the model's input image. Drop the <|det|>...<|/det|> tags if you only want the text, or parse them to overlay boxes / build a layout. Without <|grounding|> you get plain text (or Markdown) with no box tags.
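A minimal parsing sketch for this output follows. It assumes the exact `<|det|>label [x1, y1, x2, y2]<|/det|>text` layout shown in the sample above and integer coordinates; real output may differ in whitespace or structure, so treat the regular expression as a starting point. `Span`, `parse_grounded`, and `strip_boxes` are illustrative names.

```python
import re
from typing import List, NamedTuple, Tuple


class Span(NamedTuple):
    label: str                       # e.g. "title" or "text"
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2), top-left to bottom-right
    text: str


# One detection tag followed by its text, up to the next tag or end of output.
DET = re.compile(
    r"<\|det\|>\s*(?P<label>\w+)\s*\[(?P<box>[^\]]*)\]\s*<\|/det\|>"
    r"(?P<text>.*?)(?=<\|det\|>|\Z)",
    re.DOTALL,
)


def parse_grounded(output: str) -> List[Span]:
    """Split grounded output into (label, box, text) spans."""
    spans = []
    for m in DET.finditer(output):
        box = tuple(int(v) for v in m.group("box").split(","))
        spans.append(Span(m.group("label"), box, m.group("text").strip()))
    return spans


def strip_boxes(output: str) -> str:
    """Drop the <|det|>...<|/det|> tags and keep only the recognized text."""
    return re.sub(r"<\|det\|>.*?<\|/det\|>", "", output)
```

`parse_grounded` gives the data needed to overlay boxes or build a layout, while `strip_boxes` covers the text-only case.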

Tip — long documents: Unlimited-OCR targets one-shot long-horizon parsing. For multi-page scans, run page-by-page and concatenate. If output ever repeats/loops on a dense page, add a mild repetition penalty, e.g. --repeat-penalty 1.05, and keep --temp 0.
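The page-by-page approach can be sketched as below. `ocr_page` is a placeholder for whatever runs the model on one page image (for example a wrapper around `llama-mtmd-cli`), and `looks_looped` is a hypothetical heuristic, not a model feature, that flags output where the same line repeats many times in a row so the page can be retried with `--repeat-penalty 1.05`.

```python
from typing import Callable, Iterable


def ocr_pages(pages: Iterable[str], ocr_page: Callable[[str], str],
              separator: str = "\n\n") -> str:
    """Run OCR one page at a time and join the results in page order."""
    return separator.join(ocr_page(page).strip() for page in pages)


def looks_looped(text: str, min_repeats: int = 5) -> bool:
    """Heuristic: True if one non-empty line occurs min_repeats times in a row."""
    run, prev = 0, None
    for line in (raw.strip() for raw in text.splitlines()):
        if line and line == prev:
            run += 1
            if run + 1 >= min_repeats:
                return True
        else:
            run = 0
        prev = line
    return False
```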

Serving (OpenAI-compatible API)

./build/bin/llama-server \
  -m ./uocr/Unlimited-OCR-Q4_K_M.gguf \
  --mmproj ./uocr/mmproj-Unlimited-OCR-F16.gguf \
  --chat-template deepseek-ocr -c 8192 --host 0.0.0.0 --port 8080

Call it with an image (base64 data URL):

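# -w0 turns off line wrapping (GNU coreutils); on macOS, `base64 -i document.png` gives the same single-line output
IMG=$(base64 -w0 document.png)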
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "temperature": 0,
  "messages": [{ "role": "user", "content": [
    { "type": "text", "text": "<|grounding|>Convert the document to markdown." },
    { "type": "image_url", "image_url": { "url": "data:image/png;base64,'"$IMG"'" } }
  ]}]
}'

The Python OpenAI SDK works the same way: point base_url at http://localhost:8080/v1, then send one text part containing the prompt above and one image_url part containing the data URL.
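A minimal sketch of the Python client, assuming the `openai` package is installed and the server above is running on port 8080. The `model` value and the `api_key` string are placeholders chosen for illustration (the sketch assumes the server was started without an API key), and the helper names are not part of any library.

```python
import base64


def image_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL."""
    return f"data:{mime};base64," + base64.b64encode(image_bytes).decode("ascii")


def build_messages(image_bytes: bytes,
                   prompt: str = "<|grounding|>Convert the document to markdown."):
    """One user message with a text part and an image_url part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_data_url(image_bytes)}},
        ],
    }]


def ocr_via_server(path: str, base_url: str = "http://localhost:8080/v1") -> str:
    """Send one image to the local llama-server and return the generated text."""
    from openai import OpenAI  # pip install openai

    client = OpenAI(base_url=base_url, api_key="not-needed")  # placeholder key
    with open(path, "rb") as f:
        messages = build_messages(f.read())
    response = client.chat.completions.create(
        model="unlimited-ocr",  # placeholder name, an assumption of this sketch
        messages=messages,
        temperature=0,
    )
    return response.choices[0].message.content
```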

About the model

Architecture: DeepseekOCRForCausalLM — DeepEncoder vision (SAM-ViT-B + CLIP-L/14, 1024×1024 input, 16× downsample) → linear projector → DeepSeek-V2 MoE text decoder (12 layers, hidden 1280, 64 routed + 2 shared experts, 6 experts/token).
Task: multilingual OCR / document parsing — single image, multi-page, and PDF (one-shot long-horizon parsing). The original supports gundam (crop) and base resolution modes.
License: MIT (inherited from the base model).
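As a worked example of the vision-token arithmetic implied by the architecture line, the sketch below assumes a 16-pixel ViT patch size (not stated in this README) and reads "16× downsample" as a 16-fold reduction in token count. Under those assumptions a 1024×1024 input yields 4096 patches and 256 vision tokens.

```python
def vision_tokens(side_px: int = 1024, patch_px: int = 16,
                  token_downsample: int = 16) -> int:
    """Patch count for a square input, then the assumed token-count reduction."""
    patches = (side_px // patch_px) ** 2   # 64 * 64 = 4096 with the defaults
    return patches // token_downsample     # 4096 / 16 = 256


if __name__ == "__main__":
    print(vision_tokens())
```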

How these were made

Converted baidu/Unlimited-OCR to GGUF with the PR #17400 convert_hf_to_gguf.py. The converter targets DeepSeek-OCR, so the config's top-level architectures was set to DeepseekOCRForCausalLM and language_config.architectures to DeepseekV2ForCausalLM (the tensor layout is otherwise identical to DeepSeek-OCR's).
Exported the text decoder (BF16) and the vision tower (--mmproj, F16) separately.
Built an importance matrix from a general-text corpus and produced the K-/i-quants with llama-quantize.
Verified: the BF16 GGUF + mmproj correctly OCR a test document (text + grounding boxes) via llama-mtmd-cli before quantizing.
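The config change in the first step can be sketched as follows. This is an illustration of the described edit, not the exact script that was used; it assumes `architectures` is stored as a list, as in standard Hugging Face `config.json` files, and `patch_config` is a hypothetical helper name.

```python
import json


def patch_config(config: dict) -> dict:
    """Return a copy of the config with the architecture names the converter expects."""
    patched = json.loads(json.dumps(config))  # deep copy through JSON
    patched["architectures"] = ["DeepseekOCRForCausalLM"]
    patched.setdefault("language_config", {})["architectures"] = ["DeepseekV2ForCausalLM"]
    return patched


if __name__ == "__main__":
    # Hypothetical input values, used only to show the transformation.
    example = {"architectures": ["SomeOtherArch"], "language_config": {"hidden_size": 1280}}
    print(json.dumps(patch_config(example), indent=2))
```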

Limitations

Needs the PR #17400 llama.cpp build until DeepSeek-OCR support lands in main.
Very low-bit i-quants (IQ3_XXS, IQ2_M) trade real accuracy for size — prefer Q4_K_M or higher for production OCR.
The vision encoder runs in fp16 regardless of the chosen text quant.

Credits

Base model: baidu/Unlimited-OCR (MIT) — builds on deepseek-ai/DeepSeek-OCR.
GGUF / DeepSeek-OCR llama.cpp support: ggml-org/llama.cpp#17400.
Quantized by sahilchachra.

Downloads last month: 7,356

GGUF
Model size: 3B params
Architecture: deepseek2-ocr
Hardware compatibility: 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, 8-bit, 16-bit

Model tree for sahilchachra/Unlimited-OCR-GGUF
Base model: baidu/Unlimited-OCR
Quantized (10): this model

Collection including sahilchachra/Unlimited-OCR-GGUF
Baidu's Unlimited OCR (collection, 7 items)