TrOCR-small Printed Text - GGUF
Text recognition model for CrispEmbed. Recognizes printed text from cropped text-line images. Pair with a text detector like cstr/dbnet-ic15-GGUF for end-to-end OCR.
Architecture: DeiT-small encoder (12L, 384d, 6 heads) + TrOCR decoder (6L, 256d, 8 heads). XLM-R vocabulary (64,044 tokens). 61M parameters.
Source: microsoft/trocr-small-printed (MIT).
Model Variants
| Variant | Size | Recognition quality |
|---|---|---|
| F32 | 235 MB | exact match vs HuggingFace |
| F16 | 119 MB | exact same tokens |
| Q8_0 | 65 MB | exact same tokens |
Recommended: Q8_0 (65 MB). Q4_K is not provided because the 256-dim decoder bottleneck is too narrow for 4-bit quantization, which causes recognition errors.
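To give a feel for why 8-bit holds up where 4-bit does not, here is a minimal sketch of symmetric per-block quantization in pure Python. It assumes the standard ggml Q8_0 layout (blocks of 32 weights, one scale per block) and compares it with a naive 4-bit scheme. The 4-bit variant is a simplified stand-in, not the actual Q4_K format, and the random weights are illustrative rather than taken from this model.

```python
import random

BLOCK = 32  # ggml Q8_0 groups weights into blocks of 32

def quantize_roundtrip(values, levels):
    """Symmetric per-block quantization: scale = max|x| / levels, then round."""
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        amax = max(abs(v) for v in block)
        scale = amax / levels if amax > 0 else 1.0
        out.extend(round(v / scale) * scale for v in block)
    return out

def rms_error(a, b):
    """Root-mean-square difference between two equally long sequences."""
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)) ** 0.5

random.seed(0)
# Illustrative weights only; not drawn from the real model.
weights = [random.gauss(0.0, 0.02) for _ in range(256 * BLOCK)]

err_8bit = rms_error(weights, quantize_roundtrip(weights, 127))  # Q8_0-style
err_4bit = rms_error(weights, quantize_roundtrip(weights, 7))    # naive 4-bit, NOT Q4_K

# Q8_0 stores 32 int8 values plus one fp16 scale per block: 34 bytes / 32 weights.
bits_per_weight = 34 * 8 / BLOCK

print(f"8-bit RMS error: {err_8bit:.2e}")
print(f"4-bit RMS error: {err_4bit:.2e}")
print(f"Q8_0 bits per weight: {bits_per_weight}")
```

At 8.5 bits per weight, 61M parameters come to roughly 65 MB, which is in line with the Q8_0 row in the table above.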
Verification (all variants produce identical output)
| Input image | Output |
|---|---|
| "Hello World" | HELLO WORLD |
| "The quick brown fox" | THE QUICK BROWN FOX |
| "42 is the answer" | 42 IS THE ANSWER |
Note: trocr-small-printed uppercases its output (a training-data bias). For mixed-case output, use a trocr-base model.
Usage
Full OCR pipeline (with DBNet)
crispembed --det dbnet-ic15-q4_k.gguf \
    -m trocr-small-printed-q8_0.gguf \
    --ocr document.png
Output:
[ 0] (49,53)-(143,86) conf=0.91 "HELLO"
[ 1] (153,52)-(270,86) conf=0.91 "WORLD!"
[ 2] (50,122)-(124,157) conf=0.91 "THIS"
...
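If the CLI output needs to be consumed by another tool, the lines above can be parsed with a regular expression. The sketch below is inferred only from the sample output shown here. It assumes the two coordinate pairs are the top-left and bottom-right corners of the box; `OcrRegion` and `parse_ocr_line` are hypothetical names, not part of CrispEmbed.

```python
import re
from typing import NamedTuple, Optional, Tuple

class OcrRegion(NamedTuple):
    index: int
    box: Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1)
    conf: float
    text: str

# Matches lines such as: [ 0] (49,53)-(143,86) conf=0.91 "HELLO"
LINE_RE = re.compile(
    r'^\[\s*(\d+)\]\s+\((\d+),(\d+)\)-\((\d+),(\d+)\)\s+conf=([0-9.]+)\s+"(.*)"$'
)

def parse_ocr_line(line: str) -> Optional[OcrRegion]:
    """Parse one output line; return None for lines that do not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    idx, x0, y0, x1, y1 = (int(g) for g in m.groups()[:5])
    return OcrRegion(idx, (x0, y0, x1, y1), float(m.group(6)), m.group(7))

sample = '''[ 0] (49,53)-(143,86) conf=0.91 "HELLO"
[ 1] (153,52)-(270,86) conf=0.91 "WORLD!"
[ 2] (50,122)-(124,157) conf=0.91 "THIS"'''

regions = [r for r in map(parse_ocr_line, sample.splitlines()) if r]
print(" ".join(r.text for r in regions))
```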
C API
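#include <stdio.h>
#include "crispembed.h"

int main(void) {
    /* Load the detector and the recognizer into one OCR context. */
    void *ctx = crispembed_ocr_init("dbnet-ic15-q4_k.gguf",
                                    "trocr-small-printed-q8_0.gguf", 4);
    int n = 0;
    /* Run OCR; n receives the number of recognized regions. */
    const crispembed_ocr_result *r = crispembed_ocr(ctx, "document.png", &n);
    for (int i = 0; i < n; i++)
        printf("%s ", r[i].text);
    crispembed_ocr_free(ctx);
    return 0;
}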
Pipeline size
| Detection | Recognition | Total | Latency |
|---|---|---|---|
| Q4_K (7 MB) | Q8_0 (65 MB) | 72 MB | ~200 ms/region |
Architecture
Input: text crop (resized to 384x384, grayscale)
  |
  +-> DeiT-small encoder (12 layers)
  |     16x16 patch embedding -> 576+2 tokens (CLS + distillation)
  |     12x Pre-LN MHA (6 heads, 384d) + FFN (GELU, 1536d)
  |
  +-> TrOCR decoder (6 layers, autoregressive)
        Token + position embedding (64044 BPE vocab, 514 max positions)
        6x Self-attn (causal) + Cross-attn + FFN
        -> greedy argmax -> SentencePiece BPE detokenize
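The token count and the greedy decoding step in the diagram can be sketched as follows. This is an illustration only: `toy_step` is a hypothetical stand-in for the real decoder forward pass, and the token IDs and `max_len` cap are placeholders, not the values CrispEmbed uses.

```python
IMAGE_SIZE, PATCH = 384, 16
num_patches = (IMAGE_SIZE // PATCH) ** 2   # 24 * 24 = 576 patches
encoder_tokens = num_patches + 2           # plus CLS and distillation tokens

def greedy_decode(step_fn, bos_id, eos_id, max_len=64):
    """Repeatedly pick the highest-scoring next token until EOS or max_len."""
    tokens = [bos_id]
    while len(tokens) < max_len:
        logits = step_fn(tokens)           # scores over the vocabulary
        next_id = max(range(len(logits)), key=logits.__getitem__)
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]                      # drop the start token

def toy_step(tokens, script=(5, 6, 7, 2), vocab_size=8):
    """Hypothetical decoder: emits 5, 6, 7, then EOS (id 2)."""
    target = script[len(tokens) - 1]
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]

decoded = greedy_decode(toy_step, bos_id=0, eos_id=2)
print(encoder_tokens)  # 578
print(decoded)         # [5, 6, 7]
```

Greedy argmax keeps decoding simple and deterministic. In a real decoder, `step_fn` would also attend over the 578 encoder outputs through cross-attention.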
The tokenizer is the XLM-R SentencePiece model with the fairseq vocabulary offset. Word boundaries
are marked by ▁ (U+2581) and converted to spaces at decode time.
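A minimal sketch of the detokenization step described above. The offset of 1 is an assumption based on the Hugging Face XLM-R tokenizer convention, where regular SentencePiece IDs are shifted by one to make room for fairseq's special tokens. The example pieces are hypothetical, since the real segmentation depends on the model's vocabulary.

```python
WORD_BOUNDARY = "\u2581"  # the SentencePiece word-boundary marker
FAIRSEQ_OFFSET = 1        # assumption: Hugging Face XLM-R convention

def fairseq_to_spm_id(token_id: int) -> int:
    """Map a regular (non-special) fairseq vocabulary ID to its SentencePiece ID."""
    return token_id - FAIRSEQ_OFFSET

def detokenize(pieces):
    """Join pieces and turn word-boundary markers into spaces."""
    return "".join(pieces).replace(WORD_BOUNDARY, " ").strip()

# Hypothetical pieces; real segmentation depends on the vocabulary.
pieces = ["\u2581HELLO", "\u2581WOR", "LD"]
text = detokenize(pieces)
print(text)  # HELLO WORLD
```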
Conversion
pip install gguf numpy transformers sentencepiece safetensors
# Download model
python -c "from huggingface_hub import snapshot_download; \
snapshot_download('microsoft/trocr-small-printed', local_dir='trocr-small-printed')"
# Convert (embeds XLM-R tokenizer via AutoTokenizer)
python models/convert-trocr-to-gguf.py \
    --model-dir trocr-small-printed/ \
    --output trocr-small-printed-f32.gguf
# Quantize (Q8_0 recommended; Q4_K degrades this model)
crispembed-quantize trocr-small-printed-f32.gguf trocr-small-printed-q8_0.gguf q8_0
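After converting or quantizing, a quick sanity check is to read the GGUF header and confirm that the file has the expected magic bytes and a plausible tensor count. The sketch below uses only the standard library and assumes GGUF version 2 or later (little-endian, 64-bit counts). `read_gguf_header` is a hypothetical helper, and the demo runs on a synthetic header rather than a real model file.

```python
import os
import struct
import tempfile

def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(24)
    if len(header) < 24 or header[:4] != b"GGUF":
        raise ValueError(f"{path} is not a GGUF file")
    # uint32 version, uint64 tensor count, uint64 metadata key/value count
    return struct.unpack("<IQQ", header[4:24])

# Demo on a synthetic header; point read_gguf_header at a real file instead,
# for example trocr-small-printed-q8_0.gguf.
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(b"GGUF" + struct.pack("<IQQ", 3, 0, 0))
header = read_gguf_header(tmp.name)
os.remove(tmp.name)
print(header)  # (3, 0, 0)
```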
License
MIT (same as microsoft/trocr-small-printed).