+---
+license: mit
+language:
+- en
+tags:
+- text-recognition
+- ocr
+- trocr
+- gguf
+- crispembed
+base_model: microsoft/trocr-small-printed
+pipeline_tag: image-to-text
+---
+# TrOCR-small Printed Text — GGUF
+Text recognition model for [CrispEmbed](https://github.com/CrispStrobe/CrispEmbed).
+Recognizes printed text from cropped text-line images (output of a text detector like DBNet).
+
+**Architecture**: DeiT-small encoder (12L, 384d, 6 heads) + TrOCR decoder
+(6L, 256d, 8 heads). XLM-R vocabulary (64,044 tokens).
+
+**Source**: [microsoft/trocr-small-printed](https://huggingface.co/microsoft/trocr-small-printed) (MIT license).
+## Model Variants
+| Variant | Size | Recognition quality | Notes |
+|---------|------|-------------------|-------|
+| F32 | 235 MB | reference | exact match vs HuggingFace |
+| F16 | 118 MB | identical | exact same tokens |
+| **Q8_0** | **63 MB** | **identical** | exact same tokens, **recommended** |
+| Q4_K | 42 MB | degraded | some errors on short words (256d too narrow for Q4_K) |
+
+**Recommended: Q8_0** (63 MB). Q4_K degrades recognition quality on this small model because
+the 256-dim decoder bottleneck is too narrow for aggressive 4-bit quantization.
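
The sizes in the variant table can be roughly cross-checked from the F32 file size alone. The sketch below is an estimate under stated assumptions, not the converter's actual logic: it assumes every weight is stored in the named format and uses the ggml block layouts (34 bytes per 32 weights for Q8_0, 144 bytes per 256 weights for Q4_K). The names `BYTES_PER_WEIGHT` and `estimate_sizes_mb` are illustrative only.

```python
# Storage cost per weight for each format.
# Q8_0: 32 int8 values plus one fp16 scale = 34 bytes per 32 weights.
# Q4_K: 144 bytes per 256-weight super-block.
BYTES_PER_WEIGHT = {
    "F32": 4.0,
    "F16": 2.0,
    "Q8_0": 34 / 32,
    "Q4_K": 144 / 256,
}


def estimate_sizes_mb(f32_size_mb: float) -> dict:
    """Estimate the file size of each variant from the F32 size.

    Assumes (hypothetically) that every tensor is stored in the named format
    and that metadata overhead is negligible.
    """
    weights_in_millions = f32_size_mb / BYTES_PER_WEIGHT["F32"]
    return {name: weights_in_millions * bpw for name, bpw in BYTES_PER_WEIGHT.items()}


estimates = estimate_sizes_mb(235)  # 235 MB is the F32 size from the table
for name, size_mb in estimates.items():
    print(f"{name}: ~{size_mb:.0f} MB")
```

Under these assumptions the F16 and Q8_0 estimates land close to the listed 118 MB and 63 MB. The Q4_K estimate comes out below the listed 42 MB, which would be consistent with some tensors being kept at higher precision, though that is a guess rather than something the table states.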
+### Test Results
+| Input | F32 | F16 | Q8_0 | Q4_K |
+|-------|-----|-----|------|------|
+| "Hello World" | HELLO WORLD | HELLO WORLD | HELLO WORLD | HELOWORING |
+| "The quick brown fox" | THE QUICK BROWN FOX | THE QUICK BROWN FOX | THE QUICK BROWN FOX | THE QUICK BROWN FOX |
+| "42 is the answer" | 42 IS THE ANSWER | 42 IS THE ANSWER | 42 IS THE ANSWER | 42 IS THE ANSWER |
+
+Note: trocr-small-printed produces uppercase output regardless of the input case
+(training data bias). For mixed-case output, use trocr-base-printed or trocr-small-handwritten.
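
To put a number on "identical" versus "degraded", one option is the character error rate (CER) of each variant against the F32 output. The sketch below is not part of CrispEmbed; `levenshtein` and `cer` are hypothetical helper names, and the strings are copied from the "Hello World" row of the table.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


REFERENCE = "HELLO WORLD"  # F32 output, used as the reference
outputs = {"F16": "HELLO WORLD", "Q8_0": "HELLO WORLD", "Q4_K": "HELOWORING"}
for variant, text in outputs.items():
    print(f"{variant}: CER = {cer(REFERENCE, text):.2f}")
```

A CER of 0 means the variant reproduces the reference exactly; three short strings are far too small a sample to rank quantization levels with confidence.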
+## Usage
+### CLI (with DBNet detection)
+```bash
+# Full OCR pipeline
+crispembed --det dbnet-ic15-q4_k.gguf -m trocr-small-printed-q8_0.gguf --ocr document.png
+# JSON output
+crispembed --det dbnet-ic15-q4_k.gguf -m trocr-small-printed-q8_0.gguf --ocr document.png --json
+```
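
When the CLI is driven from a script, building the argument list explicitly avoids shell-quoting problems with file names. The following is a minimal sketch that uses only the flags shown above (`--det`, `-m`, `--ocr`, `--json`). The helper names `build_ocr_command` and `run_ocr` are hypothetical, and it is an assumption that `--json` writes a single JSON document to stdout; the result schema is deliberately not interpreted.

```python
import json
import subprocess
from typing import Any, List


def build_ocr_command(det_model: str, rec_model: str, image: str,
                      as_json: bool = False,
                      binary: str = "crispembed") -> List[str]:
    """Assemble the argv for the full OCR pipeline without going through a shell."""
    cmd = [binary, "--det", det_model, "-m", rec_model, "--ocr", image]
    if as_json:
        cmd.append("--json")
    return cmd


def run_ocr(det_model: str, rec_model: str, image: str) -> Any:
    """Run the pipeline with --json and parse stdout.

    Assumption: stdout holds exactly one JSON document. The structure of
    that document is not assumed here and is returned as parsed.
    """
    cmd = build_ocr_command(det_model, rec_model, image, as_json=True)
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(completed.stdout)
```

Calling `run_ocr("dbnet-ic15-q4_k.gguf", "trocr-small-printed-q8_0.gguf", "document.png")` would then mirror the second command above, provided `crispembed` is on the `PATH`.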
+### C API
+```c
+#include <stdio.h>
+#include "crispembed.h"
+
+int main(void) {
+    // Full pipeline (detect + recognize)
+    void *ctx = crispembed_ocr_init("dbnet-ic15-q4_k.gguf",
+                                    "trocr-small-printed-q8_0.gguf", 4);
+    int n = 0;
+    const crispembed_ocr_result *r = crispembed_ocr(ctx, "image.png", &n);
+    for (int i = 0; i < n; i++)
+        printf("%s ", r[i].text);
+    printf("\n");
+
+    // Recognition only (pre-cropped text); must run before the context is freed
+    const char *text = crispembed_ocr_recognize(ctx, "crop.png", NULL);
+    if (text)
+        printf("%s\n", text);
+
+    // Release the context only after all calls that use it
+    crispembed_ocr_free(ctx);
+    return 0;
+}
+```
+### Pipeline Performance
+Tested on a 5-line document image (800x600, 21 word-level text regions):
+
+| Detection | Recognition | Total pipeline | Total size |
+|-----------|-------------|---------------|------------|
+| Q4_K (7 MB) | Q8_0 (63 MB) | ~4s (21 regions) | **70 MB** |
+| Q4_K (7 MB) | F32 (235 MB) | ~4s (21 regions) | 242 MB |
+| F32 (46 MB) | Q8_0 (63 MB) | ~4s (21 regions) | 109 MB |
+
+Recognition takes ~200 ms per text region (encoder + decoder), CPU-only, 4 threads.
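
The total pipeline time follows from the per-region figure, since recognition runs once per detected region. A back-of-the-envelope sketch, assuming recognition dominates and detection time is small enough to ignore (the function name is illustrative only):

```python
def estimate_recognition_seconds(n_regions: int, ms_per_region: float = 200.0) -> float:
    """Estimated recognition time: one encoder + decoder pass per text region."""
    if n_regions < 0:
        raise ValueError("n_regions must be non-negative")
    return n_regions * ms_per_region / 1000.0


# 21 regions is the count from the test image above
print(f"{estimate_recognition_seconds(21):.1f} s")
```

For the 21-region test image this lands close to the ~4s totals in the table, which also suggests why the choice of detector and recognizer precision barely moves the total on this image.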
+## Conversion
+```bash
+# Download from HuggingFace
+pip install transformers sentencepiece
+python -c "from huggingface_hub import snapshot_download; snapshot_download('microsoft/trocr-small-printed', local_dir='trocr-small-printed')"
+# Convert to GGUF (includes XLM-R tokenizer via AutoTokenizer)
+python models/convert-trocr-to-gguf.py \
+    --model-dir trocr-small-printed/ \
+    --output trocr-small-printed-f32.gguf
+# Quantize
+crispembed-quantize trocr-small-printed-f32.gguf trocr-small-printed-q8_0.gguf q8_0
+```
+## Architecture Details
+```
+Input: cropped text image (resized to 384x384 grayscale)
+  |
+  +-> DeiT-small encoder (12 layers)
+  |     Patch embedding: 16x16 patches -> 576 tokens + CLS + distillation
+  |     12x: Pre-LN -> MHA (6 heads) -> FFN (GELU) -> residual
+  |     Output: 578 x 384 encoder features
+  |
+  +-> TrOCR decoder (6 layers, autoregressive)
+        Token + position embedding (64044 vocab, 514 positions)
+        6x: Self-attn (causal) -> Cross-attn (enc features) -> FFN
+        -> logits -> greedy argmax -> SentencePiece detokenize
+```
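
Two steps in the diagram can be checked with a few lines of code: the encoder token count (576 patches plus the CLS and distillation tokens) and the greedy argmax loop. The sketch below is illustrative and does not reflect CrispEmbed's implementation: `toy_step` is a hypothetical stand-in for the real decoder, and the token ids and `max_len` default are arbitrary.

```python
from typing import Callable, List, Sequence


def encoder_sequence_length(image_size: int = 384, patch_size: int = 16,
                            extra_tokens: int = 2) -> int:
    """Encoder tokens: one per patch, plus the CLS and distillation tokens."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2 + extra_tokens


def greedy_decode(step_fn: Callable[[List[int]], Sequence[float]],
                  start_id: int, eos_id: int, max_len: int = 64) -> List[int]:
    """Append the highest-scoring token each step until EOS or max_len."""
    tokens = [start_id]
    while len(tokens) < max_len:
        logits = step_fn(tokens)  # scores for the next token
        next_id = max(range(len(logits)), key=lambda i: logits[i])
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop the start token


# Toy stand-in for the decoder: emits token 3, then 1, then EOS (id 2).
SCRIPT = [3, 1, 2]


def toy_step(tokens: List[int]) -> List[float]:
    target = SCRIPT[min(len(tokens) - 1, len(SCRIPT) - 1)]
    return [1.0 if i == target else 0.0 for i in range(4)]


print(encoder_sequence_length())
print(greedy_decode(toy_step, start_id=0, eos_id=2))
```

In the real model each step would run the six decoder layers with cross-attention over the 578 x 384 encoder features; the loop structure around it stays the same.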
+## License
+MIT (same as microsoft/trocr-small-printed source model).