---
license: mit
language:
- en
tags:
- text-recognition
- ocr
- trocr
- gguf
- crispembed
base_model: microsoft/trocr-small-printed
pipeline_tag: image-to-text
---

# TrOCR-small Printed Text — GGUF

Text recognition model for [CrispEmbed](https://github.com/CrispStrobe/CrispEmbed). Recognizes printed text from cropped text-line images. Pair with a text detector like [cstr/dbnet-ic15-GGUF](https://huggingface.co/cstr/dbnet-ic15-GGUF) for end-to-end OCR.

**Architecture**: DeiT-small encoder (12L, 384d, 6 heads) + TrOCR decoder (6L, 256d, 8 heads). XLM-R vocabulary (64,044 tokens). 61M parameters.

**Source**: [microsoft/trocr-small-printed](https://huggingface.co/microsoft/trocr-small-printed) (MIT).

## Model Variants

| Variant | Size | Recognition quality |
|---------|------|-------------------|
| F32 | 235 MB | exact match vs HuggingFace |
| F16 | 119 MB | exact same tokens |
| **Q8_0** | **65 MB** | **exact same tokens** |

**Recommended: Q8_0** (65 MB). Q4_K is not provided: the 256-dim decoder bottleneck is too narrow for 4-bit quantization, which causes recognition errors.

### Verification (all variants produce identical output)

| Input image | Output |
|-------------|--------|
| "Hello World" | HELLO WORLD |
| "The quick brown fox" | THE QUICK BROWN FOX |
| "42 is the answer" | 42 IS THE ANSWER |

Note: trocr-small-printed uppercases output (training data bias). For mixed-case, use a trocr-base model.

## Usage

### Full OCR pipeline (with DBNet)

```bash
crispembed --det dbnet-ic15-q4_k.gguf \
    -m trocr-small-printed-q8_0.gguf \
    --ocr document.png
```

Output:

```
[ 0] (49,53)-(143,86) conf=0.91 "HELLO"
[ 1] (153,52)-(270,86) conf=0.91 "WORLD!"
[ 2] (50,122)-(124,157) conf=0.91 "THIS"
...
```

### C API

```c
#include <stdio.h>
#include "crispembed.h"

/* Load the detector and recognizer models. */
void *ctx = crispembed_ocr_init("dbnet-ic15-q4_k.gguf", "trocr-small-printed-q8_0.gguf", 4);

/* Run OCR on an image; n receives the number of recognized regions. */
int n;
const crispembed_ocr_result *r = crispembed_ocr(ctx, "document.png", &n);
for (int i = 0; i < n; i++)
    printf("%s ", r[i].text);

crispembed_ocr_free(ctx);
```

### Pipeline size

| Detection | Recognition | Total | Latency |
|-----------|-------------|-------|---------|
| Q4_K (7 MB) | Q8_0 (65 MB) | **72 MB** | ~200 ms/region |

## Architecture

```
Input: text crop (resized to 384x384, grayscale)
  |
  +-> DeiT-small encoder (12 layers)
  |     16x16 patch embedding -> 576+2 tokens (CLS + distillation)
  |     12x Pre-LN MHA (6 heads, 384d) + FFN (GELU, 1536d)
  |
  +-> TrOCR decoder (6 layers, autoregressive)
        Token + position embedding (64044 BPE vocab, 514 max positions)
        6x Self-attn (causal) + Cross-attn + FFN
        -> greedy argmax -> SentencePiece BPE detokenize
```

The tokenizer is the XLM-R SentencePiece model with the fairseq vocab offset. Word boundaries are marked by `▁` (U+2581) and converted to spaces at decode time.

## Conversion

```bash
pip install gguf numpy transformers sentencepiece safetensors

# Download model
python -c "from huggingface_hub import snapshot_download; \
snapshot_download('microsoft/trocr-small-printed', local_dir='trocr-small-printed')"

# Convert (embeds XLM-R tokenizer via AutoTokenizer)
python models/convert-trocr-to-gguf.py \
    --model-dir trocr-small-printed/ \
    --output trocr-small-printed-f32.gguf

# Quantize (Q8_0 recommended; Q4_K degrades this model)
crispembed-quantize trocr-small-printed-f32.gguf trocr-small-printed-q8_0.gguf q8_0
```

## License

MIT (same as [microsoft/trocr-small-printed](https://huggingface.co/microsoft/trocr-small-printed)).
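
## Appendix: token count and detokenization sketch

Two numbers and one step from the Architecture section can be checked with a few lines of Python: the encoder sequence length (576 patches plus the CLS and distillation tokens) and the conversion of `▁` markers to spaces at decode time. This is a minimal sketch and not CrispEmbed's implementation. The function names `encoder_tokens` and `detokenize` are hypothetical, the piece split in the example is made up for illustration, and the fairseq vocab offset is not modeled.

```python
from typing import List

# Illustrative sketch only; names and the example piece split are hypothetical.
WORD_BOUNDARY = "\u2581"  # "▁", the SentencePiece word-boundary marker


def encoder_tokens(image_size: int = 384, patch_size: int = 16,
                   special_tokens: int = 2) -> int:
    """Encoder sequence length: one token per patch plus CLS and distillation."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + special_tokens


def detokenize(pieces: List[str]) -> str:
    """Join SentencePiece pieces and turn word-boundary markers into spaces."""
    text = "".join(pieces).replace(WORD_BOUNDARY, " ")
    # The first piece usually starts with a marker, which leaves a leading space.
    return text.strip()


if __name__ == "__main__":
    print(encoder_tokens())                                  # 24 * 24 + 2 = 578
    print(detokenize(["\u2581HEL", "LO", "\u2581WORLD"]))    # HELLO WORLD
```

With the defaults, `encoder_tokens()` gives 24 × 24 + 2 = 578, matching the "576+2 tokens" line in the diagram.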