Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,122 @@
---
license: mit
language:
- en
tags:
- text-recognition
- ocr
- trocr
- gguf
- crispembed
base_model: microsoft/trocr-small-printed
pipeline_tag: image-to-text
---

# TrOCR-small Printed Text - GGUF

Text recognition model for [CrispEmbed](https://github.com/CrispStrobe/CrispEmbed).
Recognizes printed text from cropped text-line images (output of a text detector like DBNet).

**Architecture**: DeiT-small encoder (12L, 384d, 6 heads) + TrOCR decoder
(6L, 256d, 8 heads). XLM-R vocabulary (64,044 tokens).

**Source**: [microsoft/trocr-small-printed](https://huggingface.co/microsoft/trocr-small-printed) (MIT license).

## Model Variants

| Variant | Size | Recognition quality | Notes |
|---------|------|---------------------|-------|
| F32 | 235 MB | reference | exact match vs HuggingFace |
| F16 | 118 MB | identical | exact same tokens |
| **Q8_0** | **63 MB** | **identical** | exact same tokens, **recommended** |
| Q4_K | 42 MB | degraded | some errors on short words (256d too narrow for Q4_K) |

**Recommended: Q8_0** (63 MB). Q4_K degrades quality on this small model because
the 256-dim decoder is too narrow for aggressive 4-bit quantization.

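The file sizes above are close to what bits-per-weight arithmetic predicts. The sketch below is a rough estimate, not a measurement. It assumes the parameter count can be inferred from the F32 file (235 MB at 4 bytes per weight), that 1 MB means 10^6 bytes, and that the standard ggml block layouts apply (Q8_0: 34 bytes per 32 weights = 8.5 bits/weight; Q4_K: 144 bytes per 256 weights = 4.5 bits/weight). The function names are illustrative and not part of CrispEmbed.

```python
# Rough size estimate per quantization type (illustrative helper names).
# Assumption: every tensor is stored in the listed type and metadata is ignored.
BITS_PER_WEIGHT = {"f32": 32.0, "f16": 16.0, "q8_0": 8.5, "q4_k": 4.5}


def estimate_param_count(f32_size_mb: float) -> float:
    """Infer the number of weights from the F32 file size (4 bytes per weight)."""
    return f32_size_mb * 1e6 / 4


def estimate_size_mb(n_params: float, quant: str) -> float:
    """Estimated file size in MB if all weights used the given quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e6


n_params = estimate_param_count(235)
for quant in BITS_PER_WEIGHT:
    print(f"{quant:5s} ~{estimate_size_mb(n_params, quant):.1f} MB")
```

The F16 and Q8_0 estimates land within about 1 MB of the table. The Q4_K estimate comes out well below the listed 42 MB, which would be consistent with some tensors being kept at higher precision in that file, though the README does not say so.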
### Test Results

| Input | F32 | F16 | Q8_0 | Q4_K |
|-------|-----|-----|------|------|
| "Hello World" | HELLO WORLD | HELLO WORLD | HELLO WORLD | HELOWORING |
| "The quick brown fox" | THE QUICK BROWN FOX | THE QUICK BROWN FOX | THE QUICK BROWN FOX | THE QUICK BROWN FOX |
| "42 is the answer" | 42 IS THE ANSWER | 42 IS THE ANSWER | 42 IS THE ANSWER | 42 IS THE ANSWER |

Note: trocr-small-printed uppercases output (training data bias). For mixed-case
output, use trocr-base-printed or trocr-small-handwritten.

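To put a number on a comparison like the one in the table, one option is a character error rate (CER) based on edit distance. The sketch below is one possible way to score outputs, not the method used to produce the table. Comparison is case-insensitive because this model uppercases its output. The helper names are hypothetical.

```python
# Character error rate via Levenshtein distance (illustrative helpers).
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str, ignore_case: bool = True) -> float:
    """Edit distance divided by the reference length."""
    if ignore_case:
        reference, hypothesis = reference.upper(), hypothesis.upper()
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


# Rows taken from the table above: (input text, Q8_0 output, Q4_K output).
rows = [
    ("Hello World", "HELLO WORLD", "HELOWORING"),
    ("The quick brown fox", "THE QUICK BROWN FOX", "THE QUICK BROWN FOX"),
]
for ref, q8, q4 in rows:
    print(ref, char_error_rate(ref, q8), round(char_error_rate(ref, q4), 2))
```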
## Usage

### CLI (with DBNet detection)
```bash
# Full OCR pipeline
crispembed --det dbnet-ic15-q4_k.gguf -m trocr-small-printed-q8_0.gguf --ocr document.png

# JSON output
crispembed --det dbnet-ic15-q4_k.gguf -m trocr-small-printed-q8_0.gguf --ocr document.png --json
```

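To call the CLI from a script, a thin wrapper can build the same command line as above. This is a minimal sketch: it uses only the flags shown in this README (`--det`, `-m`, `--ocr`, `--json`) and assumes that `--json` writes a single JSON document to stdout. The JSON schema is not documented here, so the result is returned unparsed beyond `json.loads`. The function names are hypothetical.

```python
import json
import subprocess
from typing import Any, List


def build_ocr_command(image: str,
                      det_model: str = "dbnet-ic15-q4_k.gguf",
                      rec_model: str = "trocr-small-printed-q8_0.gguf",
                      as_json: bool = False,
                      binary: str = "crispembed") -> List[str]:
    """Build the argv list for the full OCR pipeline shown above."""
    cmd = [binary, "--det", det_model, "-m", rec_model, "--ocr", image]
    if as_json:
        cmd.append("--json")
    return cmd


def run_ocr_json(image: str, **model_paths: str) -> Any:
    """Run the pipeline and parse stdout as JSON (assumed output format)."""
    cmd = build_ocr_command(image, as_json=True, **model_paths)
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)


print(" ".join(build_ocr_command("document.png", as_json=True)))
```

Passing the arguments as a list (rather than a shell string) avoids quoting problems with file names that contain spaces.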
### C API
```c
#include <stdio.h>
#include "crispembed.h"

// Full pipeline (detect + recognize)
void *ctx = crispembed_ocr_init("dbnet-ic15-q4_k.gguf",
                                "trocr-small-printed-q8_0.gguf", 4);
int n;
const crispembed_ocr_result *r = crispembed_ocr(ctx, "image.png", &n);
for (int i = 0; i < n; i++)
    printf("%s ", r[i].text);

// Recognition only (pre-cropped text); reuses the same context
const char *text = crispembed_ocr_recognize(ctx, "crop.png", NULL);

// Free the context last, after every call that uses it
crispembed_ocr_free(ctx);
```

### Pipeline Performance

Tested on a 5-line document image (800x600, 21 word-level text regions):

| Detection | Recognition | Total pipeline | Total size |
|-----------|-------------|----------------|------------|
| Q4_K (7 MB) | Q8_0 (63 MB) | ~4s (21 regions) | **70 MB** |
| Q4_K (7 MB) | F32 (235 MB) | ~4s (21 regions) | 242 MB |
| F32 (46 MB) | Q8_0 (63 MB) | ~4s (21 regions) | 109 MB |

Recognition takes ~200ms per text region (encoder + decoder), CPU-only, 4 threads.

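The totals in the table follow from simple arithmetic: recognition time scales with the number of detected regions, and the footprint is the sum of the two model files. The sketch below reproduces that arithmetic so it can be reused for other page sizes. It assumes the ~200 ms per-region figure holds and leaves out detection time, which is not reported separately. The function names are illustrative.

```python
# Back-of-the-envelope pipeline estimates (illustrative helpers).
def estimate_recognition_seconds(n_regions: int, ms_per_region: float = 200.0) -> float:
    """Recognition time only; detection time is not included."""
    return n_regions * ms_per_region / 1000.0


def pipeline_size_mb(detector_mb: float, recognizer_mb: float) -> float:
    """Combined size of the detection and recognition model files."""
    return detector_mb + recognizer_mb


print(estimate_recognition_seconds(21))  # 21 regions, as in the test image
print(pipeline_size_mb(7, 63))           # Q4_K detector + Q8_0 recognizer
```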
## Conversion

```bash
# Download from HuggingFace
pip install transformers sentencepiece
python -c "from huggingface_hub import snapshot_download; snapshot_download('microsoft/trocr-small-printed', local_dir='trocr-small-printed')"

# Convert to GGUF (includes XLM-R tokenizer via AutoTokenizer)
python models/convert-trocr-to-gguf.py \
    --model-dir trocr-small-printed/ \
    --output trocr-small-printed-f32.gguf

# Quantize
crispembed-quantize trocr-small-printed-f32.gguf trocr-small-printed-q8_0.gguf q8_0
```

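After converting or quantizing, a quick sanity check is to read the GGUF header of the output file. The sketch below assumes the converter writes a standard little-endian GGUF file (version 2 or later): a 4-byte `GGUF` magic, a `uint32` version, then `uint64` tensor and metadata key/value counts. `read_gguf_header` is a hypothetical helper, not part of CrispEmbed.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF header and return its fields."""
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    version, tensor_count, kv_count = struct.unpack("<IQQ", header[4:])
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }
```

For example, `read_gguf_header("trocr-small-printed-q8_0.gguf")` should report the same tensor count as the F32 file, since quantization changes tensor types but not the set of tensors.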
## Architecture Details

```
Input: cropped text image (resized to 384x384 grayscale)
  |
  +-> DeiT-small encoder (12 layers)
  |     Patch embedding: 16x16 patches -> 576 tokens + CLS + distillation
  |     12x: Pre-LN -> MHA (6 heads) -> FFN (GELU) -> residual
  |     Output: 578 x 384 encoder features
  |
  +-> TrOCR decoder (6 layers, autoregressive)
        Token + position embedding (64044 vocab, 514 positions)
        6x: Self-attn (causal) -> Cross-attn (enc features) -> FFN
        -> logits -> greedy argmax -> SentencePiece detokenize
```

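Two parts of the diagram can be sketched in a few lines: the encoder sequence length (patch count plus the CLS and distillation tokens) and the greedy decoding loop. This is a minimal illustration under stated assumptions, not CrispEmbed's implementation. `step_fn` stands in for one decoder forward pass that returns next-token logits, and the BOS/EOS ids and `max_len` default are placeholders rather than values from the model config.

```python
from typing import Callable, List, Sequence


def encoder_sequence_length(image_size: int = 384, patch_size: int = 16,
                            extra_tokens: int = 2) -> int:
    """Number of encoder tokens: square patch grid plus CLS + distillation."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


def greedy_decode(step_fn: Callable[[Sequence[int]], Sequence[float]],
                  bos_id: int, eos_id: int, max_len: int = 64) -> List[int]:
    """Repeatedly pick the highest-scoring next token until EOS or max_len."""
    tokens = [bos_id]
    while len(tokens) < max_len:
        logits = step_fn(tokens)
        next_id = max(range(len(logits)), key=logits.__getitem__)  # argmax
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]  # drop BOS


def toy_step(tokens: Sequence[int]) -> List[float]:
    """Toy stand-in for the decoder: emits ids 5, 7, then EOS (id 2)."""
    script = [5, 7, 2]
    target = script[min(len(tokens) - 1, len(script) - 1)]
    logits = [0.0] * 10
    logits[target] = 1.0
    return logits


print(encoder_sequence_length())                   # 24 * 24 + 2
print(greedy_decode(toy_step, bos_id=0, eos_id=2))
```

In the real model, the decoder step also attends to the 578 x 384 encoder features, which are computed once per image and reused for every generated token.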
## License

MIT (same as the microsoft/trocr-small-printed source model).