cstr committed on
Commit c28e465 · verified · 1 Parent(s): 6fe8845

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +122 -0
README.md ADDED
@@ -0,0 +1,122 @@
+ ---
+ license: mit
+ language:
+ - en
+ tags:
+ - text-recognition
+ - ocr
+ - trocr
+ - gguf
+ - crispembed
+ base_model: microsoft/trocr-small-printed
+ pipeline_tag: image-to-text
+ ---
+
+ # TrOCR-small Printed Text — GGUF
+
+ Text recognition model for [CrispEmbed](https://github.com/CrispStrobe/CrispEmbed).
+ Recognizes printed text in cropped text-line images, such as the crops produced by a text detector like DBNet.
+
+ **Architecture**: DeiT-small encoder (12 layers, 384-dim, 6 heads) + TrOCR decoder
+ (6 layers, 256-dim, 8 heads). XLM-R vocabulary (64,044 tokens).
+
+ **Source**: [microsoft/trocr-small-printed](https://huggingface.co/microsoft/trocr-small-printed) (MIT license).
+
+ ## Model Variants
+
+ | Variant | Size | Recognition quality | Notes |
+ |---------|------|---------------------|-------|
+ | F32 | 235 MB | reference | exact match with the Hugging Face output |
+ | F16 | 118 MB | identical | exact same tokens |
+ | **Q8_0** | **63 MB** | **identical** | exact same tokens, **recommended** |
+ | Q4_K | 42 MB | degraded | some errors on short words (256d too narrow for Q4_K) |
+
+ **Recommended: Q8_0** (63 MB). Q4_K degrades quality on this small model because
+ the 256-dim bottleneck is too narrow for aggressive 4-bit quantization.
+
+ ### Test Results
+
+ | Input | F32 | F16 | Q8_0 | Q4_K |
+ |-------|-----|-----|------|------|
+ | "Hello World" | HELLO WORLD | HELLO WORLD | HELLO WORLD | HELOWORING |
+ | "The quick brown fox" | THE QUICK BROWN FOX | THE QUICK BROWN FOX | THE QUICK BROWN FOX | THE QUICK BROWN FOX |
+ | "42 is the answer" | 42 IS THE ANSWER | 42 IS THE ANSWER | 42 IS THE ANSWER | 42 IS THE ANSWER |
+
+ Note: trocr-small-printed uppercases its output (a training-data bias). For mixed-case
+ output, use trocr-base-printed or trocr-small-handwritten.
+
+ ## Usage
+
+ ### CLI (with DBNet detection)
+ ```bash
+ # Full OCR pipeline
+ crispembed --det dbnet-ic15-q4_k.gguf -m trocr-small-printed-q8_0.gguf --ocr document.png
+
+ # JSON output
+ crispembed --det dbnet-ic15-q4_k.gguf -m trocr-small-printed-q8_0.gguf --ocr document.png --json
+ ```
+
+ ### C API
+ ```c
+ #include <stdio.h>
+ #include "crispembed.h"
+ // Full pipeline (detect + recognize)
+ void *ctx = crispembed_ocr_init("dbnet-ic15-q4_k.gguf",
+                                 "trocr-small-printed-q8_0.gguf", 4);
+ int n = 0;
+ const crispembed_ocr_result *r = crispembed_ocr(ctx, "image.png", &n);
+ for (int i = 0; i < n; i++)
+     printf("%s ", r[i].text);
+
+ // Recognition only (pre-cropped text); reuses the same context
+ const char *text = crispembed_ocr_recognize(ctx, "crop.png", NULL);
+ crispembed_ocr_free(ctx);  // free only after the last call that uses ctx
+ ```
+
+ ### Pipeline Performance
+
+ Tested on a 5-line document image (800x600, 21 word-level text regions):
+
+ | Detection model | Recognition model | Total pipeline time | Total size |
+ |-----------------|-------------------|---------------------|------------|
+ | Q4_K (7 MB) | Q8_0 (63 MB) | ~4 s (21 regions) | **70 MB** |
+ | Q4_K (7 MB) | F32 (235 MB) | ~4 s (21 regions) | 242 MB |
+ | F32 (46 MB) | Q8_0 (63 MB) | ~4 s (21 regions) | 109 MB |
+
+ Roughly 200 ms per text region (encoder + decoder), CPU-only, 4 threads.
+
+ ## Conversion
+
+ ```bash
+ # Download from Hugging Face
+ pip install transformers sentencepiece
+ python -c "from huggingface_hub import snapshot_download; snapshot_download('microsoft/trocr-small-printed', local_dir='trocr-small-printed')"
+
+ # Convert to GGUF (includes XLM-R tokenizer via AutoTokenizer)
+ python models/convert-trocr-to-gguf.py \
+     --model-dir trocr-small-printed/ \
+     --output trocr-small-printed-f32.gguf
+
+ # Quantize
+ crispembed-quantize trocr-small-printed-f32.gguf trocr-small-printed-q8_0.gguf q8_0
+ ```
+
+ ## Architecture Details
+
+ ```
+ Input: cropped text image (resized to 384x384 grayscale)
+   |
+   +-> DeiT-small encoder (12 layers)
+   |     Patch embedding: 16x16 patches -> 576 tokens + CLS + distillation
+   |     12x: Pre-LN -> MHA (6 heads) -> FFN (GELU) -> residual
+   |     Output: 578 x 384 encoder features
+   |
+   +-> TrOCR decoder (6 layers, autoregressive)
+         Token + position embedding (64044 vocab, 514 positions)
+         6x: Self-attn (causal) -> Cross-attn (enc features) -> FFN
+         -> logits -> greedy argmax -> BPE detokenize
+ ```
+
+ ## License
+
+ MIT (same as the microsoft/trocr-small-printed source model).
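
The encoder token count in the README's architecture diagram and the timing and size figures in its pipeline performance table can be checked with simple arithmetic. The following is a minimal sketch, not part of CrispEmbed: the helper names `encoder_tokens` and `recognition_seconds` are hypothetical and exist only for illustration, and the timing estimate assumes every text region costs about the same.

```python
# Illustrative arithmetic only. The helper names are hypothetical and are
# not part of CrispEmbed or any other library.


def encoder_tokens(image_size: int = 384, patch_size: int = 16,
                   extra_tokens: int = 2) -> int:
    """Number of encoder tokens: one per patch, plus CLS and distillation."""
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


def recognition_seconds(regions: int, ms_per_region: float) -> float:
    """Recognition time, assuming each region costs about the same."""
    return regions * ms_per_region / 1000.0


if __name__ == "__main__":
    # 384 / 16 = 24 patches per side, 24 * 24 = 576 patches, plus 2 extras
    print(encoder_tokens())
    # 21 regions at roughly 200 ms each
    print(recognition_seconds(21, 200))
    # Detector + recognizer file sizes in MB, one sum per table row
    print(7 + 63, 7 + 235, 46 + 63)
```

The first value is 578, matching the 578 x 384 encoder output. The second is 4.2 seconds, in line with the reported total of about 4 s, which suggests (but does not prove) that recognition accounts for most of the pipeline time. The sums reproduce the 70, 242 and 109 MB totals.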