---
base_model:
- tachiwin/Tachiwin-OCR-1.5
tags:
- text-generation-inference
- transformers
- unsloth
- paddleocr_vl
- trl
- sft
license: apache-2.0
datasets:
- tachiwin/multilingual_ocr_llm_2
metrics:
- cer
library_name: adapter-transformers
model-index:
- name: Tachiwin-OCR-1.5
  results:
  - task:
      type: image-to-text
      name: Optical Character Recognition (OCR)
    dataset:
      name: Tachiwin Multilingual OCR LLM
      type: tachiwin-multilingual-ocr-llm
    metrics:
    - name: Character Error Rate (CER)
      type: cer
      value: 2.03
    - name: Word Error Rate (WER)
      type: wer
      value: 3.6
    - name: OCR Accuracy (1 - CER)
      type: accuracy
      value: 97.97
    - name: Word Accuracy (1 - WER)
      type: word-accuracy
      value: 96.4
pipeline_tag: image-text-to-text
---

# TachiwinOCR 1.5 GGUF 🦡

**for the Indigenous Languages of Mexico**

This is a PaddleOCR-VL fine-tune specialized in the 68 indigenous languages of Mexico and their diverse character and glyph repertoire, a world first in tech access and linguistic rights.

## Inference

You can perform inference using the `PaddleOCR` pipeline or the `transformers` library.

#### Option A: Using PaddleOCR

```python
from paddleocr import PaddleOCRVL

# Local directory holding the downloaded Tachiwin-OCR-1.5 weights.
# Placeholder path: replace it with your own location.
path_to_tachiwin_downloaded_model = "path/to/Tachiwin-OCR-1.5"

# Load the fine-tuned model
pipeline = PaddleOCRVL(
    vl_rec_model_name="tachiwin/Tachiwin-OCR-1.5",
    vl_rec_model_dir=path_to_tachiwin_downloaded_model,
)

# Predict on an image
output = pipeline.predict("test.png")

# Print each result and save it as JSON and Markdown
for res in output:
    res.print()
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")
```

#### Option B: Using Transformers

```python
from PIL import Image
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL = "tachiwin/Tachiwin-OCR-1.5"
image_path = "my_image.png"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

image = Image.open(image_path).convert("RGB")

# Load the model and its processor from the same checkpoint
model = AutoModelForCausalLM.from_pretrained(
    MODEL, trust_remote_code=True, torch_dtype=torch.bfloat16
).to(DEVICE).eval()
processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True)

# A single user turn: the image followed by the OCR prompt
messages = [
    {"role": "user",
     "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "OCR:"},
     ]}
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(DEVICE)

outputs = model.generate(**inputs, max_new_tokens=1024, min_new_tokens=1)
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(generated_text)
```

---

## 📊 Benchmark Results

Tachiwin-OCR 1.5 was evaluated against the base PaddleOCR-VL 1.5 model using a diverse subset of Indigenous language samples. The fine-tuning results demonstrate **dramatic improvements** in both character and word recognition accuracy, far surpassing the gains seen in version 1.0.
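
For readers who want to reproduce this kind of measurement on their own samples, the following is a minimal sketch of how CER and WER are conventionally computed: Levenshtein edit distance divided by the length of the reference, over characters and over whitespace-separated words respectively. This card does not state which tool produced its numbers, so the function names and the NFC normalization step are assumptions, not the project's actual evaluation code.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _normalize(text):
    # Assumption: compare in NFC so that precomposed and combining
    # diacritics count as the same character.
    return unicodedata.normalize("NFC", text)


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    ref, hyp = _normalize(reference), _normalize(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference, hypothesis):
    """Word error rate: word edits / reference words."""
    ref = _normalize(reference).split()
    hyp = _normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# One wrong character in a 12-character, 2-word reference
print(f"CER: {cer('tachiwin ocr', 'tachiwim ocr'):.2%}")  # CER: 8.33%
print(f"WER: {wer('tachiwin ocr', 'tachiwim ocr'):.2%}")  # WER: 50.00%
```

A single wrong character costs one character edit but a whole word edit, which is why WER is consistently higher than CER in the tables that follow.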

### Summary Metrics

| Metric | Base Model (Raw) | Tachiwin-OCR 1.5 (Fine-tuned) | Improvement |
| :--- | :---: | :---: | :---: |
| **Character Error Rate (CER)** | 17.65% | 2.03% | **88.5% (Relative Reduction)** |
| **Word Error Rate (WER)** | 38.59% | 3.60% | **90.7% (Relative Reduction)** |
| **OCR Accuracy (1 − CER)** | 82.35% | 97.97% | **+15.61pp (Absolute)** |
| **Word Accuracy (1 − WER)** | 61.41% | 96.40% | **+34.99pp (Absolute)** |

### Version Comparison: 1.0 → 1.5

| Metric | Tachiwin-OCR v1.0 | Tachiwin-OCR v1.5 | Δ Change |
| :--- | :---: | :---: | :---: |
| **CER** | 6.80% | 2.03% | **−4.77pp** |
| **WER** | 17.36% | 3.60% | **−13.76pp** |
| **Accuracy (1 − CER)** | 93.20% | 97.97% | **+4.77pp** |
| **Word Accuracy (1 − WER)** | 82.64% | 96.40% | **+13.76pp** |
| **Relative CER Reduction** | 10.4% | 88.5% | **+78.1pp** |
| **Relative WER Reduction** | 31.0% | 90.7% | **+59.7pp** |

### Detailed Comparison: v1.5 Sample Results

Results across 21 language samples. "Raw" is the base model and "FT" is the fine-tuned model; the last column is the absolute CER reduction in percentage points. Languages with tonal or complex diacritic systems show the most dramatic improvements:

| # | Language Code | Raw CER | FT CER | Raw WER | FT WER | CER Improvement (pp) |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| 0 | `zpo` (Zapotec) | 0.24% | 0.00% | 1.12% | 0.00% | +0.24pp |
| 1 | `maz` (Central Mazahua) | 0.41% | 0.00% | 2.27% | 0.00% | +0.41pp |
| 2 | `zao` (Zapotec) | 6.18% | 3.49% | 23.61% | 12.50% | +2.69pp |
| 3 | `mat` (Matlatzinca) | 6.51% | 0.00% | 42.55% | 0.00% | +6.51pp |
| 4 | `amu` (Amuzgo) | 85.52% | 0.00% | 89.13% | 0.00% | **+85.52pp** |
| 5 | `mxp` (Mixe) | 15.91% | 11.87% | 54.90% | 9.80% | +4.04pp |
| 6 | `yaq` (Yaqui) | 1.82% | 0.00% | 3.12% | 0.00% | +1.82pp |
| 7 | `poe` (Popoloca) | 6.78% | 3.39% | 62.50% | 12.50% | +3.39pp |
| 8 | `zpc` (Zapotec) | 9.43% | 2.05% | 42.11% | 13.16% | +7.38pp |
| 9 | `sei` (Seri) | 1.89% | 0.00% | 10.61% | 0.00% | +1.89pp |
| 10 | `lac` (Lacandon) | 9.80% | 0.00% | 42.31% | 0.00% | +9.80pp |
| 11 | `zao` (Zapotec) | 93.01% | 0.00% | 100.00% | 0.00% | **+93.01pp** |
| 12 | `mxt` (Mixtec) | 6.70% | 0.00% | 19.18% | 0.00% | +6.70pp |
| 13 | `huv` (San Marcos Huistepec Zapotec) | 1.41% | 0.00% | 10.34% | 0.00% | +1.41pp |
| 14 | `tee` (Huehuetla Tepehua) | 3.03% | 0.00% | 17.33% | 0.00% | +3.03pp |
| 15 | `tzh` (Tzeltal) | 2.67% | 0.00% | 15.91% | 0.00% | +2.67pp |
| 16 | `mto` (Totontepec Mixe) | 93.12% | 32.47% | 100.00% | 39.71% | +60.65pp |
| 17 | `amu` (Amuzgo) | 14.96% | 2.36% | 52.46% | 1.64% | +12.60pp |
| 18 | `mih` (Chayuco Mixtec) | 3.76% | 0.00% | 9.52% | 0.00% | +3.76pp |
| 19 | `zpm` (Mixtec) | 6.98% | 0.00% | 32.73% | 0.00% | +6.98pp |
| 20 | `toc` (Tojolabal) | 11.32% | 0.00% | 57.14% | 0.00% | +11.32pp |
| | **AVERAGE** | **17.65%** | **2.03%** | **38.59%** | **3.60%** | **+15.61pp** |

### Key Findings

- **Unprecedented Accuracy Gains:** 15 of the 21 samples reached a fine-tuned CER of **0.00%**, meaning perfect character-level recognition on those samples, a result not seen in v1.0.
- **Hardest Cases Tackled:** Amuzgo (`amu`, sample 4) and Zapotec (`zao`, sample 11) started with CERs of 85.52% and 93.01% and were reduced to zero after fine-tuning, improvements of 85.52 and 93.01 percentage points respectively.
- **Remaining Challenges:** `mto` (Totontepec Mixe) remains the most difficult sample in the set, with a fine-tuned CER of 32.47%. That is still a 65% relative improvement over its raw baseline, but it indicates that further work is needed for highly complex orthographies.
- **Word-Level Leap:** WER dropped from 38.59% to 3.60%, a **34.99 percentage point** absolute improvement compared to only 7.81pp in v1.0, demonstrating a qualitative leap in the model's ability to reconstruct full word forms in these language families.
- **Robustness:** The model continues to show high resilience against synthetic distortions applied during the data generation phase.
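
The benchmark figures mix two kinds of improvement: relative reduction (the share of the baseline error that was removed) and absolute change in percentage points. The sketch below recomputes both from the rounded values published in this card; the helper names are hypothetical and chosen for illustration only. Because it starts from rounded inputs, the absolute CER gain comes out as 15.62pp rather than the 15.61pp reported, a difference that would be consistent with the card's figure having been derived from unrounded values.

```python
def relative_reduction(baseline, finetuned):
    """Fraction of the baseline error rate removed by fine-tuning."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - finetuned) / baseline


def absolute_gain_pp(baseline, finetuned):
    """Absolute error reduction in percentage points."""
    return baseline - finetuned


# Rounded error rates (in percent) as published in this card
raw_cer, ft_cer = 17.65, 2.03
raw_wer, ft_wer = 38.59, 3.60
mto_raw_cer, mto_ft_cer = 93.12, 32.47  # sample 16, Totontepec Mixe

print(f"CER relative reduction: {relative_reduction(raw_cer, ft_cer):.1%}")
print(f"WER relative reduction: {relative_reduction(raw_wer, ft_wer):.1%}")
print(f"CER absolute gain: {absolute_gain_pp(raw_cer, ft_cer):.2f}pp")
print(f"WER absolute gain: {absolute_gain_pp(raw_wer, ft_wer):.2f}pp")
print(f"mto relative reduction: "
      f"{relative_reduction(mto_raw_cer, mto_ft_cer):.1%}")
```

Keeping the two measures apart matters when comparing versions: a sample that starts at 0.24% CER can only gain 0.24pp in absolute terms, even though reaching 0.00% is a 100% relative reduction.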

**Tachiwin** (Totonac for "language") is dedicated to bridging the digital divide for indigenous languages of Mexico through AI technology.

- **Developed by:** Tachiwin
- **License:** apache-2.0
- **Finetuned from model:** PaddlePaddle/PaddleOCR-VL-1.5

This paddleocr_vl model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth).