---
language:
  - grt
license: cc-by-4.0
tags:
  - ocr
  - florence-2
  - garo
  - northeast-india
  - image-to-text
base_model: microsoft/Florence-2-base-ft
metrics:
  - character_accuracy
model-index:
  - name: MWirelabs/garo-ocr
    results:
      - task:
          type: image-to-text
          name: OCR
        metrics:
          - type: character_accuracy
            value: 93.13
            name: Character Accuracy (1000 samples)
---

# GaroOCR

![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)
![Character Accuracy](https://img.shields.io/badge/Char%20Accuracy-93.13%25-brightgreen)

OCR model for the Garo (grt_Latn) language, fine-tuned from `microsoft/Florence-2-base-ft` on Garo text images.

Developed by **MWire Labs** (Shillong, Meghalaya) as part of an ongoing effort to build foundational AI for Northeast Indian languages.

---

## Model Details

| | |
|---|---|
| Base model | `microsoft/Florence-2-base-ft` |
| Parameters | 231M |
| Language | Garo (Achik) |
| Task | OCR (image → text) |
| Training samples | 80,000 |
| Epochs | 5 |
| Character accuracy | 93.13% (1,000 samples) |

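This card does not spell out how character accuracy is computed. A common convention, assumed here and not confirmed by the authors, is `100 * (1 - CER)`, where CER is the character-level edit distance divided by the reference length. The helper names `levenshtein` and `character_accuracy` below are illustrative only; for example, `character_accuracy("abcd", "abce")` gives 75.0 because one of four reference characters is wrong.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def character_accuracy(prediction: str, reference: str) -> float:
    """Assumed definition: 100 * (1 - edit_distance / len(reference)), floored at 0."""
    if not reference:
        return 100.0 if not prediction else 0.0
    cer = levenshtein(prediction, reference) / len(reference)
    return max(0.0, 1.0 - cer) * 100.0
```
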
---

## Training Setup

- **Hardware:** NVIDIA A40 (48GB)
- **Precision:** bfloat16
- **Batch size:** 4 (effective 16 with gradient accumulation)
- **Learning rate:** 3e-4 with cosine scheduler
- **Max label length:** 128 tokens
- **Task prompt:** `<OCR>` (Florence-2 uppercase token)

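The numbers above imply a rough training schedule, sketched here under several assumptions that the card does not state: a single GPU, all 80,000 samples seen once per epoch, no warmup, and a cosine decay that ends at zero. The function `cosine_lr` is a hypothetical helper, not the authors' training code.

```python
import math

PER_STEP_BATCH = 4       # batch size from the list above
EFFECTIVE_BATCH = 16     # effective batch size from the list above
TRAIN_SAMPLES = 80_000   # from the Model Details table
EPOCHS = 5
PEAK_LR = 3e-4

# Micro-batches accumulated before each optimizer update.
accumulation_steps = EFFECTIVE_BATCH // PER_STEP_BATCH   # 4
# Optimizer updates per epoch and in total (assumes a single GPU).
steps_per_epoch = TRAIN_SAMPLES // EFFECTIVE_BATCH       # 5000
total_steps = steps_per_epoch * EPOCHS                   # 25000


def cosine_lr(step: int, total: int = total_steps, peak: float = PEAK_LR) -> float:
    """Cosine decay from `peak` to zero over `total` steps (assumes no warmup)."""
    progress = min(max(step / total, 0.0), 1.0)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))
```
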
---

## Usage

```python
from transformers import AutoProcessor, AutoModelForCausalLM
from PIL import Image
import torch

# Florence-2 ships custom modeling code, so trust_remote_code=True is required.
processor = AutoProcessor.from_pretrained("MWirelabs/garo-ocr", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/garo-ocr",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).cuda()

image = Image.open("your_image.png").convert("RGB")

# "<OCR>" is the task prompt the model was fine-tuned with.
inputs = processor(text="<OCR>", images=image, return_tensors="pt")
inputs = {k: v.cuda() for k, v in inputs.items()}
# Cast only the pixel values to bfloat16 to match the model weights; input_ids stay integer.
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

with torch.no_grad():
    generated = model.generate(
        pixel_values=inputs["pixel_values"],
        input_ids=inputs["input_ids"],
        max_new_tokens=128,  # same as the 128-token label cap used in training
    )

text = processor.tokenizer.decode(generated[0], skip_special_tokens=True)
print(text)
```

> **Note:** Use `transformers==4.38.2` for compatibility.

---

## Limitations

- Output is reliable up to roughly 128 tokens, the maximum label length used during training
- Covers Garo only: it is part of MWire Labs' monolingual series, and a multilingual NE-OCR model covering more Northeast Indian languages is in development

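One possible workaround for the ~128-token limit, offered as an untested sketch and not something the authors prescribe, is to cut a tall page into horizontal bands and run OCR on each band separately. The hypothetical helper `band_boxes` below only computes crop boxes in Pillow's `(left, upper, right, lower)` order, so each box can be passed to `image.crop(box)`. Fixed-height bands can slice through a line of text, so the band height and overlap would need tuning per document.

```python
def band_boxes(width: int, height: int, band_height: int, overlap: int = 0):
    """Return (left, upper, right, lower) boxes covering the image top to bottom."""
    if band_height <= 0 or not 0 <= overlap < band_height:
        raise ValueError("band_height must be positive and overlap smaller than it")
    boxes = []
    stride = band_height - overlap  # vertical distance between band tops
    top = 0
    while top < height:
        bottom = min(top + band_height, height)
        boxes.append((0, top, width, bottom))
        if bottom == height:
            break  # the last band reached the bottom edge
        top += stride
    return boxes
```
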
---