---
license: apache-2.0
language:
- kha
- en
metrics:
- cer
- wer
- exact_match
library_name: transformers
pipeline_tag: image-to-text
tags:
- ocr
- khasi
- deepseek
- vision
- multimodal
- unsloth
- lora
- handwriting-recognition
- document-understanding
base_model: unsloth/DeepSeek-OCR-2
datasets:
- toiar/Khasi-OCR-36K
---

# Dak-OCR v1: Khasi Optical Character Recognition Model

**Dak-OCR** is a fine-tuned version of **DeepSeek-OCR-2** designed for accurate **OCR**, **Document Understanding**, and **Handwriting Recognition** in the **Khasi Language**.

It was trained on the custom [Khasi-OCR-36K dataset](https://huggingface.co/datasets/toiar/Khasi-OCR-36K) to reduce hallucination and repetition issues often seen in base multimodal models when working with low-resource languages. The model is designed to preserve document structure and produce clean **Markdown** output.

## Model Highlights
* **Language Support:** Native Khasi (using Latin script with special characters like ï and ñ) and English.
* **Task:** Specialized for "Free OCR" (transcribing document images into Markdown-formatted text, preserving headings, paragraphs, lists, and tables).
* **Robustness:** Intended to remain accurate on degraded, noisy, and historical scans; the evaluation set described below includes degraded and noisy samples.
* **Base Model:** `unsloth/DeepSeek-OCR-2`
* **Hardware & Performance:** Training ran on a single NVIDIA A100 SXM (80 GB VRAM, 16 vCPUs) in native `bfloat16` precision, chosen to balance output quality against memory use.
* **LoRA Setup:** Adaptation was implemented through LoRA with rank `r=64` and `lora_alpha=128`. LoRA layers were applied broadly to linear components across the vision encoders, language layers, attention blocks, and MLP modules to enable strong task specialization.
* **Vision Processing:** To handle large or dense document pages, the model uses dynamic high-resolution multi-patch cropping (`crop_mode=True`) with `base_size=1024` and `image_size=768`, which avoids the loss of detail caused by aggressive downscaling.
* **Model Precision:** The released weights are provided in native `bfloat16` precision.
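
To give a feel for the LoRA settings listed above, here is a small arithmetic sketch. It assumes the standard LoRA scaling of `lora_alpha / r` (rank-stabilized LoRA would scale differently), and the 1024×1024 layer shape is a hypothetical example, not a dimension taken from this model.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Scaling applied to the low-rank update under standard LoRA: alpha / r."""
    return lora_alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer.

    The update is B @ A, where A has shape (r, d_in) and B has shape (d_out, r).
    """
    return r * (d_in + d_out)


# Values from the model card.
R, LORA_ALPHA = 64, 128

# Hypothetical layer shape, for illustration only.
D_IN = D_OUT = 1024

scale = lora_scaling(R, LORA_ALPHA)
adapter_params = lora_param_count(D_IN, D_OUT, R)
full_params = D_IN * D_OUT

print(f"scaling factor: {scale}")
print(f"adapter params: {adapter_params} "
      f"({adapter_params / full_params:.1%} of the full weight matrix)")
```

With `lora_alpha` set to twice the rank, the low-rank update is scaled by 2 before being added to the frozen weights.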

---

## Performance & Evaluation

The model was evaluated on a mixed set of 40 highly dense Khasi samples containing **complex markdown** and **degraded/noisy scans**.

### Evaluation Results

| Metric (lower is better)   | Score |
|----------------------------|-------|
| Word Error Rate (WER)      | 1.71% |
| Character Error Rate (CER) | 0.91% |
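
For readers who want to score their own transcriptions, the sketch below shows one common way to compute WER and CER from edit distance. It is not necessarily the script used to produce the numbers above: the NFC normalization step, the whitespace tokenization, and the sample strings are all assumptions made for illustration.

```python
import unicodedata


def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def _nfc(text: str) -> str:
    # Normalize so that composed and decomposed forms of ï and ñ compare equal.
    return unicodedata.normalize("NFC", text)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    ref, hyp = _nfc(reference), _nfc(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref, hyp = _nfc(reference).split(), _nfc(hypothesis).split()
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative strings only; not taken from the evaluation set.
reference = "Ka ktien Khasi"
hypothesis = "Ka ktien Khasï"
print(f"CER: {cer(reference, hypothesis):.2%}")
print(f"WER: {wer(reference, hypothesis):.2%}")
```

A single wrong diacritic costs one character edit but a whole word error, which is why WER is typically the higher of the two figures.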

---

## Usage

```python
from unsloth import FastVisionModel
from transformers import AutoModel
import torch

# Load Model
model, tokenizer = FastVisionModel.from_pretrained(
    "toiar/Dak-OCR",
    load_in_4bit = False,
    auto_model = AutoModel,
    trust_remote_code = True,
    torch_dtype = torch.bfloat16,
)
FastVisionModel.for_inference(model)

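# Use greedy decoding so the transcription is deterministic; the sampling
# parameters are cleared because they are unused when do_sample is False.
model.generation_config.do_sample = False
model.generation_config.temperature = None
model.generation_config.top_p = None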

# Inference
prompt = "<image>\nFree OCR."
image_path = "path/to/your/khasi_document.png"

with torch.no_grad():
    output = model.infer(
        tokenizer,
        prompt=prompt,
        image_file=image_path,
        base_size=1024,
        image_size=768,
        crop_mode=True
    )
    
print(output)