---
license: apache-2.0
language:
- kha
- en
metrics:
- cer
- wer
- exact_match
library_name: transformers
pipeline_tag: image-to-text
tags:
- ocr
- khasi
- deepseek
- vision
- multimodal
- unsloth
- lora
- handwriting-recognition
- document-understanding
base_model: unsloth/DeepSeek-OCR-2
datasets:
- toiar/Khasi-OCR-36K
---

# Dak-OCR v1: Khasi Optical Character Recognition Model

**Dak-OCR** is a fine-tuned version of **DeepSeek-OCR-2** designed for accurate **OCR**, **Document Understanding**, and **Handwriting Recognition** in the **Khasi Language**.

It was trained on the custom [Khasi-OCR-36K dataset](https://huggingface.co/datasets/toiar/Khasi-OCR-36K) to reduce hallucination and repetition issues often seen in base multimodal models when working with low-resource languages. The model is designed to preserve document structure and produce clean **Markdown** output.

## Model Highlights

* **Language Support:** Native Khasi (using Latin script with special characters like ï and ñ) and English.
* **Task:** Specialized for "Free OCR" (transcribing document images into Markdown-formatted text, preserving headings, paragraphs, lists, and tables).
* **Robustness:** Highly resilient to degraded, noisy, and historical scans.
* **Base Model:** `unsloth/DeepSeek-OCR-2`
* **Hardware & Performance:** Training was conducted on an NVIDIA A100 SXM (80GB VRAM / 16 vCPU) leveraging native `bfloat16` precision for optimal quality and memory efficiency.
* **LoRA Setup:** Adaptation was implemented through LoRA with a rank of 64 (`r=64`) and `lora_alpha=128`. LoRA layers were applied broadly to linear components across vision encoders, language layers, attention blocks, and MLP modules to enable strong task specialization.
* **Vision Processing:** To effectively handle large or dense document pages, the model uses dynamic high-resolution multi-patch cropping (`crop_mode=True`) with a base resolution of 1024 and image size of 768, preventing loss of detail from aggressive downscaling.
* **Model Precision:** The final model is provided in native `bfloat16` precision.

---

## Performance & Evaluation

The model was evaluated on a mixed set of 40 highly dense Khasi samples containing **complex markdown** and **degraded/noisy scans**.

### Evaluation Results

| Metric | Score |
|--------|-------|
| WER    | 1.71% |
| CER    | 0.91% |

---

## Usage

```python
from unsloth import FastVisionModel
from transformers import AutoModel
import torch

# Load the model and tokenizer in bfloat16 (no 4-bit quantization)
model, tokenizer = FastVisionModel.from_pretrained(
    "toiar/Dak-OCR",
    load_in_4bit = False,
    auto_model = AutoModel,
    trust_remote_code = True,
    torch_dtype = torch.bfloat16,
)
FastVisionModel.for_inference(model)

# Use greedy decoding; clear the sampling parameters that do not apply
model.generation_config.do_sample = False
model.generation_config.temperature = None
model.generation_config.top_p = None

# Inference
# DeepSeek-OCR prompts start with the <image> placeholder token
prompt = "<image>\nFree OCR."
image_path = "path/to/your/khasi_document.png"

with torch.no_grad():
    output = model.infer(
        tokenizer,
        prompt=prompt,
        image_file=image_path,
        base_size=1024,
        image_size=768,
        crop_mode=True,
    )

print(output)
```
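
The WER and CER scores in the evaluation table are edit-distance metrics: the number of word-level or character-level edits needed to turn the model output into the reference, divided by the reference length. The sketch below shows one common way to compute them. It is not the evaluation script used for Dak-OCR; the function names `edit_distance`, `cer`, and `wer`, the whitespace tokenization for WER, and the sample strings are all illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference item missing from hypothesis
                curr[j - 1] + 1,     # extra item in hypothesis
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits / reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate, assuming whitespace-separated words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Illustrative strings only, not taken from the evaluation set
    reference = "ka ktien khasi"
    hypothesis = "ka ktien khasï"
    print(f"CER: {cer(reference, hypothesis):.2%}")
    print(f"WER: {wer(reference, hypothesis):.2%}")
```

A single wrong diacritic counts as one character error but makes the whole word wrong, which is why WER is typically higher than CER on the same output. Packages such as `jiwer` provide equivalent `wer` and `cer` functions if a maintained implementation is preferred.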