Kazakh Printed TrOCR

This model is a fine-tuned version of kazars24/trocr-base-handwritten-ru, optimized for recognizing printed Kazakh text. It uses the TrOCR (Transformer-based Optical Character Recognition) architecture, which pairs a Vision Transformer (ViT) encoder with a RoBERTa-based decoder.

Model Description

The model was adapted to handle Kazakh-specific Cyrillic characters (ә, ғ, қ, ң, ө, ұ, ү, һ, і) by resizing token embeddings and training on synthetic data.
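The card does not show how the embeddings were resized, so the following is only a minimal sketch of one common approach with Hugging Face objects. It assumes `tokenizer` is something like `processor.tokenizer` and `decoder` is `model.decoder` of a `VisionEncoderDecoderModel`; the helper names and the toy Russian vocabulary are hypothetical.

```python
# The nine Kazakh-specific Cyrillic letters listed in the model card.
KAZAKH_LOWER = "әғқңөұүһі"


def missing_kazakh_tokens(vocab):
    """Return Kazakh letters (lower and upper case) absent from `vocab`."""
    letters = list(KAZAKH_LOWER) + [ch.upper() for ch in KAZAKH_LOWER]
    return [ch for ch in letters if ch not in vocab]


def extend_decoder_vocab(tokenizer, decoder, new_tokens):
    """Add tokens, then grow the decoder's embedding matrix to match.

    Assumption: `tokenizer` behaves like a Hugging Face tokenizer
    (add_tokens, __len__) and `decoder` like a Hugging Face decoder
    model (resize_token_embeddings).
    """
    num_added = tokenizer.add_tokens(new_tokens)
    if num_added > 0:
        decoder.resize_token_embeddings(len(tokenizer))
    return num_added


# Toy vocabulary standing in for a Russian-only character set (hypothetical).
russian_vocab = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")
print(missing_kazakh_tokens(russian_vocab))
```

With real objects the call would look like `extend_decoder_vocab(processor.tokenizer, model.decoder, missing_kazakh_tokens(processor.tokenizer.get_vocab()))`. The newly added embedding rows start untrained, which is why fine-tuning on Kazakh text is needed afterwards.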

Developed by: Kamilya Nazarkhanova
Model type: VisionEncoderDecoder (TrOCR)
Language(s): Kazakh (kk)
Finetuned from: kazars24/trocr-base-handwritten-ru

Training Data & Lineage

The model was trained on thekamilya/kazakh-printed-dataset, which was synthetically generated using text from the ISSAI KazPARC corpus.

Data Generation Pipeline:

To overcome the scarcity of labeled Kazakh OCR data, I built a synthetic data generation pipeline that randomizes three aspects of each rendered sample:

Environmental Simulation: Randomly rendered each sample in "Light" or "Dark" mode to vary background and text colors.
Stylistic Diversity: Randomized font selection, sizes, and text "jitter" to improve spatial invariance.
Optical Degradation: Applied stochastic Gaussian noise, random rotations, and varying levels of JPEG compression (quality 30–95%) to simulate real-world document quality.
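The generation code itself is not part of this card. As a rough illustration of the randomization described above, the sketch below samples one set of rendering parameters per image. Only the JPEG range (30–95, read here as a quality setting) comes from the card; every other range, the `RenderParams` and `sample_render_params` names, and the font paths are assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class RenderParams:
    mode: str            # "light" or "dark"
    font_path: str       # one of the candidate fonts
    font_size: int       # pixels
    jitter: tuple        # (dx, dy) offset of the text in pixels
    rotation_deg: float  # small in-plane rotation
    noise_sigma: float   # std. dev. of additive Gaussian noise
    jpeg_quality: int    # JPEG quality setting


def sample_render_params(rng, fonts):
    """Draw one random rendering configuration for a synthetic text line."""
    return RenderParams(
        mode=rng.choice(["light", "dark"]),
        font_path=rng.choice(fonts),
        font_size=rng.randint(18, 48),                    # assumed range
        jitter=(rng.randint(-4, 4), rng.randint(-4, 4)),  # assumed range
        rotation_deg=rng.uniform(-3.0, 3.0),              # assumed range
        noise_sigma=rng.uniform(0.0, 15.0),               # assumed range
        jpeg_quality=rng.randint(30, 95),                 # range from the card
    )


rng = random.Random(0)  # fixed seed so the sampled configurations are reproducible
fonts = ["fonts/serif.ttf", "fonts/sans.ttf"]  # hypothetical font files
for _ in range(3):
    print(sample_render_params(rng, fonts))
```

A renderer would then draw the text with these parameters and apply the degradations in order (rotation, noise, JPEG re-encoding). Keeping the sampling separate from the drawing makes each sample's settings easy to log and reproduce.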

How to Get Started


import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("thekamilya/kazakh-trocr-fine-tuned")
model = VisionEncoderDecoderModel.from_pretrained("thekamilya/kazakh-trocr-fine-tuned")


# Use the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load image
image = Image.open("zheke.jpg").convert("RGB")

# Preprocess the image and move the tensor to the same device as the model
pixel_values = processor(
    images=image,
    return_tensors="pt"
).pixel_values.to(device)

# Run inference without tracking gradients
with torch.no_grad():
    generated_ids = model.generate(pixel_values)

# Decode text
generated_text = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]

print(f"Recognized Text: {generated_text}")

Model size: 0.3B params (Safetensors)
Tensor type: F32

Model tree for thekamilya/kazakh-trocr-fine-tuned

Base model: microsoft/trocr-base-handwritten
Finetuned: kazars24/trocr-base-handwritten-ru
Finetuned: this model

Dataset used to train thekamilya/kazakh-trocr-fine-tuned: thekamilya/kazakh-printed-dataset

Evaluation results

Character Error Rate on Kazakh Printed Dataset (self-reported): 3.700
Exact Match (%) on Kazakh Printed Dataset (self-reported): 48.620
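The card does not state how these metrics were computed. Under the usual definitions, character error rate is the total Levenshtein edit distance divided by the total number of reference characters, and exact match is the share of predictions identical to their reference. The sketch below implements both; the function names are illustrative, and reporting CER as a percentage is an assumption.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer_percent(predictions, references):
    """Corpus-level CER: total edits divided by total reference characters."""
    edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    chars = sum(len(r) for r in references)
    return 100.0 * edits / chars


def exact_match_percent(predictions, references):
    """Share of predictions that equal their reference exactly."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Made-up example pairs, not taken from the evaluation set.
references = ["қазақ тілі", "мектеп"]
predictions = ["казак тілі", "мектеп"]
print(cer_percent(predictions, references))
print(exact_match_percent(predictions, references))
```

A low CER alongside a much lower exact-match rate is expected for line-level OCR, since a single wrong character is enough to make an entire line count as a mismatch.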