Kazakh Printed TrOCR

This model is a fine-tuned version of kazars24/trocr-base-handwritten-ru, optimized for recognizing printed Kazakh text. It uses the TrOCR (Transformer-based Optical Character Recognition) architecture, which pairs a Vision Transformer (ViT) encoder with a RoBERTa-based decoder.

Model Description

The model was adapted to handle Kazakh-specific Cyrillic characters (ә, ғ, қ, ң, ө, ұ, ү, һ, і) by resizing token embeddings and training on synthetic data.

  • Developed by: Kamilya Nazarkhanova
  • Model type: VisionEncoderDecoder (TrOCR)
  • Language(s): Kazakh (kk)
  • Finetuned from: kazars24/trocr-base-handwritten-ru
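The embedding resize mentioned above can be sketched roughly as follows. This is a minimal illustration, not the exact training code: the helper names `kazakh_tokens` and `extend_vocab` are hypothetical, and adding upper-case forms as separate whole tokens is an assumption. The `add_tokens` and `resize_token_embeddings` calls are standard transformers methods.

```python
# The nine Kazakh-specific Cyrillic letters listed in the model description.
KAZAKH_LETTERS = "әғқңөұүһі"


def kazakh_tokens():
    """Return the lower-case letters followed by their upper-case forms."""
    return list(KAZAKH_LETTERS) + [ch.upper() for ch in KAZAKH_LETTERS]


def extend_vocab(model, processor):
    """Add the Kazakh letters to the tokenizer and resize the decoder to match.

    `model` is expected to be a VisionEncoderDecoderModel and `processor`
    a TrOCRProcessor (assumption). Returns the number of tokens added.
    """
    tokenizer = processor.tokenizer
    # add_tokens skips entries already in the vocabulary and returns the count added.
    num_added = tokenizer.add_tokens(kazakh_tokens())
    if num_added > 0:
        # Resize on the decoder; the image encoder has no token embeddings.
        model.decoder.resize_token_embeddings(len(tokenizer))
    return num_added
```

The newly added embedding rows start out untrained, which is why the resize has to be followed by fine-tuning on text that contains these letters.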

Training Data & Lineage

The model was trained on thekamilya/kazakh-printed-dataset, which was synthetically generated using text from the ISSAI KazPARC corpus.

Data Generation Pipeline:

To overcome the scarcity of labeled Kazakh OCR data, I built a synthetic data generator that randomizes the following properties of each rendered sample:

  • Environmental Simulation: Implemented random "Light" and "Dark" mode background logic.
  • Stylistic Diversity: Randomized font selection, sizes, and text "jitter" to improve spatial invariance.
  • Optical Degradation: Applied stochastic Gaussian noise, random rotations, and varying JPEG compression artifacts (30–95%) to simulate real-world document quality.
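A pipeline along these lines could be sketched with Pillow as shown below. This is an illustration under stated assumptions, not the generator actually used: the names `RenderParams`, `sample_params`, and `render` are hypothetical, the font-size, jitter, rotation, and noise ranges are made up, and the 30–95 range is interpreted here as Pillow's JPEG `quality` setting.

```python
import io
import random
from dataclasses import dataclass


@dataclass
class RenderParams:
    font_path: str
    font_size: int
    dark_mode: bool
    jitter: tuple
    angle: float
    noise_sigma: float
    jpeg_quality: int


def sample_params(font_paths, rng):
    """Draw one random rendering configuration (all ranges but JPEG are assumed)."""
    return RenderParams(
        font_path=rng.choice(font_paths),              # stylistic diversity
        font_size=rng.randint(18, 40),
        dark_mode=rng.random() < 0.5,                  # "Light" vs "Dark" mode
        jitter=(rng.randint(-4, 4), rng.randint(-4, 4)),
        angle=rng.uniform(-3.0, 3.0),                  # small random rotation
        noise_sigma=rng.uniform(0.0, 12.0),            # Gaussian noise strength
        jpeg_quality=rng.randint(30, 95),              # range from the text
    )


def render(text, p, pad=16):
    """Render `text` as a degraded RGB image according to `p`."""
    from PIL import Image, ImageChops, ImageDraw, ImageFont  # needs Pillow

    font = ImageFont.truetype(p.font_path, p.font_size)
    left, top, right, bottom = font.getbbox(text)
    if p.dark_mode:
        bg, fg = (20, 20, 20), (235, 235, 235)
    else:
        bg, fg = (255, 255, 255), (0, 0, 0)
    img = Image.new("RGB", (right - left + 2 * pad, bottom - top + 2 * pad), bg)
    # Jitter shifts the text inside the padding so its position varies.
    origin = (pad - left + p.jitter[0], pad - top + p.jitter[1])
    ImageDraw.Draw(img).text(origin, text, font=font, fill=fg)
    img = img.rotate(p.angle, expand=True, fillcolor=bg)
    # effect_noise yields Gaussian noise centered at 128; subtracting 128 makes it additive.
    noise = Image.effect_noise(img.size, p.noise_sigma).convert("RGB")
    img = ImageChops.add(img, noise, scale=1.0, offset=-128)
    # Round-trip through JPEG to introduce compression artifacts.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=p.jpeg_quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```

For example, `render("Қазақстан", sample_params(["/path/to/font.ttf"], random.Random(0)))` would return one training image, where the font path is a placeholder for any TrueType font that covers Kazakh Cyrillic.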

How to Get Started


import torch
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("thekamilya/kazakh-trocr-fine-tuned")
model = VisionEncoderDecoderModel.from_pretrained("thekamilya/kazakh-trocr-fine-tuned")


# Use the GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Load image
image = Image.open("zheke.jpg").convert("RGB")

# Prepare input and move it to the same device as the model
pixel_values = processor(
    images=image,
    return_tensors="pt"
).pixel_values.to(device)

# Run inference without tracking gradients
with torch.no_grad():
    generated_ids = model.generate(pixel_values)

# Decode text
generated_text = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]

print(f"Recognized Text: {generated_text}")
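
To sanity-check the recognized text against a known transcription, a character error rate (CER) can be computed with a small helper such as the one below. This is a generic sketch rather than part of the model's own tooling: the function names `edit_distance` and `cer` are illustrative, and no normalization (case folding, whitespace cleanup) is applied.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)
```

A call such as `cer(expected_text, generated_text)` returns 0.0 for an exact match. Confusing Kazakh letters with their Russian look-alikes (for example қ read as к) counts as a substitution error, which makes the metric useful for checking that the Kazakh-specific characters are recognized.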