---
license: mit
language:
- kk
datasets:
- issai/kaznerd
base_model:
- Eraly-ml/KazBERT
pipeline_tag: token-classification
library_name: transformers
---
# KazBERT for Named Entity Recognition (NER)

This model is a fine-tuned version of **KazBERT** (a specialized BERT model for the Kazakh language) on the **KazNERD** dataset. It is designed to identify and categorize named entities in Kazakh text into 25 distinct classes.

## Model Description

While multilingual models such as XLM-RoBERTa cover many languages, **KazBERT-NER** is built on an encoder trained specifically for Kazakh. It has significantly fewer parameters than "Large" multilingual models yet achieves competitive performance on KazNERD. Rare, culture-specific categories such as ADAGE remain the hardest for the model (see the per-category results).

* **Base Model:** [Eraly-ml/KazBERT](https://huggingface.co/Eraly-ml/KazBERT)
* **Task:** Token Classification (NER)
* **Dataset:** [issai/kaznerd](https://huggingface.co/datasets/issai/kaznerd)
* **Language:** Kazakh (kk)

## Training Hyperparameters

The model was fine-tuned with the settings below, chosen to keep training stable and to preserve the knowledge KazBERT acquired during pretraining:

* **Learning Rate:** 
* **Batch Size:** 32
* **Epochs:** 8
* **Optimizer:** AdamW
* **Seed:** 1
* **Label Alignment:** Each word's label was assigned to its first sub-word token only; the remaining sub-word tokens of the word were not labeled (`label_all_tokens=False`).
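The first-token alignment described above can be sketched as follows. This is a minimal illustration, not the training script used for this model: `align_labels` is a hypothetical helper, `word_ids` is assumed to have the shape returned by a fast tokenizer's `BatchEncoding.word_ids()` (one word index per token, `None` for special tokens), and `-100` is assumed as the ignore value because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    label_all_tokens: bool = False,
) -> List[int]:
    """Map word-level label ids onto sub-word tokens (hypothetical helper)."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] never receive a label.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-word token of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later sub-word tokens are ignored unless label_all_tokens is set.
            aligned.append(word_labels[word_id] if label_all_tokens else IGNORE_INDEX)
        previous = word_id
    return aligned


# Made-up input: [CLS], three words (the second split into two sub-words), [SEP]
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 0, 7]  # arbitrary label ids, one per word
print(align_labels(word_ids, word_labels))  # [-100, 3, 0, -100, 7, -100]
```

With `label_all_tokens=True`, the second sub-word of the split word would receive label `0` instead of `-100`.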

## Evaluation Results

The model scores consistently on both the validation and test sets, which suggests that it generalizes to unseen Kazakh text. The tables below report test-set figures.

### Overall Performance (Test Set)

| Metric | Value |
| --- | --- |
| **F1-Score** | **95.22%** |
| **Precision** | **95.16%** |
| **Recall** | **95.28%** |

### Detailed Metrics by Category (Test Set)

The model excels at identifying core entities such as persons, dates, and monetary values. ADAGE is the clear outlier among the classes listed here, with an F1-score of 42.42%.

| Entity Class | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| **PERSON** | 98.46% | 98.25% | **98.35%** |
| **MONEY** | 98.86% | 98.41% | **98.64%** |
| **GPE** (Geopolitical entity) | 97.05% | 96.21% | **96.63%** |
| **CARDINAL** | 97.44% | 98.30% | **97.87%** |
| **DATE** | 96.79% | 96.90% | **96.84%** |
| **ADAGE** | 50.00% | 36.84% | 42.42% |
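As a quick sanity check, F1 is the harmonic mean of precision and recall, so each F1 value in the tables above can be recomputed from its row. The sketch below does this for the reported numbers; `f1_score` is a hypothetical helper written for this example, and differences in the last digit are expected because the published precision and recall are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the tables above
reported = {
    "OVERALL": (95.16, 95.28, 95.22),
    "PERSON": (98.46, 98.25, 98.35),
    "MONEY": (98.86, 98.41, 98.64),
    "GPE": (97.05, 96.21, 96.63),
    "CARDINAL": (97.44, 98.30, 97.87),
    "DATE": (96.79, 96.90, 96.84),
    "ADAGE": (50.00, 36.84, 42.42),
}

for name, (p, r, f1) in reported.items():
    recomputed = f1_score(p, r)
    print(f"{name:8s} reported={f1:.2f} recomputed={recomputed:.2f}")
```

Every recomputed value lands within 0.01 of the reported one, so the tables are internally consistent.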

## Comparison vs XLM-RoBERTa Large

While models like `xlm-roberta-large-kaznerd` (560M params) may show slightly higher overall F1, KazBERT-NER (~110M params) offers:

* **Efficiency:** Roughly 5x fewer parameters, which generally means faster inference and lower deployment costs.
* **Native Understanding:** Reportedly better performance on culture-specific entities like ADAGE than many multilingual alternatives, although ADAGE remains this model's weakest class (42.42% F1 on the test set).
* **Clean Embeddings:** Contextual representations focused purely on Kazakh syntax and semantics.
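The size difference can be made concrete with some back-of-the-envelope arithmetic. The sketch below uses only the parameter counts quoted above and assumes 4 bytes per parameter (fp32 weights); `fp32_weight_size_gb` is a hypothetical helper, and the result covers weights only, not activations or runtime overhead. Actual latency depends on hardware, sequence length, and batch size.

```python
BYTES_PER_FP32_PARAM = 4  # assumption: weights stored in 32-bit floats


def fp32_weight_size_gb(num_params: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * BYTES_PER_FP32_PARAM / 1e9


xlmr_large_params = 560e6   # as quoted for xlm-roberta-large-kaznerd
kazbert_ner_params = 110e6  # approximate ("~110M")

ratio = xlmr_large_params / kazbert_ner_params
print(f"parameter ratio: {ratio:.1f}x")
print(f"XLM-R Large fp32 weights: {fp32_weight_size_gb(xlmr_large_params):.2f} GB")
print(f"KazBERT-NER fp32 weights: {fp32_weight_size_gb(kazbert_ner_params):.2f} GB")
```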


## How to Use

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_name = "Eraly-ml/KazBERT-NERD"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
example = "Қазақстан Республикасы — Орталық Азияда орналасқан мемлекет."

# Each result is a dict with the keys entity_group, score, word, start, and end
print(nlp(example))
```