---
license: mit
language:
- kk
datasets:
- issai/kaznerd
base_model:
- Eraly-ml/KazBERT
pipeline_tag: token-classification
library_name: transformers
---

# KazBERT for Named Entity Recognition (NER)

This model is a fine-tuned version of **KazBERT** (a BERT model specialized for the Kazakh language) trained on the **KazNERD** dataset. It identifies named entities in Kazakh text and assigns each to one of 25 classes.

## Model Description

Multilingual models such as XLM-RoBERTa cover many languages, whereas **KazBERT-NER** is optimized specifically for Kazakh. Although it has far fewer parameters than "Large" multilingual models, it achieves competitive performance, which makes it an efficient option that retains language-specific knowledge (including in difficult categories such as Adages).

* **Base Model:** [Eraly-ml/KazBERT](https://huggingface.co/Eraly-ml/KazBERT)
* **Task:** Token Classification (NER)
* **Dataset:** [issai/kaznerd](https://huggingface.co/datasets/issai/kaznerd)
* **Language:** Kazakh (kk)

## Training Hyperparameters

Training focused on stability and on preserving the semantic knowledge already present in KazBERT:

* **Learning Rate:**
* **Batch Size:** 32
* **Epochs:** 8
* **Optimizer:** AdamW
* **Seed:** 1
* **Label Alignment:** Each word's label was assigned to its first sub-word token only (`label_all_tokens=False`).

## Evaluation Results

Performance is stable across the validation and test sets, which suggests the model generalizes to unseen Kazakh text.

### Overall Performance (Test Set)

| Metric | Value |
| --- | --- |
| **F1-Score** | **95.22%** |
| **Precision** | **95.16%** |
| **Recall** | **95.28%** |

### Detailed Metrics by Category (Test Set)

The model performs best on core entities such as persons, dates, and monetary values.
| Entity Class | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| **PERSON** | 98.46% | 98.25% | **98.35%** |
| **MONEY** | 98.86% | 98.41% | **98.64%** |
| **GPE** (Geopolitical Entity) | 97.05% | 96.21% | **96.63%** |
| **CARDINAL** | 97.44% | 98.30% | **97.87%** |
| **DATE** | 96.79% | 96.90% | **96.84%** |
| **ADAGE** | 50.00% | 36.84% | 42.42% |

## Comparison vs XLM-RoBERTa Large

Models such as `xlm-roberta-large-kaznerd` (560M params) may show slightly higher overall F1, but KazBERT-NER (~110M params) offers:

* **Efficiency:** About 5x fewer parameters, leading to much faster inference and lower deployment costs.
* **Native Understanding:** Better performance on culture-specific entities like ADAGE compared to many multilingual alternatives.
* **Clean Embeddings:** Contextual representations focused purely on Kazakh syntax and semantics.

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "Eraly-ml/KazBERT-NERD"

# Load the fine-tuned tokenizer and token-classification model from the Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

example = "Қазақстан Республикасы — Орталық Азияда орналасқан мемлекет."
print(nlp(example))
```
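The **Label Alignment** setting listed under Training Hyperparameters (`label_all_tokens=False`) means that only the first sub-word token of each word carries the word's label during training. This card does not include the training script, so the following is only a minimal sketch of that idea under stated assumptions: the helper name `align_labels_to_first_subword` is hypothetical, the `word_ids` list is assumed to have the shape returned by a fast tokenizer's `word_ids()` method (`None` for special tokens, otherwise the index of the source word), and `-100` is used as the ignore value because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Optional

# Assumption: -100 marks positions the loss should skip
# (the default ignore_index of torch.nn.CrossEntropyLoss).
IGNORE_INDEX = -100


def align_labels_to_first_subword(
    word_ids: List[Optional[int]], word_labels: List[int]
) -> List[int]:
    """Hypothetical helper: give each word's label to its first sub-word only."""
    aligned = []
    previous_word_id = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] belong to no word
            aligned.append(IGNORE_INDEX)
        elif word_id != previous_word_id:
            # First sub-word of a new word: keep the word-level label
            aligned.append(word_labels[word_id])
        else:
            # Later sub-words of the same word are ignored (label_all_tokens=False)
            aligned.append(IGNORE_INDEX)
        previous_word_id = word_id
    return aligned


# Illustrative input: three words, the second split into two sub-words.
# The label ids are made up and do not correspond to the model's label map.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [3, 0, 7]
print(align_labels_to_first_subword(word_ids, word_labels))
# → [-100, 3, 0, -100, 7, -100]
```

With this scheme, each word contributes exactly one position to the loss regardless of how many sub-word pieces the tokenizer splits it into.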