---
library_name: transformers
license: apache-2.0
language:
- tr
tags:
- modernbert
- turkish
- encoder
- fill-mask
- masked-language-modeling
- nlp
datasets:
- HuggingFaceFW/fineweb-2
pipeline_tag: fill-mask
model-index:
- name: ModernBERT-TR
  results:
  - task:
      type: text-classification
      name: Text Classification
    dataset:
      type: custom
      name: Turkish NLP Benchmark (11 tasks)
    metrics:
    - type: f1
      value: 60.2
      name: Avg Score (Frozen Linear Probe)
  - task:
      type: text-classification
      name: TabiBench (28 tasks)
    dataset:
      type: custom
      name: TabiBench
    metrics:
    - type: f1
      value: 77.28
      name: Avg Score (Full Fine-Tuning)
---
# ModernBERT-TR

**A Modern Encoder Foundation Model for Turkish**

Besher Alkurdi, Himmet Toprak Kesgin, Muzaffer Kaan Yuce, Mehmet Fatih Amasyali

[Web Page](https://cosmos-ytu.github.io/modernbert-tr-1k/) · Paper (soon) · [Training Code](https://github.com/Cosmos-YTU/ModernBERT) · [Evaluation Code](https://github.com/mrbesher/encoder-fast-eval)

## Overview

ModernBERT-TR is a 150M-parameter Turkish encoder pretrained from scratch on **144.4B tokens** using the [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base) architecture. It uses a custom 50K-token WordPiece tokenizer optimized for Turkish morphology.

**Architecture:** 22 layers, hidden size 768, 12 attention heads, RoPE, GLU, alternating local-global attention, Flash Attention, and sequence-packed training.

## Results

### Frozen Linear Probing (11 Turkish NLP tasks)

| Model | Params | Avg |
|---|---|---|
| **ModernBERT-TR (ours)** | **150M** | **60.2** |
| Turkish-E5-large | 560M | 53.2 |
| mmBERT | 307M | 54.9 |
| TabiBERT | ~150M | 49.1 |
| BERTurk | 111M | 35.3 |

ModernBERT-TR scores +13.1% relative over Turkish-E5-large and +70.3% relative over BERTurk. Against mmBERT, the next-best entry in the table, the gain computed from the listed averages is about +9.7%. It outperforms models up to roughly 3.7x its size (Turkish-E5-large, 560M parameters).

### TabiBench Full Fine-Tuning (28 tasks)

| Model | Params | Avg |
|---|---|---|
| ModernBERT-TR (ours) | 150M | 77.28 |
| **TabiBERT** | **~150M** | **77.58** |
| BERTurk | 110M | 75.96 |

ModernBERT-TR trails TabiBERT by 0.30 points on the overall average but leads in 5 of 8 categories (text classification, STS, NLI, academic understanding, information retrieval). TabiBERT, which was trained on code and math data, leads in code retrieval and QA.

## Usage

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")
model = AutoModelForMaskedLM.from_pretrained("ytu-ce-cosmos/modernbert-tr-base-1k")

# English: "The capital of Türkiye is [MASK]."
text = "Türkiye'nin başkenti [MASK]'dır."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)  # outputs.logits has shape (batch, sequence, vocab)

# Show the five highest-scoring tokens at the [MASK] position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = outputs.logits[0, mask_pos].topk(5).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```

## Training Details

| Setting | Value |
|---|---|
| **Data** | FineWeb-2 Turkish (41.2B tokens) + BertTurk Corpus 5x (31.0B tokens) = 72.2B tokens per epoch, 2 epochs (144.4B tokens total) |
| **Tokenizer** | 50K WordPiece, trained on Turkish data |
| **Optimizer** | StableAdamW, peak LR 2e-4, cosine schedule |
| **Batch size** | 256 sequences (262K tokens/step) |
| **MLM masking** | 30% (train) / 15% (eval) |
| **Hardware** | 4x NVIDIA H100, 623 GPU-hours |
| **Precision** | BF16 mixed precision |
| **Context** | 1,024 tokens |

## Citation

```bibtex
@article{alkurdi2025modernberttr,
  title={ModernBERT-TR: A Modern Encoder Foundation Model for Turkish},
  author={Alkurdi, Besher and Kesgin, Himmet Toprak and Yuce, Muzaffer Kaan and Amasyali, Mehmet Fatih},
  year={2025}
}
```

## Acknowledgments

This work was supported by Yildiz Technical University (FDK-2024-6070) and TUBITAK (124E055). It builds on the [ModernBERT](https://github.com/AnswerDotAI/ModernBERT) codebase and uses [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2) and [BertTurk Corpus](https://github.com/dbmdz/berts) data.
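
## Appendix: Training Budget Arithmetic

The token and batch figures in Training Details can be cross-checked with a few lines of arithmetic. The sketch below is illustrative and not part of the training code. The inputs are copied from the table, while the step count, wall-clock time, and throughput are derived estimates that assume every sequence is packed to the full 1,024-token context and that all 623 GPU-hours went to this single run. All variable names are hypothetical.

```python
# Figures copied from the Training Details table.
FINEWEB2_TR_TOKENS = 41_200_000_000   # FineWeb-2 Turkish
BERTTURK_5X_TOKENS = 31_000_000_000   # BertTurk Corpus 5x
EPOCHS = 2
BATCH_SEQUENCES = 256
CONTEXT_TOKENS = 1024
NUM_GPUS = 4
GPU_HOURS = 623

# Quantities stated in the card, recomputed from the inputs above.
tokens_per_epoch = FINEWEB2_TR_TOKENS + BERTTURK_5X_TOKENS
total_tokens = tokens_per_epoch * EPOCHS
tokens_per_step = BATCH_SEQUENCES * CONTEXT_TOKENS

# Derived estimates (assumptions, not reported in the card):
# fully packed sequences and all GPU-hours spent on this one run.
approx_steps = total_tokens / tokens_per_step
approx_wall_clock_hours = GPU_HOURS / NUM_GPUS
approx_tokens_per_gpu_second = total_tokens / (GPU_HOURS * 3600)

print(f"tokens per epoch:      {tokens_per_epoch / 1e9:.1f}B")
print(f"total tokens:          {total_tokens / 1e9:.1f}B")
print(f"tokens per step:       {tokens_per_step:,}")
print(f"approx. steps:         {approx_steps:,.0f}")
print(f"approx. wall clock:    {approx_wall_clock_hours:.2f} h")
print(f"approx. tokens/GPU/s:  {approx_tokens_per_gpu_second:,.0f}")
```

Under those assumptions the run comes to about 551K steps of 256 sequences and roughly 156 hours of wall-clock time on 4 GPUs.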