ClusterlabAi/101_billion_arabic_words_dataset
ModernAraBERT is adapted from answerdotai/ModernBERT-base via continued pretraining on Arabic corpora (~9.8GB).

from transformers import AutoTokenizer, AutoModelForMaskedLM
name = "gizadatateam/ModernAraBERT"
model = AutoModelForMaskedLM.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
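Since the checkpoint is loaded as a masked language model, a quick sanity check is to ask it to fill a masked token. The following is a minimal sketch, not an official usage recipe: it assumes the checkpoint works with the standard `transformers` `fill-mask` pipeline, and the helper names `top_predictions` and `demo`, as well as the example sentence, are hypothetical.

```python
def top_predictions(fill_mask, text, k=5):
    """Return (token, score) pairs for the single masked position in `text`.

    `fill_mask` is any callable that behaves like a transformers fill-mask
    pipeline: called with a string and `top_k`, it returns a list of dicts
    that contain at least the keys "token_str" and "score".
    """
    results = fill_mask(text, top_k=k)
    return [(r["token_str"].strip(), round(r["score"], 4)) for r in results]


def demo(model_name="gizadatateam/ModernAraBERT"):
    """Download the checkpoint and print its top guesses (needs network access)."""
    from transformers import pipeline  # imported lazily so the helper above stays lightweight

    fill_mask = pipeline("fill-mask", model=model_name)
    # Read the mask token from the tokenizer instead of hard-coding it.
    mask = fill_mask.tokenizer.mask_token
    text = f"عاصمة مصر هي {mask}."  # illustrative sentence: "The capital of Egypt is <mask>."
    for token, score in top_predictions(fill_mask, text, k=5):
        print(f"{token}\t{score}")
```

Calling `demo()` prints the five highest-scoring candidate tokens with their probabilities. Passing the pipeline in as an argument keeps `top_predictions` usable with any fill-mask checkpoint.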
| Model | LABR | HARD | AJGT |
|---|---|---|---|
| AraBERTv1 | 45.35 | 72.65 | 58.01 |
| AraBERTv2 | 45.79 | 67.10 | 53.59 |
| mBERT | 44.18 | 71.70 | 61.55 |
| MARBERT | 45.54 | 67.39 | 60.63 |
| ModernAraBERT | 56.45 | 89.37 | 70.54 |

| Model | Macro-F1 |
|---|---|
| AraBERTv1 | 13.46 |
| AraBERTv2 | 16.77 |
| mBERT | 12.15 |
| MARBERT | 7.42 |
| ModernAraBERT | 28.23 |

| Model | EM |
|---|---|
| AraBERT | 25.36 |
| AraBERTv2 | 26.08 |
| mBERT | 25.12 |
| MARBERT | 23.58 |
| ModernAraBERT | 27.10 |

@inproceedings{<paper_id>,
title={Efficient Adaptation of English Language Models for Low-Resource and Morphologically Rich Languages: The Case of Arabic},
author={Maher and Eldamaty and Ashraf and ElShawi and Mostafa},
booktitle={Proceedings of <conference_name>},
year={2025},
organization={<conference_name>}
}
Base model
answerdotai/ModernBERT-base