---
license: cc-by-sa-4.0
language:
- ar
tags:
- dependency-parsing
- combo
- universal-dependencies
datasets:
- universal_dependencies
model-name: Combo Nlp Xlm Roberta Base Arabic Padt Ud2.17
pipeline_tag: token-classification
---
# COMBO-NLP Model for Arabic
## Model Description
This is an Arabic-language model based on [COMBO-NLP](https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp), an open-source natural language preprocessing system. It performs:
- sentence segmentation (via [LAMBO](https://gitlab.clarin-pl.eu/syntactic-tools/lambo))
- tokenisation (via [LAMBO](https://gitlab.clarin-pl.eu/syntactic-tools/lambo))
- part-of-speech tagging
- morphological analysis
- lemmatisation
- dependency parsing
The Arabic model uses ``FacebookAI/xlm-roberta-base`` as its base encoder and is trained on [UD_Arabic-PADT](https://github.com/UniversalDependencies/UD_Arabic-PADT) (UD v2.17).
## Evaluation
Evaluation was performed on the UD_Arabic-PADT test split using the standard [CoNLL 2018 eval script](https://universaldependencies.org/conll18/conll18_ud_eval.py).
Two evaluation rows are reported:
- **Full-text (F1)**: raw text is segmented by [LAMBO](https://gitlab.clarin-pl.eu/syntactic-tools/lambo), then parsed and compared against the gold annotation. This measures end-to-end pipeline performance, including segmentation quality.
- **Aligned accuracy**: accuracy computed only on tokens that the segmenter identified correctly (aligned tokens). This measures tagging and parsing quality independently of segmentation errors.
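The difference between the two rows can be sketched in a few lines of Python. The formulas follow the scoring used by the CoNLL 2018 evaluation script (precision over system words, recall over gold words, accuracy over aligned words). The function name `score` and all counts below are hypothetical, chosen only for illustration, and are not taken from this model's evaluation.

```python
def score(correct, gold_total, system_total, aligned_total):
    """Return (precision, recall, F1, aligned accuracy) as fractions."""
    precision = correct / system_total
    recall = correct / gold_total
    f1 = 2 * correct / (gold_total + system_total)
    aligned_accuracy = correct / aligned_total
    return precision, recall, f1, aligned_accuracy


# Hypothetical counts (illustrative only):
# 1000 gold words, 1010 predicted words, 980 of them aligned with gold,
# and 950 aligned words carrying the correct label.
precision, recall, f1, aligned_accuracy = score(
    correct=950, gold_total=1000, system_total=1010, aligned_total=980
)
print(f"F1 = {100 * f1:.2f}, aligned accuracy = {100 * aligned_accuracy:.2f}")
# prints "F1 = 94.53, aligned accuracy = 96.94"
```

Because aligned words can never outnumber gold or predicted words, aligned accuracy is always at least as high as F1, which is the pattern visible in the tables below.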
### Morphosyntactic Tagging
| Metric | Tokens | Sentences | Words | UPOS | XPOS | UFeats | AllTags | Lemmas |
| ------ | ------ | --------- | ----- | ---- | ---- | ------ | ------- | ------ |
| Full-text (F1) | 99.78 | 73.33 | 97.39 | 94.75 | 92.50 | 92.59 | 92.01 | 93.03 |
| Aligned accuracy | n/a | n/a | n/a | 97.29 | 94.98 | 95.06 | 94.47 | 95.51 |
### Dependency Parsing
| Metric | UAS | LAS | CLAS | MLAS | BLEX |
| ------ | --- | --- | ---- | ---- | ---- |
| Full-text (F1) | 83.86 | 80.76 | 78.77 | 73.01 | 74.72 |
| Aligned accuracy | 86.11 | 82.92 | 81.19 | 75.26 | 77.02 |
## Usage
Install the library from PyPI (assuming you have a virtual environment created):
```bash
pip install combo-nlp
```
Install the LAMBO segmenter (only needed when passing raw text strings to COMBO):
```bash
pip install --index-url https://pypi.clarin-pl.eu/ lambo
```
```python
from combo import COMBO
# Load a pre-trained model with corresponding Lambo segmenter
nlp = COMBO("Arabic")
# Parse raw text (handles sentence splitting + tokenization)
result = nlp("الثعلب البني السريع يقفز فوق الكلب الكسول.")
# Inspect results
for sentence in result:
    for token in sentence:
        print(f"{token.form:<15} {token.lemma:<15} {token.upos:<8} head={token.head} {token.deprel}")
```
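To hand the parse to other tools, the loop above could be adapted to emit tab-separated rows. The following is a minimal sketch, assuming only the token attributes used in the example (`form`, `lemma`, `upos`, `head`, `deprel`). The helper `to_rows` is hypothetical and not part of COMBO-NLP, token ids are simply assigned by position, and the stand-in tokens are built with `SimpleNamespace` so that the snippet runs without a downloaded model.

```python
from types import SimpleNamespace


def to_rows(sentence):
    """Format each token as ID, FORM, LEMMA, UPOS, HEAD, DEPREL joined by tabs."""
    rows = []
    for index, token in enumerate(sentence, start=1):
        fields = (index, token.form, token.lemma, token.upos, token.head, token.deprel)
        rows.append("\t".join(str(field) for field in fields))
    return rows


# Hand-written stand-in for one parsed sentence (not model output).
sentence = [
    SimpleNamespace(form="Dogs", lemma="dog", upos="NOUN", head=2, deprel="nsubj"),
    SimpleNamespace(form="bark", lemma="bark", upos="VERB", head=0, deprel="root"),
]

rows = to_rows(sentence)
print("\n".join(rows))
```

With a loaded model, `to_rows(sentence)` would be called for each `sentence` in `result`. Note that this is a reduced column set, not the full ten-column CoNLL-U format.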
Refer to the COMBO-NLP and LAMBO documentation for further installation and usage instructions:
- [https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp](https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp)
- [https://gitlab.clarin-pl.eu/syntactic-tools/lambo](https://gitlab.clarin-pl.eu/syntactic-tools/lambo)
## Citation
<!--
```bibtex
@misc{combo_nlp_2026,
title = {COMBO-NLP Models Trained on UD v2.17},
author = {Ulewicz, Michał and Jabłońska, Maja and Klimaszewski, Mateusz and Przybyła, Piotr and Pszenny, Łukasz and Rybak, Piotr and Wiącek, Martyna and Wróblewska, Alina},
year = {2026},
howpublished = {\url{https://huggingface.co/collections/clarin-pl/combo-ud-217-models}}
}
```
-->
## Resources
- COMBO-NLP: [https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp](https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp)
- LAMBO: [https://gitlab.clarin-pl.eu/syntactic-tools/lambo](https://gitlab.clarin-pl.eu/syntactic-tools/lambo)
- UD_Arabic-PADT: [https://github.com/UniversalDependencies/UD_Arabic-PADT](https://github.com/UniversalDependencies/UD_Arabic-PADT)