| --- |
| license: cc-by-sa-4.0 |
| language: |
| - ar |
| tags: |
| - dependency-parsing |
| - combo |
| - universal-dependencies |
| datasets: |
| - universal_dependencies |
| model-name: Combo Nlp Xlm Roberta Base Arabic Padt Ud2.17 |
| pipeline_tag: token-classification |
| --- |
| |
| # COMBO-NLP Model for Arabic |
|
|
| ## Model Description |
|
|
| This is an Arabic-language model based on [COMBO-NLP](https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp), an open-source natural language preprocessing system. It performs: |
|
|
| - sentence segmentation (via [LAMBO](https://gitlab.clarin-pl.eu/syntactic-tools/lambo)) |
| - tokenisation (via [LAMBO](https://gitlab.clarin-pl.eu/syntactic-tools/lambo)) |
| - part-of-speech tagging |
| - morphological analysis |
| - lemmatisation |
| - dependency parsing |
|
|
| The model uses `FacebookAI/xlm-roberta-base` as its base encoder and was trained on [UD_Arabic-PADT](https://github.com/UniversalDependencies/UD_Arabic-PADT) (UD v2.17). |
|
|
| ## Evaluation |
|
|
| Evaluation was performed on the UD_Arabic-PADT test split using the standard [CoNLL 2018 eval script](https://universaldependencies.org/conll18/conll18_ud_eval.py). |
| |
| Two evaluation rows are reported: |
| - **Full-text (F1)**: raw text is segmented by [LAMBO](https://gitlab.clarin-pl.eu/syntactic-tools/lambo), then parsed and compared against the gold annotation. This measures end-to-end pipeline performance, including segmentation quality. |
| - **Aligned accuracy**: accuracy computed only over tokens that the segmenter identified correctly (aligned with gold). This measures tagging and parsing quality on correctly segmented tokens. |
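
The relationship between the two rows can be sketched in a few lines of Python. This is a minimal illustration of how the CoNLL 2018 evaluation defines these scores, not the script itself; the `Score` class and the counts used here are hypothetical and do not come from this model.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """Illustrative counts for one metric (hypothetical, not from the model)."""
    gold_total: int      # units in the gold annotation
    system_total: int    # units produced by the pipeline
    correct: int         # system units that match gold
    aligned_total: int   # words whose segmentation matches gold

    @property
    def precision(self) -> float:
        return self.correct / self.system_total

    @property
    def recall(self) -> float:
        return self.correct / self.gold_total

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall, penalising segmentation errors.
        return 2 * self.correct / (self.system_total + self.gold_total)

    @property
    def aligned_accuracy(self) -> float:
        # Only correctly segmented words enter the denominator.
        return self.correct / self.aligned_total


# Made-up counts: 100 gold words, 102 predicted words, 96 aligned, 90 tagged correctly.
upos = Score(gold_total=100, system_total=102, correct=90, aligned_total=96)
print(f"F1 = {upos.f1:.4f}, aligned accuracy = {upos.aligned_accuracy:.4f}")
```

Under these definitions, full-text F1 equals aligned accuracy multiplied by the Words F1, which is consistent with the tables in this section: for UPOS, 97.29 × 0.9739 ≈ 94.75.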
| |
| ### Morphosyntactic Tagging |
| |
| | Metric | Tokens | Sentences | Words | UPOS | XPOS | UFeats | AllTags | Lemmas | |
| | ------ | ------ | --------- | ----- | ---- | ---- | ------ | ------- | ------ | |
| | Full-text (F1) | 99.78 | 73.33 | 97.39 | 94.75 | 92.50 | 92.59 | 92.01 | 93.03 | |
| | Aligned accuracy | n/a | n/a | n/a | 97.29 | 94.98 | 95.06 | 94.47 | 95.51 | |
| |
| ### Dependency Parsing |
| |
| | Metric | UAS | LAS | CLAS | MLAS | BLEX | |
| | ------ | --- | --- | ---- | ---- | ---- | |
| | Full-text (F1) | 83.86 | 80.76 | 78.77 | 73.01 | 74.72 | |
| | Aligned accuracy | 86.11 | 82.92 | 81.19 | 75.26 | 77.02 | |
| |
| |
| ## Usage |
| |
| Install the library from PyPI (preferably inside a virtual environment): |
| |
| ```bash |
| pip install combo-nlp |
| ``` |
| |
| Install the LAMBO segmenter, which is needed only when passing raw text strings to COMBO: |
| |
| ```bash |
| pip install --index-url https://pypi.clarin-pl.eu/ lambo |
| ``` |
| |
| Then load a pre-trained model and parse raw text: |
| |
| ```python |
| from combo import COMBO |
| |
| # Load a pre-trained model together with its LAMBO segmenter |
| nlp = COMBO("Arabic") |
| |
| # Parse raw text (handles sentence splitting + tokenization) |
| result = nlp("الثعلب البني السريع يقفز فوق الكلب الكسول.") |
| |
| # Inspect results |
| for sentence in result: |
|     for token in sentence: |
|         print(f"{token.form:<15} {token.lemma:<15} {token.upos:<8} head={token.head} {token.deprel}") |
| ``` |
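
Once a sentence is parsed, the `head` and `deprel` values can be grouped into a simple tree view. The following is a minimal sketch under stated assumptions: `Tok` is a hypothetical stand-in dataclass, not COMBO's own token class, the `idx` field and the English example sentence are invented for illustration, and `head == 0` marking the root follows the CoNLL-U convention.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a parsed token (not COMBO's class)."""
    idx: int     # 1-based position in the sentence (assumed field)
    form: str
    head: int    # index of the governing token; 0 marks the root (CoNLL-U)
    deprel: str


def dependents_by_head(sentence):
    """Map each head index to the list of tokens attached to it."""
    children = defaultdict(list)
    for tok in sentence:
        children[tok.head].append(tok)
    return children


# Hand-written toy parse, used only to exercise the helper.
sentence = [
    Tok(1, "She", 2, "nsubj"),
    Tok(2, "reads", 0, "root"),
    Tok(3, "books", 2, "obj"),
]

children = dependents_by_head(sentence)
root = children[0][0]  # the single token whose head is 0
print(root.form, [(t.form, t.deprel) for t in children[root.idx]])
```

With real COMBO output, the same grouping could be applied to the token attributes shown in the snippet above, provided each token's position in the sentence is available.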
| |
| For full installation and usage instructions, refer to the COMBO-NLP and LAMBO documentation: |
| |
| - [https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp](https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp) |
| - [https://gitlab.clarin-pl.eu/syntactic-tools/lambo](https://gitlab.clarin-pl.eu/syntactic-tools/lambo) |
| |
| ## Citation |
| |
| <!-- |
| ```bibtex |
| @misc{combo_nlp_2026, |
| title = {COMBO-NLP Models Trained on UD v2.17}, |
| author = {Ulewicz, Michał and Jabłońska, Maja and Klimaszewski, Mateusz and Przybyła, Piotr and Pszenny, Łukasz and Rybak, Piotr and Wiącek, Martyna and Wróblewska, Alina}, |
| year = {2026}, |
| howpublished = {\url{https://huggingface.co/collections/clarin-pl/combo-ud-217-models}} |
| } |
| ``` |
| --> |
| |
| ## Resources |
| |
| - COMBO-NLP: [https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp](https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp) |
| - LAMBO: [https://gitlab.clarin-pl.eu/syntactic-tools/lambo](https://gitlab.clarin-pl.eu/syntactic-tools/lambo) |
| - UD_Arabic-PADT: [https://github.com/UniversalDependencies/UD_Arabic-PADT](https://github.com/UniversalDependencies/UD_Arabic-PADT) |
|
|
|
|