metadata
license: cc-by-sa-4.0
language:
- ar
tags:
- dependency-parsing
- combo
- universal-dependencies
datasets:
- universal_dependencies
model-name: Combo Nlp Xlm Roberta Base Arabic Padt Ud2.17
pipeline_tag: token-classification
COMBO-NLP Model for Arabic
Model Description
This is an Arabic-language model based on COMBO-NLP, an open-source natural language preprocessing system. It performs:
- sentence segmentation (via LAMBO)
- tokenisation (via LAMBO)
- part-of-speech tagging
- morphological analysis
- lemmatisation
- dependency parsing
The Arabic model uses FacebookAI/xlm-roberta-base as its base encoder and is trained on UD_Arabic-PADT (UD v2.17).
Evaluation
Evaluation was performed on the UD_Arabic-PADT test split using the official CoNLL 2018 shared task evaluation script (conll18_ud_eval.py).
Two evaluation rows are reported:
- Full-text (F1): raw text is segmented by LAMBO, then parsed and compared against the gold annotation. This measures end-to-end pipeline performance, including segmentation quality.
- Aligned accuracy: accuracy on correctly segmented (aligned) tokens. This measures parsing quality on the tokens that the segmenter identified correctly.
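The difference between the two rows can be illustrated with a small sketch. It is a simplification, not the evaluation script itself: words are aligned only when their character spans match exactly, whereas the CoNLL 2018 script also handles multi-word tokens. The `Word` class, the `score` function and the toy data are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    start: int  # character offset where the word starts
    end: int    # character offset just past the word
    upos: str   # the tag being evaluated


def score(gold: list[Word], system: list[Word]) -> dict[str, float]:
    """Return full-text F1 and aligned accuracy for one tag column."""
    gold_by_span = {(w.start, w.end): w for w in gold}
    # A system word counts as aligned when a gold word has the same span.
    aligned = [(gold_by_span[(w.start, w.end)], w)
               for w in system if (w.start, w.end) in gold_by_span]
    correct = sum(1 for g, s in aligned if g.upos == s.upos)
    precision = correct / len(system)  # correct over all system words
    recall = correct / len(gold)       # correct over all gold words
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return {
        "f1": f1,
        "aligned_accuracy": correct / len(aligned) if aligned else 0.0,
    }


# Toy data: the segmenter merged the last two gold words into one,
# and mistagged the second word.
gold = [Word(0, 3, "NOUN"), Word(3, 7, "ADJ"), Word(7, 9, "ADP"), Word(9, 12, "NOUN")]
system = [Word(0, 3, "NOUN"), Word(3, 7, "NOUN"), Word(7, 12, "NOUN")]
result = score(gold, system)
print(result)
```

In this sketch a segmentation error lowers F1 (the merged word can never be correct) but leaves aligned accuracy untouched, because the merged word is simply excluded from the aligned set.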
Morphosyntactic Tagging
| Metric | Tokens | Sentences | Words | UPOS | XPOS | UFeats | AllTags | Lemmas |
|---|---|---|---|---|---|---|---|---|
| Full-text (F1) | 99.78 | 73.33 | 97.39 | 94.75 | 92.50 | 92.59 | 92.01 | 93.03 |
| Aligned accuracy | n/a | n/a | n/a | 97.29 | 94.98 | 95.06 | 94.47 | 95.51 |
Dependency Parsing
| Metric | UAS | LAS | CLAS | MLAS | BLEX |
|---|---|---|---|---|---|
| Full-text (F1) | 83.86 | 80.76 | 78.77 | 73.01 | 74.72 |
| Aligned accuracy | 86.11 | 82.92 | 81.19 | 75.26 | 77.02 |
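The two rows can be cross-checked against each other. Assuming the CoNLL 2018 script's definitions (F1 = 2 × correct / (gold + system) and aligned accuracy = correct / aligned), the full-text F1 of a metric computed over all words should equal its aligned accuracy multiplied by the Words F1. The sketch below applies that check to the figures reported above. CLAS, MLAS and BLEX are left out because they are computed over content words only, so the all-words Words F1 is not the right scaling factor for them.

```python
# Values copied from the two tables above (percentages).
WORDS_F1 = 97.39  # "Words" column, full-text row

full_text_f1 = {
    "UPOS": 94.75, "XPOS": 92.50, "UFeats": 92.59, "AllTags": 92.01,
    "Lemmas": 93.03, "UAS": 83.86, "LAS": 80.76,
}
aligned_accuracy = {
    "UPOS": 97.29, "XPOS": 94.98, "UFeats": 95.06, "AllTags": 94.47,
    "Lemmas": 95.51, "UAS": 86.11, "LAS": 82.92,
}

gaps = {}
for name, acc in aligned_accuracy.items():
    # Aligned accuracy scaled by the word-alignment F1.
    expected = acc * WORDS_F1 / 100
    gaps[name] = abs(expected - full_text_f1[name])
    print(f"{name:<8} expected={expected:6.2f} reported={full_text_f1[name]:6.2f}")
```

Small differences are expected because the published figures are rounded to two decimal places.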
Usage
Install the library from PyPI (preferably inside a virtual environment):
pip install combo-nlp
Install the LAMBO segmenter (only needed when passing raw text strings to COMBO):
pip install --index-url https://pypi.clarin-pl.eu/ lambo
from combo import COMBO

# Load a pre-trained model together with its corresponding LAMBO segmenter
nlp = COMBO("Arabic")

# Parse raw text (handles sentence splitting and tokenisation)
result = nlp("الثعلب البني السريع يقفز فوق الكلب الكسول.")

# Inspect the results, one token per line
for sentence in result:
    for token in sentence:
        print(f"{token.form:<15} {token.lemma:<15} {token.upos:<8} head={token.head} {token.deprel}")
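Once a sentence is parsed, the `head` values can be turned into a tree structure. The following is a minimal sketch under stated assumptions: `head` is taken to be a 1-based index into the sentence with 0 marking the root (the Universal Dependencies convention), and iterating over a sentence is taken to yield syntactic words in order. `Tok` is a hypothetical stand-in class used so the example runs without a model, and `children_by_head` is an illustrative helper, not part of COMBO-NLP.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in exposing the attributes used in the loop above."""
    form: str
    head: int    # assumed: 1-based index of the head word, 0 for the root
    deprel: str


def children_by_head(sentence):
    """Map each head index to the 1-based indices of its dependents."""
    children = defaultdict(list)
    for idx, token in enumerate(sentence, start=1):
        children[token.head].append(idx)
    return dict(children)


# Hand-written example, not model output.
sentence = [Tok("The", 2, "det"), Tok("fox", 3, "nsubj"), Tok("jumps", 0, "root")]
tree = children_by_head(sentence)
root_idx = tree[0][0]  # the single dependent of the artificial root 0
print(sentence[root_idx - 1].form, tree)
```

The same function should work on parser output if the assumptions above hold for the objects COMBO returns; that is worth verifying against the COMBO-NLP documentation before relying on it.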
Refer to the COMBO-NLP documentation for installation and usage instructions:
- https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp
- https://gitlab.clarin-pl.eu/syntactic-tools/lambo