license: cc-by-sa-4.0
language:
  - ar
tags:
  - dependency-parsing
  - combo
  - universal-dependencies
datasets:
  - universal_dependencies
model-name: Combo Nlp Xlm Roberta Base Arabic Padt Ud2.17
pipeline_tag: token-classification

COMBO-NLP Model for Arabic

Model Description

This is an Arabic-language model based on COMBO-NLP, an open-source natural language preprocessing system. It performs:

sentence segmentation (via LAMBO)
tokenisation (via LAMBO)
part-of-speech tagging
morphological analysis
lemmatisation
dependency parsing

The Arabic model uses FacebookAI/xlm-roberta-base as its base encoder and is trained on UD_Arabic-PADT (UD v2.17).

Evaluation

Evaluation was performed on the UD_Arabic-PADT test split using the standard CoNLL 2018 eval script.

Two evaluation rows are reported:

Full-text (F1): raw text is segmented by LAMBO, then parsed and compared against the gold annotation. This measures end-to-end pipeline performance, including segmentation quality.
Aligned accuracy: accuracy computed only over tokens that the segmenter identified correctly (aligned tokens). This measures tagging and parsing quality independently of segmentation errors.
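The difference between the two rows can be sketched in a few lines of Python. This is a simplified illustration, not the CoNLL 2018 script itself: the `Word` class and `score_upos` function are hypothetical names, words are matched by exact character span, and the official script uses a more elaborate alignment for multi-word tokens.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    start: int  # character offset where the word starts
    end: int    # character offset just past the word
    upos: str   # UPOS tag (gold or predicted)


def score_upos(gold: list[Word], system: list[Word]) -> dict[str, float]:
    """Return full-text F1 and aligned accuracy for UPOS, in percent."""
    gold_by_span = {(w.start, w.end): w for w in gold}
    # A system word is "aligned" when a gold word covers exactly the same span.
    aligned = [
        (gold_by_span[(w.start, w.end)], w)
        for w in system
        if (w.start, w.end) in gold_by_span
    ]
    correct = sum(g.upos == s.upos for g, s in aligned)

    total = len(gold) + len(system)
    # F1 penalises segmentation errors: unaligned words count against it.
    f1 = 2 * correct / total if total else 0.0
    # Aligned accuracy ignores words the segmenter got wrong.
    aligned_accuracy = correct / len(aligned) if aligned else 0.0
    return {"f1": 100 * f1, "aligned_accuracy": 100 * aligned_accuracy}


# Gold has four words; the system merged the second and third into one.
gold = [Word(0, 3, "NOUN"), Word(3, 6, "VERB"), Word(6, 9, "NOUN"), Word(9, 12, "ADP")]
system = [Word(0, 3, "NOUN"), Word(3, 9, "VERB"), Word(9, 12, "ADP")]
scores = score_upos(gold, system)
print(scores)
```

In this toy case both aligned words are tagged correctly, so aligned accuracy is 100, while the segmentation error pulls the full-text F1 down to 2·2/(4+3), roughly 57.1.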

Morphosyntactic Tagging

Metric	Tokens	Sentences	Words	UPOS	XPOS	UFeats	AllTags	Lemmas
Full-text (F1)	99.78	73.33	97.39	94.75	92.50	92.59	92.01	93.03
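Aligned accuracy	0.00	0.00	0.00	97.29	94.98	95.06	94.47	95.51

The 0.00 entries for Tokens, Sentences and Words are placeholders: the CoNLL 2018 eval script does not define aligned accuracy for segmentation metrics.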

Dependency Parsing

Metric	UAS	LAS	CLAS	MLAS	BLEX
Full-text (F1)	83.86	80.76	78.77	73.01	74.72
Aligned accuracy	86.11	82.92	81.19	75.26	77.02
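As a reminder of what the first two columns measure, UAS counts words whose predicted head is correct, and LAS additionally requires the correct dependency relation. The sketch below assumes the words are already aligned one-to-one; `attachment_scores` is a hypothetical helper, and relation subtypes (the part after `:`) are ignored, which is how the CoNLL 2018 script compares labels. CLAS, MLAS and BLEX are not covered here.

```python
from __future__ import annotations


def attachment_scores(
    gold: list[tuple[int, str]], system: list[tuple[int, str]]
) -> tuple[float, float]:
    """Return (UAS, LAS) in percent for aligned (head, deprel) pairs."""
    if len(gold) != len(system):
        raise ValueError("gold and system must contain the same number of words")
    if not gold:
        return 0.0, 0.0

    def universal(deprel: str) -> str:
        # Drop language-specific subtypes, e.g. "nmod:poss" -> "nmod".
        return deprel.split(":")[0]

    uas_hits = 0
    las_hits = 0
    for (g_head, g_rel), (s_head, s_rel) in zip(gold, system):
        if g_head == s_head:
            uas_hits += 1
            if universal(g_rel) == universal(s_rel):
                las_hits += 1
    return 100 * uas_hits / len(gold), 100 * las_hits / len(gold)


# Toy four-word sentence: one wrong label (word 3) and one wrong head (word 4).
gold_arcs = [(2, "nsubj"), (0, "root"), (2, "obj"), (3, "amod")]
system_arcs = [(2, "nsubj"), (0, "root"), (2, "obl"), (2, "amod")]
uas, las = attachment_scores(gold_arcs, system_arcs)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

Three of the four heads are correct (UAS 75), but only two words have both the correct head and the correct label (LAS 50).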

Usage

Install the library from PyPI (assuming you have a virtual environment created):

pip install combo-nlp

Install the LAMBO segmenter (only needed when passing raw text strings to COMBO):

pip install --index-url https://pypi.clarin-pl.eu/ lambo
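To confirm that both packages landed in the active environment before loading a model, a quick check along these lines may help. It assumes the importable module names are `combo` and `lambo`; `combo` matches the import used in the usage example, while `lambo` is an assumption based on the package name.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Module names are assumptions; adjust them if your installation differs.
for name in ("combo", "lambo"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```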

from combo import COMBO

# Load a pre-trained model together with the corresponding LAMBO segmenter
nlp = COMBO("Arabic")

# Parse raw text (handles sentence splitting + tokenization)
result = nlp("الثعلب البني السريع يقفز فوق الكلب الكسول.")

# Inspect results
for sentence in result:
    for token in sentence:
        print(f"{token.form:<15} {token.lemma:<15} {token.upos:<8} head={token.head}  {token.deprel}")
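The parsed tokens can also be turned into explicit dependency edges. The sketch below is an illustration under stated assumptions rather than part of the COMBO-NLP API: `dependency_edges` is a hypothetical helper, it relies only on the `form`, `head` and `deprel` attributes used above, and it assumes `head` is a 1-based index into the sentence with 0 marking the root (the Universal Dependencies convention). Stand-in tokens are used so that the snippet runs without a model.

```python
from types import SimpleNamespace


def dependency_edges(sentence):
    """Return (dependent form, relation, head form) triples for one sentence."""
    tokens = list(sentence)
    edges = []
    for token in tokens:
        head = int(token.head)
        # Assumption: head 0 is the artificial root, other heads are 1-based.
        head_form = "ROOT" if head == 0 else tokens[head - 1].form
        edges.append((token.form, token.deprel, head_form))
    return edges


# Stand-in tokens mimicking the attributes printed in the usage example.
demo_sentence = [
    SimpleNamespace(form="She", head=2, deprel="nsubj"),
    SimpleNamespace(form="reads", head=0, deprel="root"),
    SimpleNamespace(form="books", head=2, deprel="obj"),
]
edges = dependency_edges(demo_sentence)
for dependent, relation, head_form in edges:
    print(f"{dependent} --{relation}--> {head_form}")
```

With a real model, the same function could be applied to each sentence in `result`, provided the token objects behave as assumed here.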

Refer to the COMBO-NLP documentation (linked under Resources) for installation and usage instructions.

Citation

Resources

COMBO-NLP: https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp
LAMBO: https://gitlab.clarin-pl.eu/syntactic-tools/lambo
UD_Arabic-PADT: https://github.com/UniversalDependencies/UD_Arabic-PADT