COMBO-NLP Model for Ancient_Hebrew

Model Description

This is an Ancient Hebrew model based on COMBO-NLP, an open-source natural language preprocessing system. It performs:

sentence segmentation (via LAMBO)
tokenisation (via LAMBO)
part-of-speech tagging
morphological analysis
lemmatisation
dependency parsing

The Ancient_Hebrew model uses FacebookAI/xlm-roberta-base as its base encoder and is trained on UD_Ancient_Hebrew-PTNK (UD v2.17).

Evaluation

Evaluation was performed on the UD_Ancient_Hebrew-PTNK test split using the standard CoNLL 2018 eval script.

Two evaluation rows are reported:

Full-text (F1): raw text is segmented by LAMBO, then parsed and compared against gold — measures end-to-end pipeline performance including segmentation quality.
Aligned accuracy: accuracy on correctly segmented (aligned) tokens — measures parsing quality on tokens that were correctly identified by the segmenter.
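The difference between the two rows can be sketched in a few lines. This is a minimal illustration, assuming the usual definitions of the CoNLL 2018 evaluation: F1 is computed from the counts of gold, system, and correct units, while aligned accuracy divides the correct count by the number of system words that could be aligned to gold words. The class name, field names, and counts below are hypothetical and are not taken from this model's evaluation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Score:
    """Counts for one metric (illustrative, not the official script)."""
    gold_total: int                      # units in the gold annotation
    system_total: int                    # units produced by the system
    correct: int                         # system units that match gold
    aligned_total: Optional[int] = None  # system words aligned to gold words

    @property
    def precision(self) -> float:
        return self.correct / self.system_total if self.system_total else 0.0

    @property
    def recall(self) -> float:
        return self.correct / self.gold_total if self.gold_total else 0.0

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall, written in terms of counts.
        total = self.system_total + self.gold_total
        return 2 * self.correct / total if total else 0.0

    @property
    def aligned_accuracy(self) -> Optional[float]:
        # Undefined when no alignment count is available.
        if not self.aligned_total:
            return None
        return self.correct / self.aligned_total


# Hypothetical counts, chosen only to illustrate the two rows.
upos = Score(gold_total=1000, system_total=1010, correct=880, aligned_total=900)
print(f"Full-text F1:     {100 * upos.f1:.2f}")
print(f"Aligned accuracy: {100 * upos.aligned_accuracy:.2f}")
```

Because segmentation errors shrink the set of aligned words, the full-text F1 is penalised by them, whereas aligned accuracy only looks at words the segmenter got right.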

Morphosyntactic Tagging

Metric	Tokens	Sentences	Words	UPOS	XPOS	UFeats	AllTags	Lemmas
Full-text (F1)	99.50	94.60	88.82	86.81	86.96	85.00	84.05	87.32
Aligned accuracy	0.00	0.00	0.00	97.73	97.90	95.70	94.62	98.31

Dependency Parsing

Metric	UAS	LAS	CLAS	MLAS	BLEX
Full-text (F1)	74.99	72.91	65.15	59.40	63.47
Aligned accuracy	84.42	82.08	77.47	70.63	75.47
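The two rows can be cross-checked against each other. Assuming the standard CoNLL 2018 definitions, where the Words F1 is 2 x aligned / (gold + system) and a metric's aligned accuracy is correct / aligned, the full-text F1 of any metric computed over all words should be roughly Words F1 x aligned accuracy. The sketch below applies that assumed relation to the values reported in the tables above. CLAS, MLAS, and BLEX are left out because they are computed over content words only, so the relation is not expected to hold for them.

```python
# Values copied from the tables above (percentages).
WORDS_F1 = 88.82

# metric -> (reported full-text F1, reported aligned accuracy)
REPORTED = {
    "UPOS": (86.81, 97.73),
    "XPOS": (86.96, 97.90),
    "UFeats": (85.00, 95.70),
    "AllTags": (84.05, 94.62),
    "Lemmas": (87.32, 98.31),
    "UAS": (74.99, 84.42),
    "LAS": (72.91, 82.08),
}


def predicted_full_text_f1(words_f1: float, aligned_accuracy: float) -> float:
    """Full-text F1 implied by segmentation quality and aligned accuracy."""
    return words_f1 * aligned_accuracy / 100.0


# Differences between reported and implied scores; small values indicate
# that the gap between the rows is explained by word segmentation alone.
DIFFERENCES = {
    name: abs(f1 - predicted_full_text_f1(WORDS_F1, acc))
    for name, (f1, acc) in REPORTED.items()
}

for name, diff in DIFFERENCES.items():
    predicted = predicted_full_text_f1(WORDS_F1, REPORTED[name][1])
    print(f"{name:<8} reported={REPORTED[name][0]:.2f} implied={predicted:.2f} diff={diff:.3f}")
```

Under this reading, the drop from aligned accuracy to full-text F1 comes from word segmentation (Words F1 of 88.82) and not from the tagger or parser.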

Usage

Install the library from PyPI (assuming you have a virtual environment created):

pip install combo-nlp

Install the LAMBO segmenter, which is only needed when passing raw text strings to COMBO:

pip install --index-url https://pypi.clarin-pl.eu/ lambo

from combo import COMBO

# Load a pre-trained model with corresponding Lambo segmenter
nlp = COMBO("Ancient_Hebrew")

# Parse raw text (handles sentence splitting + tokenization)
result = nlp("השועל החום המהיר קופץ מעל הכלב העצלן.")

# Inspect results
for sentence in result:
    for token in sentence:
        print(f"{token.form:<15} {token.lemma:<15} {token.upos:<8} head={token.head}  {token.deprel}")
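To keep the parsed output for later processing, the same token attributes can be written out as tab-separated lines. The helper below is a minimal sketch, not part of COMBO-NLP: the function name `to_tsv_lines` and the `StubToken` class are hypothetical, and the word index is assumed to be the token's 1-based position in the sentence, which may not hold for sentences containing multiword tokens. It relies only on the attributes used in the example above (`form`, `lemma`, `upos`, `head`, `deprel`).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class StubToken:
    """Hypothetical stand-in exposing the attributes used above."""
    form: str
    lemma: str
    upos: str
    head: int
    deprel: str


def to_tsv_lines(sentence: Iterable) -> List[str]:
    """Render one sentence as CoNLL-U-like tab-separated lines.

    Assumes the i-th token in iteration order has index i (1-based).
    """
    lines = []
    for index, token in enumerate(sentence, start=1):
        fields = [str(index), token.form, token.lemma, token.upos,
                  str(token.head), token.deprel]
        lines.append("\t".join(fields))
    return lines


# Placeholder tokens, not real model output.
demo_sentence = [
    StubToken("w1", "l1", "NOUN", 2, "nsubj"),
    StubToken("w2", "l2", "VERB", 0, "root"),
]
demo_lines = to_tsv_lines(demo_sentence)
print("\n".join(demo_lines))
```

With a real model, each `sentence` from `result` in the example above could be passed in place of `demo_sentence`, provided its tokens expose the same attributes.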

Refer to the COMBO-NLP documentation for further installation and usage instructions (the repository is linked under Resources).

License

The training data is licensed under cc-by-sa-4.0, as specified by the Universal Dependencies treebank it comes from. For the full license terms, please refer to the LICENSE.txt file in the treebank repository:

UD_Ancient_Hebrew-PTNK LICENSE.txt

Citation

If you use this model, please cite:

Ulewicz, M., Jabłońska, M., Klimaszewski, M., Przybyła, P., Pszenny, Ł., Rybak, P., Wiącek, M., & Wróblewska, A. (2026). COMBO-NLP Models Trained on UD v2.17. Zenodo. https://doi.org/10.5281/zenodo.19650523

@software{combo_nlp_2026,
  author    = {Ulewicz, Michał and Jabłońska, Maja and Klimaszewski, Mateusz and Przybyła, Piotr and Pszenny, Łukasz and Rybak, Piotr and Wiącek, Martyna and Wróblewska, Alina},
  title     = {{COMBO-NLP} Models Trained on {UD} v2.17},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19650523},
  url       = {https://doi.org/10.5281/zenodo.19650523}
}

Treebank References

If you use the UD_Ancient_Hebrew-PTNK treebank data, please also cite:

@inproceedings{swanson-tyers-2022-universal,
    title = "A {U}niversal {D}ependencies Treebank of {A}ncient {H}ebrew",
    author = "Swanson, Daniel  and
      Tyers, Francis",
    booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
    month = jun,
    year = "2022",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://aclanthology.org/2022.lrec-1.252",
    pages = "2353--2361",
    abstract = "In this paper we present the initial construction of a Universal Dependencies treebank with morphological annotations of Ancient Hebrew containing portions of the Hebrew Scriptures (1579 sentences, 27K tokens) for use in comparative study with ancient translations and for analysis of the development of Hebrew syntax. We construct this treebank by applying a rule-based parser (300 rules) to an existing morphologically-annotated corpus with minimal constituency structure and manually verifying the output and present the results of this semi-automated annotation process and some of the annotation decisions made in the process of applying the UD guidelines to a new language.",
}

Resources

COMBO-NLP: https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp
LAMBO: https://gitlab.clarin-pl.eu/syntactic-tools/lambo
UD_Ancient_Hebrew-PTNK: https://github.com/UniversalDependencies/UD_Ancient_Hebrew-PTNK
