COMBO-NLP Model for Ancient_Hebrew
Model Description
This is an Ancient Hebrew model based on COMBO-NLP, an open-source natural language preprocessing system. It performs:
- sentence segmentation (via LAMBO)
- tokenisation (via LAMBO)
- part-of-speech tagging
- morphological analysis
- lemmatisation
- dependency parsing
The Ancient_Hebrew model uses FacebookAI/xlm-roberta-base as its base encoder and is trained on UD_Ancient_Hebrew-PTNK (UD v2.17).
Evaluation
Evaluation was performed on the UD_Ancient_Hebrew-PTNK test split using the standard CoNLL 2018 evaluation script.
Two evaluation rows are reported:
- Full-text (F1): raw text is segmented by LAMBO, then parsed and compared against the gold annotation. This measures end-to-end pipeline performance, including segmentation quality.
- Aligned accuracy: accuracy on aligned tokens, i.e. tokens that the segmenter identified correctly. This measures tagging and parsing quality independently of segmentation errors.
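The two rows differ mainly in their denominator. The sketch below assumes the usual CoNLL 2018 score definitions: F1 is computed over all gold and system words, while aligned accuracy divides by the aligned words only. The `Score` class and the counts are hypothetical, chosen purely for illustration, and are not taken from this model's evaluation.

```python
from dataclasses import dataclass


@dataclass
class Score:
    """Counts behind one metric (hypothetical container, not COMBO-NLP API)."""

    gold_total: int     # words in the gold annotation
    system_total: int   # words produced by the pipeline
    aligned_total: int  # system words that align with a gold word
    correct: int        # aligned words whose label also matches gold

    @property
    def precision(self) -> float:
        return self.correct / self.system_total

    @property
    def recall(self) -> float:
        return self.correct / self.gold_total

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall, written in closed form.
        return 2 * self.correct / (self.gold_total + self.system_total)

    @property
    def aligned_accuracy(self) -> float:
        # Segmentation errors are excluded from the denominator.
        return self.correct / self.aligned_total


# Made-up counts for illustration only.
upos = Score(gold_total=1000, system_total=1020, aligned_total=900, correct=880)
print(f"Full-text F1:     {100 * upos.f1:.2f}")                # 87.13
print(f"Aligned accuracy: {100 * upos.aligned_accuracy:.2f}")  # 97.78
```

With these counts, the 120 system words that do not align with gold lower the F1 score but leave aligned accuracy untouched, which is why the second row is always at least as high as the first.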
Morphosyntactic Tagging
| Metric | Tokens | Sentences | Words | UPOS | XPOS | UFeats | AllTags | Lemmas |
|---|---|---|---|---|---|---|---|---|
| Full-text (F1) | 99.50 | 94.60 | 88.82 | 86.81 | 86.96 | 85.00 | 84.05 | 87.32 |
| Aligned accuracy | n/a | n/a | n/a | 97.73 | 97.90 | 95.70 | 94.62 | 98.31 |
Dependency Parsing
| Metric | UAS | LAS | CLAS | MLAS | BLEX |
|---|---|---|---|---|---|
| Full-text (F1) | 74.99 | 72.91 | 65.15 | 59.40 | 63.47 |
| Aligned accuracy | 84.42 | 82.08 | 77.47 | 70.63 | 75.47 |
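The two rows can be cross-checked against each other. Assuming the CoNLL 2018 definitions (full-text F1 counts correctly labelled aligned words against all gold and system words, and aligned accuracy counts the same words against aligned words only), full-text F1 should be approximately Words F1 multiplied by aligned accuracy for every metric computed over all words. The sketch below applies that check to the figures reported in the tables above. CLAS, MLAS and BLEX are left out because, under the same assumption, they are computed over content words only and so do not share the Words denominator.

```python
# Words F1 from the Full-text row of the tagging table.
WORDS_F1 = 88.82

# metric -> (full-text F1, aligned accuracy), copied from the tables above.
reported = {
    "UPOS": (86.81, 97.73),
    "XPOS": (86.96, 97.90),
    "UFeats": (85.00, 95.70),
    "AllTags": (84.05, 94.62),
    "Lemmas": (87.32, 98.31),
    "UAS": (74.99, 84.42),
    "LAS": (72.91, 82.08),
}

gaps = {}
for metric, (full_text_f1, aligned_acc) in reported.items():
    predicted = WORDS_F1 * aligned_acc / 100
    gaps[metric] = abs(predicted - full_text_f1)
    print(f"{metric:<8} reported={full_text_f1:.2f} predicted={predicted:.2f}")

# Any remaining gap should be explainable by rounding to two decimals.
max_gap = max(gaps.values())
```

Read this way, most of the distance between the two rows comes from word segmentation (Words F1 of 88.82) rather than from tagging or parsing errors on correctly segmented words.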
Usage
Install the library from PyPI (assuming you have a virtual environment created):
pip install combo-nlp
Install the LAMBO segmenter (only needed when passing raw text strings to COMBO):
pip install --index-url https://pypi.clarin-pl.eu/ lambo
from combo import COMBO
# Load a pre-trained model together with its corresponding LAMBO segmenter
nlp = COMBO("Ancient_Hebrew")
# Parse raw text (handles sentence splitting + tokenization)
result = nlp("השועל החום המהיר קופץ מעל הכלב העצלן.")
# Inspect results
for sentence in result:
    for token in sentence:
        print(f"{token.form:<15} {token.lemma:<15} {token.upos:<8} head={token.head} {token.deprel}")
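The `head` and `deprel` attributes printed above are enough to reconstruct the dependency tree. The following is a minimal sketch, not part of the COMBO-NLP API: `dependency_edges` is a hypothetical helper, and it assumes that tokens are iterated in sentence order and that `head` is a 1-based position within the sentence, with 0 marking the root (the CoNLL-U convention). The stand-in tokens carry made-up annotations so the sketch runs without a model.

```python
from types import SimpleNamespace


def dependency_edges(sentence):
    """Return (dependent, relation, head) triples for one parsed sentence.

    Assumption: `head` is a 1-based index into the sentence and 0 is the root.
    """
    tokens = list(sentence)
    edges = []
    for token in tokens:
        head_form = "ROOT" if token.head == 0 else tokens[token.head - 1].form
        edges.append((token.form, token.deprel, head_form))
    return edges


# Stand-in tokens with invented annotations; real tokens would come from nlp(...).
demo_sentence = [
    SimpleNamespace(form="A", lemma="a", upos="NOUN", head=2, deprel="nsubj"),
    SimpleNamespace(form="B", lemma="b", upos="VERB", head=0, deprel="root"),
    SimpleNamespace(form="C", lemma="c", upos="NOUN", head=2, deprel="obj"),
]

for dependent, relation, head in dependency_edges(demo_sentence):
    print(f"{dependent} --{relation}--> {head}")
```

Because the helper only reads `form`, `head` and `deprel`, it should work on any token objects exposing those attributes, provided the indexing assumption holds for the sentence in question.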
Refer to the COMBO-NLP documentation for installation and usage instructions:
- https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp
- https://gitlab.clarin-pl.eu/syntactic-tools/lambo
License
License: cc-by-sa-4.0, derived from the license of the Universal Dependencies treebank used for training. For the full license terms, refer to the LICENSE.txt file in the treebank repository.
Citation
If you use this model, please cite:
Ulewicz, M., Jabłońska, M., Klimaszewski, M., Przybyła, P., Pszenny, Ł., Rybak, P., Wiącek, M., & Wróblewska, A. (2026). COMBO-NLP Models Trained on UD v2.17. Zenodo. https://doi.org/10.5281/zenodo.19650523
@software{combo_nlp_2026,
author = {Ulewicz, Michał and Jabłońska, Maja and Klimaszewski, Mateusz and Przybyła, Piotr and Pszenny, Łukasz and Rybak, Piotr and Wiącek, Martyna and Wróblewska, Alina},
title = {{COMBO-NLP} Models Trained on {UD} v2.17},
year = {2026},
publisher = {Zenodo},
doi = {10.5281/zenodo.19650523},
url = {https://doi.org/10.5281/zenodo.19650523}
}
Treebank References
If you use the UD_Ancient_Hebrew-PTNK treebank data, please also cite:
@inproceedings{swanson-tyers-2022-universal,
title = "A {U}niversal {D}ependencies Treebank of {A}ncient {H}ebrew",
author = "Swanson, Daniel and
Tyers, Francis",
booktitle = "Proceedings of the Thirteenth Language Resources and Evaluation Conference",
month = jun,
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.252",
pages = "2353--2361",
abstract = "In this paper we present the initial construction of a Universal Dependencies treebank with morphological annotations of Ancient Hebrew containing portions of the Hebrew Scriptures (1579 sentences, 27K tokens) for use in comparative study with ancient translations and for analysis of the development of Hebrew syntax. We construct this treebank by applying a rule-based parser (300 rules) to an existing morphologically-annotated corpus with minimal constituency structure and manually verifying the output and present the results of this semi-automated annotation process and some of the annotation decisions made in the process of applying the UD guidelines to a new language.",
}
Resources
- COMBO-NLP: https://gitlab.clarin-pl.eu/syntactic-tools/combo-nlp
- LAMBO: https://gitlab.clarin-pl.eu/syntactic-tools/lambo
- UD_Ancient_Hebrew-PTNK: https://github.com/UniversalDependencies/UD_Ancient_Hebrew-PTNK