OldSlavicLemma
OldSlavicLemma is a neural lemmatizer designed for Early Slavic languages, especially Old Church Slavonic (OCS) and Old East Slavic (OES).
The model is intended for historical NLP, philological research, corpus annotation, lexical normalization, and the processing of morphologically rich ancient languages.
Model description
OldSlavicLemma is a dictionary-free character-level sequence-to-sequence lemmatizer.
It maps an inflected token to its canonical lemma by using the target word together with its surrounding context.
Unlike many traditional lemmatizers, the model does not require POS tags, morphological features, external dictionaries, or hand-written rules. It operates directly on character sequences.
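To make the "target word together with its surrounding context" idea concrete, here is a minimal sketch of how a token and its neighbours could be turned into a single character sequence. The marker characters, the window size, and the function name `build_char_input` are assumptions made for illustration. They do not describe the actual input format used by OldSlavicLemma.

```python
from typing import List

# Hypothetical markers that delimit the target token inside its context.
# The real model's input format is not documented in this card.
TARGET_OPEN = "<"
TARGET_CLOSE = ">"


def build_char_input(tokens: List[str], index: int, window: int = 2) -> List[str]:
    """Return the characters of tokens[index] plus up to `window` tokens on each side."""
    if not 0 <= index < len(tokens):
        raise IndexError("index is outside the sentence")
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    marked = TARGET_OPEN + tokens[index] + TARGET_CLOSE
    # Join with spaces, then split into single characters (Unicode code points).
    return list(" ".join(left + [marked] + right))


sentence = ["искони", "бѣ", "слово"]
chars = build_char_input(sentence, index=1, window=1)
print("".join(chars))  # искони <бѣ> слово
```

A sequence-to-sequence model would consume such a character list and emit the lemma one character at a time, which is why no dictionary or POS tag is needed at inference time.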
Specialization
This model is specialized for Early Slavic languages, including:
- Old Church Slavonic
- Old East Slavic
Early Slavic languages are challenging for standard NLP pipelines because they contain rich inflectional morphology, irregular orthography, dialectal variation, manuscript variation, and limited annotated corpora.
Training data and models
The main Early Slavic model was trained on Universal Dependencies treebanks.
In addition to the specialized Early Slavic model, we evaluated the approach on a large multilingual Universal Dependencies setup covering:
- 115 Universal Dependencies treebanks
- 60+ languages
- historical and modern language datasets
We also provide pretrained models for multiple UD treebanks and languages, allowing users to apply the lemmatizer beyond Early Slavic.
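Universal Dependencies treebanks are distributed in the CoNLL-U format, where each token line has ten tab-separated columns and the second and third columns hold the word form and its lemma. The sketch below shows one way to extract (form, lemma) pairs per sentence for training or evaluation. The function name `read_conllu_pairs` and the sample rows are illustrative assumptions. They are not taken from this project's code or from a specific treebank.

```python
from typing import Iterable, Iterator, List, Tuple


def read_conllu_pairs(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield one list of (form, lemma) pairs per sentence from CoNLL-U lines."""
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword token ranges (1-2) and empty nodes (1.1)
        sentence.append((cols[1], cols[2]))
    if sentence:
        yield sentence


# Illustrative rows only, written by hand for this example.
sample = (
    "# text = бѣ слово\n"
    "1\tбѣ\tбꙑти\tAUX\t_\t_\t0\troot\t_\t_\n"
    "2\tслово\tслово\tNOUN\t_\t_\t1\tnsubj\t_\t_\n"
    "\n"
)
for pairs in read_conllu_pairs(sample.splitlines()):
    print(pairs)
```

To read a real treebank file, pass an open file handle (opened with `encoding="utf-8"`) instead of `sample.splitlines()`.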
Evaluation
OldSlavicLemma achieves strong results on Early Slavic UD treebanks.
On the UD v2.12 Early Slavic datasets, the model outperforms the systems it was compared against on the following treebanks:
- Old Church Slavonic PROIEL
- Old East Slavic Birchbark
- Old East Slavic RNC
- Old East Slavic TOROT
The model also shows strong performance on UD v2.15 Early Slavic treebanks, including:
- PROIEL
- TOROT
- Birchbark
- RNC
- Ruthenian
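This card does not state which metric underlies these results. A common choice for UD lemmatization is token-level exact-match accuracy, and the sketch below assumes that metric. The function name `lemma_accuracy` and the toy inputs are hypothetical and are not part of the released evaluation code.

```python
from typing import Sequence


def lemma_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of tokens whose predicted lemma exactly matches the gold lemma."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        return 0.0  # convention chosen here for empty input
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy example: three of four lemmas match.
print(lemma_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # 0.75
```

Exact matching is case- and diacritic-sensitive, so any orthographic normalization should be applied identically to gold and predicted lemmas before scoring.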
Intended use
This model can be used for:
- lemmatization of Old Church Slavonic
- lemmatization of Old East Slavic
- preprocessing historical Slavic corpora
- corpus search and lexical normalization
- philological and linguistic annotation
- Universal Dependencies-based NLP pipelines
- experiments on low-resource and historical languages
Example use
# Example usage will be added soon.