---
language:
- cu
- orv
tags:
- old-church-slavonic
- old-east-slavic
- early-slavic
- historical-nlp
- lemmatization
- universal-dependencies
- sequence-to-sequence
- low-resource-languages
license: cc-by-4.0
---

# OldSlavicLemma

`OldSlavicLemma` is a neural lemmatizer designed for Early Slavic languages, especially **Old Church Slavonic (OCS)** and **Old East Slavic (OES)**. The model is intended for historical NLP, philological research, corpus annotation, lexical normalization, and the processing of morphologically rich ancient languages.

## Model description

`OldSlavicLemma` is a dictionary-free, character-level sequence-to-sequence lemmatizer. It maps an inflected token to its canonical lemma using the target word together with its surrounding context.

Unlike many traditional lemmatizers, the model does **not** require POS tags, morphological features, external dictionaries, or hand-written rules. It operates directly on character sequences.

## Specialization

This model is specialized for **Early Slavic languages**, including:

- Old Church Slavonic
- Old East Slavic

Early Slavic languages are challenging for standard NLP pipelines because they combine rich inflectional morphology, irregular orthography, dialectal and manuscript variation, and limited annotated corpora.

## Training data and models

The main Early Slavic model was trained on **Universal Dependencies** treebank data.

In addition to the specialized Early Slavic model, we evaluated the approach on a large multilingual Universal Dependencies setup covering:

- **115 Universal Dependencies treebanks**
- **60+ languages**
- historical and modern language datasets

We also provide **pretrained models** for multiple UD treebanks and languages, allowing users to apply the lemmatizer beyond Early Slavic.

## Evaluation

`OldSlavicLemma` achieves strong results on Early Slavic UD treebanks.

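
The card does not say how these results are scored. As a hedged sketch, assuming token-level lemma accuracy (a common choice for UD lemmatization) and gold lemmas read from the LEMMA column of a CoNLL-U file, scoring could look like the following. The helper names `read_conllu_lemmas` and `lemma_accuracy` and the two sample rows are illustrative; they are not part of `OldSlavicLemma` or of any UD treebank release.

```python
def read_conllu_lemmas(text):
    """Return (form, lemma) pairs from CoNLL-U text.

    Comment lines, blank lines, multiword-token ranges (ID like "1-2")
    and empty nodes (ID like "1.1") are skipped.
    """
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # A regular word line has 10 tab-separated columns and an integer ID.
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        pairs.append((cols[1], cols[2]))  # FORM, LEMMA
    return pairs


def lemma_accuracy(gold, predicted):
    """Fraction of tokens whose predicted lemma equals the gold lemma exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy input written for this sketch (not copied from a treebank).
SAMPLE = (
    "# sent_id = toy-1\n"
    "1\tрече\tрещи\tVERB\t_\t_\t0\troot\t_\t_\n"
    "2\tбогъ\tбогъ\tNOUN\t_\t_\t1\tnsubj\t_\t_\n"
)

pairs = read_conllu_lemmas(SAMPLE)
gold = [lemma for _, lemma in pairs]
predicted = ["рещи", "бог"]  # hypothetical system output with one error
print(lemma_accuracy(gold, predicted))  # 0.5
```

Exact string match is strict for historical text, where orthographic variants of the same lemma count as errors unless both sides are normalized first.
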
On UD v2.12 Early Slavic datasets, the model outperforms the compared systems on the following treebanks:

- Old Church Slavonic PROIEL
- Old East Slavic Birchbark
- Old East Slavic RNC
- Old East Slavic TOROT

The model also shows strong performance on UD v2.15 Early Slavic treebanks, including:

- PROIEL
- TOROT
- Birchbark
- RNC
- Ruthenian

## Intended use

This model can be used for:

- lemmatization of Old Church Slavonic
- lemmatization of Old East Slavic
- preprocessing historical Slavic corpora
- corpus search and lexical normalization
- philological and linguistic annotation
- Universal Dependencies-based NLP pipelines
- experiments on low-resource and historical languages

## Example use

```python
# Example usage will be added soon.