---
language:
- cu
- orv
tags:
- old-church-slavonic
- old-east-slavic
- early-slavic
- historical-nlp
- lemmatization
- universal-dependencies
- sequence-to-sequence
- low-resource-languages
license: cc-by-4.0
---

# OldSlavicLemma

`OldSlavicLemma` is a neural lemmatizer designed for Early Slavic languages, especially **Old Church Slavonic (OCS)** and **Old East Slavic (OES)**.

The model is intended for historical NLP, philological research, corpus annotation, lexical normalization, and the processing of morphologically rich ancient languages.

## Model description

`OldSlavicLemma` is a dictionary-free character-level sequence-to-sequence lemmatizer.  
It maps an inflected token to its canonical lemma by using the target word together with its surrounding context.


Unlike many traditional lemmatizers, the model does **not** require POS tags, morphological features, external dictionaries, or hand-written rules. It operates directly on character sequences.
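To make "the target word together with its surrounding context" more concrete, here is a minimal sketch of how a character-level input could be assembled. The marker characters, the `window` parameter, and the function name `build_input` are assumptions for illustration only; the model's actual input format may differ.

```python
from typing import List

# Hypothetical boundary markers around the target token (an assumption,
# not the model's documented input format).
TARGET_OPEN = "<"
TARGET_CLOSE = ">"


def build_input(tokens: List[str], index: int, window: int = 2) -> List[str]:
    """Return the character sequence for tokens[index] plus nearby context."""
    if not 0 <= index < len(tokens):
        raise IndexError("index out of range")
    # Take up to `window` tokens on each side of the target.
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    marked = TARGET_OPEN + tokens[index] + TARGET_CLOSE
    text = " ".join(left + [marked] + right)
    # Character-level: every code point becomes one input symbol.
    return list(text)


sentence = ["искони", "бѣ", "слово"]
print("".join(build_input(sentence, 1, window=1)))  # искони <бѣ> слово
```

A decoder would then emit the lemma one character at a time, so no POS tags or dictionary lookups are needed at this stage.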

## Specialization

The model specializes in **Early Slavic languages**, including:

- Old Church Slavonic
- Old East Slavic


Early Slavic languages are challenging for standard NLP pipelines because they contain rich inflectional morphology, irregular orthography, dialectal variation, manuscript variation, and limited annotated corpora.

## Training data and models

The main Early Slavic model was trained on **Universal Dependencies (UD) treebanks**.

In addition to the specialized Early Slavic model, we evaluated the approach on a large multilingual Universal Dependencies setup covering:

- **115 Universal Dependencies treebanks**
- **60+ languages**
- historical and modern language datasets

We also provide **pretrained models** for multiple UD treebanks and languages, allowing users to apply the lemmatizer beyond Early Slavic.
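UD treebanks are distributed in CoNLL-U format, where each token line has ten tab-separated columns and the second and third columns hold the word form and its lemma. The sketch below shows one way to extract (form, lemma) pairs per sentence from such data. The function name `read_form_lemma_pairs` is hypothetical, and this is not necessarily how the training data for this model was prepared.

```python
from typing import Iterable, Iterator, List, Tuple


def read_form_lemma_pairs(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield one list of (form, lemma) pairs per CoNLL-U sentence."""
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword token ranges and empty nodes
        sentence.append((cols[1], cols[2]))  # FORM, LEMMA
    if sentence:
        yield sentence
```

Any iterable of lines works as input, so an open file handle for a local `.conllu` file (opened with `encoding="utf-8"`) can be passed directly.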

## Evaluation

`OldSlavicLemma` achieves strong results on Early Slavic UD treebanks.

On UD v2.12 Early Slavic datasets, the model outperforms the systems it was compared with on the following treebanks:

- Old Church Slavonic PROIEL
- Old East Slavic Birchbark
- Old East Slavic RNC
- Old East Slavic TOROT

The model also shows strong performance on UD v2.15 Early Slavic treebanks, including:

- PROIEL
- TOROT
- Birchbark
- RNC
- Ruthenian
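This card does not state which metric the results refer to. Assuming exact-match lemma accuracy (a common choice for lemmatization, but an assumption here), a score can be computed from predicted and gold lemmas as in the sketch below. The function name `lemma_accuracy` is hypothetical.

```python
from typing import Sequence


def lemma_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of tokens whose predicted lemma exactly matches the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0  # convention chosen here for empty input
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

Exact match is case-sensitive and diacritic-sensitive as written; whether to normalize before comparing is a design decision that depends on the treebank's conventions.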

## Intended use

This model can be used for:

- lemmatization of Old Church Slavonic
- lemmatization of Old East Slavic
- preprocessing historical Slavic corpora
- corpus search and lexical normalization
- philological and linguistic annotation
- Universal Dependencies-based NLP pipelines
- experiments on low-resource and historical languages
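For the corpus search use case, lemmatized output can be turned into a simple lemma-to-position index so that all inflected forms of a word are found with one lookup. The sketch below assumes the corpus is already available as sentences of (form, lemma) pairs; the function name `build_lemma_index` and this data layout are illustrative assumptions, not part of the model's interface.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Position = Tuple[int, int]  # (sentence index, token index)


def build_lemma_index(
    sentences: Sequence[Sequence[Tuple[str, str]]],
) -> Dict[str, List[Position]]:
    """Map each lemma to every position where one of its forms occurs."""
    index: Dict[str, List[Position]] = defaultdict(list)
    for s_idx, sentence in enumerate(sentences):
        for t_idx, (_form, lemma) in enumerate(sentence):
            index[lemma].append((s_idx, t_idx))
    return dict(index)
```

Looking up a lemma in the returned dictionary then gives the positions of all its attested forms, regardless of spelling or inflection in the source text.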

## Example use

```python
# Example usage will be added soon.