---
language:
- cu
- orv
tags:
- old-church-slavonic
- old-east-slavic
- early-slavic
- historical-nlp
- lemmatization
- universal-dependencies
- sequence-to-sequence
- character-level-modeling
- low-resource-languages
- pytorch
license: cc-by-nc-4.0
library_name: pytorch
---

# OldSlavicLemma

`OldSlavicLemma` is a neural sequence-to-sequence lemmatizer for historical and low-resource languages, with a primary focus on **Old Church Slavonic (OCS)** and **Old East Slavic (OES)**.

The model is designed for texts with rich inflectional morphology, orthographic variation, dialectal variation, manuscript variation, and limited annotated resources.

## Model description

`OldSlavicLemma` is a dictionary-free character-level encoder-decoder lemmatizer. It predicts the canonical lemma of a target token using the token together with its local left and right context.

Unlike many pipeline-based lemmatizers, the model does **not** require POS tags, morphological features, external dictionaries, or hand-written rules as input. It operates directly over character sequences and generates lemmas character by character.
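As a rough illustration of what "token plus local context" can look like at the character level, the sketch below serializes a target token and a window of neighbouring tokens into one flat symbol sequence. The window size, the `<T>`, `</T>`, and `<S>` markers, and the function name `build_char_input` are assumptions made for this example only; the input encoding actually used by `OldSlavicLemma` is defined in the GitHub repository and may differ.

```python
from typing import List

# Hypothetical boundary markers; not taken from the OldSlavicLemma code base.
TARGET_OPEN = "<T>"
TARGET_CLOSE = "</T>"
SPACE = "<S>"


def build_char_input(tokens: List[str], index: int, window: int = 2) -> List[str]:
    """Serialize tokens[index] plus up to `window` neighbours on each side
    into a flat list of character symbols."""
    if not 0 <= index < len(tokens):
        raise IndexError("index is outside the sentence")

    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]

    symbols: List[str] = []
    for tok in left:
        symbols.extend(tok)        # one symbol per Unicode code point
        symbols.append(SPACE)
    symbols.append(TARGET_OPEN)
    symbols.extend(tokens[index])
    symbols.append(TARGET_CLOSE)
    for tok in right:
        symbols.append(SPACE)
        symbols.extend(tok)
    return symbols


if __name__ == "__main__":
    sentence = "мачимъ да чимъ".split()
    print(build_char_input(sentence, 1, window=1))
```

Marking the target explicitly is one common way to let a single encoder see the context while still knowing which token to lemmatize.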

The architecture is based on recurrent sequence modeling with attention mechanisms, making it suitable for historical forms where spelling variation and unseen inflected tokens are common.

## Online demo

An online demo is available for interactive lemmatization:

https://usmannawaz01.github.io/OldSlavicLemma/

The demo can be used to lemmatize individual words, tokens, or short text.


## GitHub repository

Source code and training instructions are available here:

https://github.com/usmannawaz01/OldSlavicLemma

The GitHub repository contains the training package, command-line interface, and instructions for retraining models on Universal Dependencies treebanks.

## Pretrained models

Pretrained `OldSlavicLemma` checkpoints for Early Slavic and additional Universal Dependencies treebanks are available in this Hugging Face repository, so the models can be reused without retraining.

A typical exported model folder contains:

```text
model.pt
vocab.json
config.json
```

The `.pt` file stores the trained PyTorch weights. The JSON files store the vocabulary and the model configuration.
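Before loading a downloaded folder, it can help to confirm that all three files are present. The following is a minimal sketch using only the standard library; the helper names `missing_files` and `load_json_sidecars` are hypothetical, and no assumption is made about the keys inside the JSON files.

```python
import json
from pathlib import Path
from typing import Dict, List

# File names as listed in the model folder layout above.
EXPECTED_FILES = ("model.pt", "vocab.json", "config.json")


def missing_files(model_dir: str) -> List[str]:
    """Return the expected files that are absent from model_dir."""
    folder = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


def load_json_sidecars(model_dir: str) -> Dict[str, object]:
    """Load vocab.json and config.json; model.pt is left to PyTorch."""
    folder = Path(model_dir)
    sidecars: Dict[str, object] = {}
    for name in ("vocab.json", "config.json"):
        with open(folder / name, encoding="utf-8") as handle:
            sidecars[name] = json.load(handle)
    return sidecars
```

The weights themselves would then be loaded with PyTorch by the inference code, which this sketch deliberately leaves out.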

## Available Early Slavic model IDs

```text
cu_proiel       Old Church Slavonic — PROIEL
orv_birchbark   Old East Slavic — Birchbark
orv_rnc         Old East Slavic — RNC
orv_torot       Old East Slavic — TOROT
```

## Specialization

`OldSlavicLemma` is specialized for Early Slavic lemmatization, including:

- Old Church Slavonic — PROIEL
- Old East Slavic — Birchbark
- Old East Slavic — RNC
- Old East Slavic — TOROT

Early Slavic languages are challenging for standard NLP pipelines because they exhibit complex inflectional morphology, irregular orthography, manuscript variation, dialectal variation, and limited annotated corpora.

## How to use pretrained models in Python

The easiest way to run the pretrained models programmatically is through the inference code from the Hugging Face Space.

Clone the Space repository:

```bash
git clone https://huggingface.co/spaces/usmannawaz/oldslaviclemma
cd oldslaviclemma
```

Install the required packages:

```bash
pip install -r requirements.txt
```


### Example: sentence-level input

```python
from inference import load_lemmatizer

lemmatizer = load_lemmatizer("cu_proiel")

sentence = "вѣчьнъі б҃же Ч]ьстьнаго климента законьніка ꙇ мѫченіка чьсті чьстѧце ꙇже ѹтѧже бъиті блаженѹмѹ"

tokens = sentence.split()

lemmas = lemmatizer.lemmatize_sentence(tokens)

for token, lemma in zip(tokens, lemmas):
    print(token, "->", lemma)
```

### Example: Old East Slavic — TOROT

```python
from inference import load_lemmatizer

lemmatizer = load_lemmatizer("orv_torot")

sentence = "мачимъ да чимъ ѿ бедерѧ д҃ мсца моремъ итьти"

tokens = sentence.split()

lemmas = lemmatizer.lemmatize_sentence(tokens)

for token, lemma in zip(tokens, lemmas):
    print(token, "->", lemma)
```

The inference code downloads the model weights, vocabulary, and configuration automatically from the Hugging Face model repository.
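For more than one sentence, the same call can be wrapped in a small helper. This is a sketch under stated assumptions: only `lemmatize_sentence(tokens)` comes from the examples above; the helper name `lemmatize_lines`, the one-sentence-per-line convention, and whitespace tokenization are choices made for illustration.

```python
from typing import Iterable, List, Tuple


def lemmatize_lines(lemmatizer, lines: Iterable[str]) -> List[List[Tuple[str, str]]]:
    """Lemmatize one sentence per line, using whitespace tokenization.

    `lemmatizer` is any object exposing lemmatize_sentence(tokens), as in the
    examples above. Blank lines are skipped.
    """
    results: List[List[Tuple[str, str]]] = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        lemmas = lemmatizer.lemmatize_sentence(tokens)
        if len(lemmas) != len(tokens):
            raise ValueError("expected one lemma per token")
        results.append(list(zip(tokens, lemmas)))
    return results


# Usage, assuming the Space's inference module is importable:
#   from inference import load_lemmatizer
#   pairs = lemmatize_lines(load_lemmatizer("orv_torot"), ["мачимъ да чимъ"])
```

Whitespace splitting is only adequate for text that is already tokenized; raw text needs a separate tokenizer first.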

## Training and retraining

For training or retraining from Universal Dependencies data, use the GitHub repository:

https://github.com/usmannawaz01/OldSlavicLemma

Example installation from the GitHub repository:

```bash
git clone https://github.com/usmannawaz01/OldSlavicLemma.git
cd OldSlavicLemma

conda create -n oldslaviclemma python=3.10
conda activate oldslaviclemma

pip install -r requirements.txt
pip install -e .
```

Example training command for **Old Church Slavonic — PROIEL**:

```bash
oldslaviclemma-train fit \
  --train UD_Old_Church_Slavonic-PROIEL/cu_proiel-ud-train.conllu \
  --dev UD_Old_Church_Slavonic-PROIEL/cu_proiel-ud-dev.conllu \
  --test UD_Old_Church_Slavonic-PROIEL/cu_proiel-ud-test.conllu \
  --output modelstore/cu_proiel \
  --model-id cu_proiel
```
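When retraining on several treebanks, the command above can be assembled programmatically. The sketch below uses only the subcommand and flags shown in the example; the helper name `build_train_command` is hypothetical, and it assumes the usual UD file naming (`<model_id>-ud-train.conllu` and so on), which may not hold for every treebank.

```python
import shlex
from pathlib import Path
from typing import List


def build_train_command(treebank_dir: str, model_id: str,
                        output_root: str = "modelstore") -> List[str]:
    """Assemble the `oldslaviclemma-train fit` call for one treebank.

    Assumes UD file naming: <model_id>-ud-{train,dev,test}.conllu.
    """
    base = Path(treebank_dir)
    return [
        "oldslaviclemma-train", "fit",
        "--train", str(base / f"{model_id}-ud-train.conllu"),
        "--dev", str(base / f"{model_id}-ud-dev.conllu"),
        "--test", str(base / f"{model_id}-ud-test.conllu"),
        "--output", str(Path(output_root) / model_id),
        "--model-id", model_id,
    ]


if __name__ == "__main__":
    cmd = build_train_command("UD_Old_Church_Slavonic-PROIEL", "cu_proiel")
    print(shlex.join(cmd))
    # To launch training: subprocess.run(cmd, check=True)
```

Returning an argument list rather than a single string avoids shell-quoting problems when paths contain spaces.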

The training code uses a GPU automatically when PyTorch reports that CUDA is available.

## Data

The pretrained models are trained and evaluated using Universal Dependencies treebanks.

The Universal Dependencies data does not need to be stored inside this repository. To retrain a model, download the required `.conllu` files separately and pass their paths with the `--train`, `--dev`, and `--test` arguments.

Universal Dependencies treebanks can be downloaded from the LINDAT/CLARIAH-CZ repository:

https://lindat.mff.cuni.cz/repository/items/9142eb95-44f7-442a-923f-0b39da4264fc
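To inspect a downloaded `.conllu` file before training, the word forms and gold lemmas can be read directly from the standard CoNLL-U columns (FORM is the second tab-separated column, LEMMA the third). The sketch below is independent of the `OldSlavicLemma` training code, and the function name `read_form_lemma` is hypothetical.

```python
from typing import Iterator, List, Tuple


def read_form_lemma(path: str) -> Iterator[List[Tuple[str, str]]]:
    """Yield one sentence at a time as a list of (FORM, LEMMA) pairs."""
    sentence: List[Tuple[str, str]] = []
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.rstrip("\n")
            if not line.strip():
                # A blank line ends the current sentence.
                if sentence:
                    yield sentence
                    sentence = []
                continue
            if line.startswith("#"):
                continue  # sentence-level comment, e.g. "# text = ..."
            columns = line.split("\t")
            token_id = columns[0]
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 1.1).
            if "-" in token_id or "." in token_id:
                continue
            sentence.append((columns[1], columns[2]))
    if sentence:
        yield sentence
```

For example, `sum(len(s) for s in read_form_lemma(path))` gives the number of syntactic words in a file.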

## Training and evaluation setting

The main Early Slavic experiments are based on Universal Dependencies treebanks. The model is evaluated under both gold-tokenization and raw-tokenization settings.

In addition to Early Slavic, the approach has been evaluated on a broader Universal Dependencies setup covering:

- **64 languages**
- **115 Universal Dependencies treebanks**
- historical and contemporary language datasets

These multilingual experiments are intended to assess portability beyond the main Early Slavic setting.

## Evaluated languages


- Afrikaans
- Ancient Greek
- Ancient Hebrew
- Arabic
- Armenian
- Basque
- Belarusian
- Bulgarian
- Catalan
- Chinese
- Classical Chinese
- Coptic
- Croatian
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- Galician
- German
- Gothic
- Greek
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Irish
- Italian
- Japanese
- Korean
- Latin
- Latvian
- Lithuanian
- Maghrebi Arabic
- Marathi
- Naija
- Norwegian
- Old Church Slavonic
- Old East Slavic
- Persian
- Polish
- Pomak
- Portuguese
- Romanian
- Russian
- Scottish Gaelic
- Serbian
- Slovak
- Slovenian
- Spanish
- Swedish
- Tamil
- Turkish
- Turkish German
- Ukrainian
- Urdu
- Uyghur
- Vietnamese
- Welsh
- Western Armenian
- Wolof


## Evaluation settings

### Gold-tokenization setting

Official UD tokenization and sentence boundaries are preserved. The model predicts lemmas for already-tokenized input.

**Metric:** accuracy
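
With gold tokenization, predictions and gold lemmas align one to one, so accuracy reduces to the share of exact matches. The sketch below assumes exact, case-sensitive string comparison; the function name `lemma_accuracy` is hypothetical, and the evaluation code in the GitHub repository may normalize lemmas differently.

```python
from typing import Sequence


def lemma_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of tokens whose predicted lemma equals the gold lemma exactly."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)
```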

### Raw-tokenization setting

Raw input is first tokenized using an external tokenizer, and `OldSlavicLemma` is then used for lemma prediction.

**Metric:** Lemmas F1 score from the official CoNLL-2018 evaluation script
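
With raw tokenization, system and gold words may not line up, which is why an F1 score is used instead of accuracy. The sketch below is a simplified illustration of that idea, not a replacement for the official script: a system word counts as correct only if a gold word covers the same character span and has the same lemma. The `Span` representation and the function name `lemma_f1` are assumptions, and the official script handles further cases such as multi-word tokens.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]  # character offsets of a word in the raw text


def lemma_f1(gold: Dict[Span, str], system: Dict[Span, str]) -> float:
    """F1 over words whose span and lemma both match the gold standard."""
    correct = sum(1 for span, lemma in system.items() if gold.get(span) == lemma)
    if correct == 0:
        return 0.0
    precision = correct / len(system)  # correct words / system words
    recall = correct / len(gold)       # correct words / gold words
    return 2 * precision * recall / (precision + recall)
```

A tokenization error therefore costs both precision and recall, even when the lemmatizer would have handled the correctly segmented word. Reported numbers should always come from the official evaluation script.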

## Intended use

This model can be used for:

- lemmatization of Old Church Slavonic;
- lemmatization of Old East Slavic;
- preprocessing historical Slavic corpora;
- corpus search and lexical normalization;
- philological and linguistic annotation;
- Universal Dependencies-based NLP experiments;
- research on low-resource and historical NLP.

## Limitations

`OldSlavicLemma` is a lemmatizer, not a full NLP pipeline. For raw text, tokenization must be performed separately before lemma prediction.

Performance may vary across languages, scripts, corpus sizes, and annotation conventions. The model is expected to be most useful in settings where inflected forms exhibit reusable character-level stem and suffix transformations.

Although the model is portable across Universal Dependencies treebanks, it should not be interpreted as uniformly optimal for all languages. Dictionary-based, edit-tree-based, or morphologically supervised systems may remain competitive in high-coverage, highly standardized, or typologically different settings.

## License

The pretrained model weights in this repository are released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

The source code for training and inference is available on GitHub and is released under the MIT License.

Commercial use of the pretrained model weights is not permitted without prior permission.

License details: https://creativecommons.org/licenses/by-nc/4.0/

## Citation

A formal citation for `OldSlavicLemma` will be added soon.