language:
- cu
- orv
tags:
- old-church-slavonic
- old-east-slavic
- early-slavic
- historical-nlp
- lemmatization
- universal-dependencies
- sequence-to-sequence
- character-level-modeling
- low-resource-languages
- pytorch
license: cc-by-nc-4.0
library_name: pytorch
OldSlavicLemma
OldSlavicLemma is a neural sequence-to-sequence lemmatizer for historical and low-resource languages, with a primary focus on Old Church Slavonic (OCS) and Old East Slavic (OES).
The model targets texts that combine rich inflectional morphology with orthographic, dialectal, and manuscript variation, in settings where annotated resources are limited.
Model description
OldSlavicLemma is a dictionary-free character-level encoder-decoder lemmatizer. It predicts the canonical lemma of a target token using the token together with its local left and right context.
Unlike many pipeline-based lemmatizers, the model does not require POS tags, morphological features, external dictionaries, or hand-written rules as input. It operates directly over character sequences and generates lemmas character by character.
The architecture is based on recurrent sequence modeling with attention mechanisms, making it suitable for historical forms where spelling variation and unseen inflected tokens are common.
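To make the idea of "token plus local context" concrete, the sketch below shows one way a target token and its neighbours could be flattened into a single character sequence. The function name `build_char_input`, the window size, and the boundary symbols `<`, `>`, and `_` are assumptions made for illustration. The encoding actually used by OldSlavicLemma is defined in the GitHub repository and may differ.

```python
from typing import List


def build_char_input(tokens: List[str], index: int, window: int = 2) -> List[str]:
    """Return a character sequence for tokens[index] with marked context.

    Hypothetical encoding: left context, "<", target token, ">", right
    context, with "_" separating the context tokens from each other.
    """
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]

    chars: List[str] = []
    chars.extend("_".join(left))    # characters of the left context
    chars.append("<")               # start of the target token
    chars.extend(tokens[index])     # characters of the target token
    chars.append(">")               # end of the target token
    chars.extend("_".join(right))   # characters of the right context
    return chars


# Target is the middle token, with one token of context on each side.
example = build_char_input(["мачимъ", "да", "чимъ"], index=1, window=1)
print("".join(example))
```

A decoder would then emit the lemma one character at a time from a sequence of this kind, which is what allows unseen inflected forms to be handled without a dictionary lookup.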
Online demo
An online demo is available for interactive lemmatization:
https://usmannawaz01.github.io/OldSlavicLemma/
The demo can be used to lemmatize individual words, tokens, or short texts.
GitHub repository
Source code and training instructions are available here:
https://github.com/usmannawaz01/OldSlavicLemma
The GitHub repository contains the training package, command-line interface, and instructions for retraining models on Universal Dependencies treebanks.
Pretrained models
Pretrained OldSlavicLemma models are available in this Hugging Face repository.
These models are provided to support reuse without retraining. The repository includes pretrained checkpoints for Early Slavic and additional Universal Dependencies treebanks.
A typical exported model folder contains:
model.pt
vocab.json
config.json
The .pt file stores the trained PyTorch weights. The JSON files store the vocabulary and the model configuration.
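Before loading a downloaded or exported folder, it can help to confirm that all three files are present. The following is a minimal sketch using only the standard library. The helper name `inspect_model_folder` is hypothetical, and the sketch makes no assumption about the keys inside vocab.json or config.json; it only parses them.

```python
import json
from pathlib import Path

# File names as listed in the model folder layout above.
EXPECTED_FILES = ("model.pt", "vocab.json", "config.json")


def inspect_model_folder(folder: str) -> dict:
    """Check that the expected files exist and parse the two JSON files."""
    root = Path(folder)
    missing = [name for name in EXPECTED_FILES if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing files in {root}: {missing}")

    with open(root / "vocab.json", encoding="utf-8") as fh:
        vocab = json.load(fh)
    with open(root / "config.json", encoding="utf-8") as fh:
        config = json.load(fh)

    # The weights file is only checked for existence here; rebuilding the
    # network from model.pt requires the model code from the repository.
    return {"vocab": vocab, "config": config}
```

Reading the JSON files as UTF-8 matters here because the vocabulary is likely to contain Cyrillic and other non-ASCII characters.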
Available Early Slavic model IDs
cu_proiel Old Church Slavonic — PROIEL
orv_birchbark Old East Slavic — Birchbark
orv_rnc Old East Slavic — RNC
orv_torot Old East Slavic — TOROT
Specialization
OldSlavicLemma is specialized for Early Slavic lemmatization and covers the following treebanks:
- Old Church Slavonic — PROIEL
- Old East Slavic — Birchbark
- Old East Slavic — RNC
- Old East Slavic — TOROT
Early Slavic languages are challenging for standard NLP pipelines because they combine complex inflectional morphology, irregular orthography, and manuscript and dialectal variation with limited annotated corpora.
How to use pretrained models in Python
The easiest way to use the pretrained models programmatically is through the inference code from the Hugging Face Space.
Clone the Space repository:
git clone https://huggingface.co/spaces/usmannawaz/oldslaviclemma
cd oldslaviclemma
Install the required packages:
pip install -r requirements.txt
Example: sentence-level input
from inference import load_lemmatizer
lemmatizer = load_lemmatizer("cu_proiel")
sentence = "вѣчьнъі б҃же Ч]ьстьнаго климента законьніка ꙇ мѫченіка чьсті чьстѧце ꙇже ѹтѧже бъиті блаженѹмѹ"
tokens = sentence.split()
lemmas = lemmatizer.lemmatize_sentence(tokens)
for token, lemma in zip(tokens, lemmas):
    print(token, "->", lemma)
Example: Old East Slavic — TOROT
from inference import load_lemmatizer
lemmatizer = load_lemmatizer("orv_torot")
sentence = "мачимъ да чимъ ѿ бедерѧ д҃ мсца моремъ итьти"
tokens = sentence.split()
lemmas = lemmatizer.lemmatize_sentence(tokens)
for token, lemma in zip(tokens, lemmas):
    print(token, "->", lemma)
The inference code downloads the model weights, vocabulary, and configuration automatically from the Hugging Face model repository.
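For more than one sentence, the same call can be wrapped in a small helper. The sketch below assumes only what the examples above show: an object with a `lemmatize_sentence(tokens)` method that returns one lemma per token, in order. The helper `lemmatize_lines` and the stand-in class `EchoLemmatizer` are hypothetical; the stand-in exists so the sketch runs without downloading a model.

```python
from typing import Iterable, Iterator, List, Tuple


def lemmatize_lines(lemmatizer, lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield (token, lemma) pairs for each non-empty line.

    Tokenization is a plain whitespace split, as in the examples above.
    """
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        lemmas = lemmatizer.lemmatize_sentence(tokens)
        yield list(zip(tokens, lemmas))


class EchoLemmatizer:
    """Stand-in for a real lemmatizer: lowercases each token."""

    def lemmatize_sentence(self, tokens):
        return [token.lower() for token in tokens]


for pairs in lemmatize_lines(EchoLemmatizer(), ["Ab Cd", "", "Ef"]):
    print(pairs)
```

With the real model, `EchoLemmatizer()` would be replaced by the object returned from `load_lemmatizer("cu_proiel")` or another model ID.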
Training and retraining
For training or retraining from Universal Dependencies data, use the GitHub repository:
https://github.com/usmannawaz01/OldSlavicLemma
Example installation from the GitHub repository:
git clone https://github.com/usmannawaz01/OldSlavicLemma.git
cd OldSlavicLemma
conda create -n oldslaviclemma python=3.10
conda activate oldslaviclemma
pip install -r requirements.txt
pip install -e .
Example training command for Old Church Slavonic — PROIEL:
oldslaviclemma-train fit \
--train UD_Old_Church_Slavonic-PROIEL/cu_proiel-ud-train.conllu \
--dev UD_Old_Church_Slavonic-PROIEL/cu_proiel-ud-dev.conllu \
--test UD_Old_Church_Slavonic-PROIEL/cu_proiel-ud-test.conllu \
--output modelstore/cu_proiel \
--model-id cu_proiel
The training code automatically uses a GPU when PyTorch reports that CUDA is available.
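When retraining several treebanks, the command above can be assembled programmatically. The sketch below only builds the argument list and does not run it. The helper name `build_train_command` and the default output root are assumptions; the flags and the `<model-id>-ud-<split>.conllu` file naming are taken from the example command.

```python
from pathlib import Path
from typing import List


def build_train_command(treebank_dir: str, model_id: str,
                        output_root: str = "modelstore") -> List[str]:
    """Build the oldslaviclemma-train argument list for one treebank."""
    treebank = Path(treebank_dir)

    def split_path(split: str) -> str:
        # UD files follow the pattern <model-id>-ud-<split>.conllu
        return (treebank / f"{model_id}-ud-{split}.conllu").as_posix()

    return [
        "oldslaviclemma-train", "fit",
        "--train", split_path("train"),
        "--dev", split_path("dev"),
        "--test", split_path("test"),
        "--output", (Path(output_root) / model_id).as_posix(),
        "--model-id", model_id,
    ]


cmd = build_train_command("UD_Old_Church_Slavonic-PROIEL", "cu_proiel")
print(" ".join(cmd))
# To launch training: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems if a treebank path contains spaces.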
Data
The pretrained models are trained and evaluated using Universal Dependencies treebanks.
Universal Dependencies data does not need to be stored in this repository. Users who want to retrain models can download the required .conllu files separately and pass their paths with the --train, --dev, and --test arguments.
The treebanks can be downloaded from the LINDAT repository:
https://lindat.mff.cuni.cz/repository/items/9142eb95-44f7-442a-923f-0b39da4264fc
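For readers unfamiliar with the .conllu format, the sketch below extracts (form, lemma) pairs per sentence. It relies on the standard CoNLL-U layout: ten tab-separated columns with FORM in column 2 and LEMMA in column 3, comment lines starting with `#`, and a blank line between sentences. The function name and the tiny sample are invented for illustration; this is not the data loader used by the training package.

```python
from typing import Iterable, Iterator, List, Tuple


def read_form_lemma_pairs(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield one list of (form, lemma) pairs per CoNLL-U sentence."""
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# sent_id = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # multiword token range (1-2) or empty node (1.1)
        sentence.append((cols[1], cols[2]))
    if sentence:
        yield sentence  # file did not end with a blank line


# Invented two-token sample, not taken from any treebank.
SAMPLE = (
    "# sent_id = demo\n"
    "1\tdogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tran\trun\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)
for pairs in read_form_lemma_pairs(SAMPLE.splitlines()):
    print(pairs)
```

With a real file, the same function can be called on an open file handle, for example `read_form_lemma_pairs(open(path, encoding="utf-8"))`.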
Training and evaluation setting
The main Early Slavic experiments are based on Universal Dependencies treebanks. The model is evaluated under both gold-tokenization and raw-tokenization settings.
In addition to Early Slavic, the approach has been evaluated on a broader Universal Dependencies setup covering:
- 64 languages
- 115 Universal Dependencies treebanks
- historical and contemporary language datasets
These multilingual experiments are intended to assess portability beyond the main Early Slavic setting.
Evaluated languages
- Afrikaans
- Ancient Greek
- Ancient Hebrew
- Arabic
- Armenian
- Basque
- Belarusian
- Bulgarian
- Catalan
- Chinese
- Classical Chinese
- Coptic
- Croatian
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- Galician
- German
- Gothic
- Greek
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Irish
- Italian
- Japanese
- Korean
- Latin
- Latvian
- Lithuanian
- Maghrebi Arabic
- Marathi
- Naija
- Norwegian
- Old Church Slavonic
- Old East Slavic
- Persian
- Polish
- Pomak
- Portuguese
- Romanian
- Russian
- Scottish Gaelic
- Serbian
- Slovak
- Slovenian
- Spanish
- Swedish
- Tamil
- Turkish
- Turkish German
- Ukrainian
- Urdu
- Uyghur
- Vietnamese
- Welsh
- Western Armenian
- Wolof
Evaluation settings
Gold-tokenization setting
Official UD tokenization and sentence boundaries are preserved. The model predicts lemmas for already-tokenized input.
Metric: accuracy
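Under gold tokenization, predictions and gold lemmas align one to one, so accuracy is the share of tokens whose predicted lemma matches the gold lemma. A minimal sketch follows; the function name is hypothetical, and exact, case-sensitive string comparison is an assumption about how matches are counted.

```python
from typing import Sequence


def lemma_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of positions where the predicted lemma equals the gold lemma."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        return 0.0  # avoid division by zero on empty input
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Three of the four predictions match the gold lemmas.
print(lemma_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))
```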
Raw-tokenization setting
Raw input is first tokenized using an external tokenizer, and OldSlavicLemma is then used for lemma prediction.
Metric: Lemmas F1 score from the official CoNLL-2018 evaluation script
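With raw tokenization, the system and gold token sequences can differ in length, which is why an F1 score is reported instead of accuracy. The sketch below shows only the final arithmetic: precision over system words, recall over gold words, and their harmonic mean. It takes the three counts as inputs and does not reproduce the word alignment performed by the official script; the function name is hypothetical.

```python
from typing import Tuple


def lemma_f1(correct: int, system_total: int, gold_total: int) -> Tuple[float, float, float]:
    """Return (precision, recall, f1) from aligned-word counts.

    correct:      aligned words whose lemma matches the gold lemma
    system_total: number of words produced by the system
    gold_total:   number of words in the gold standard
    """
    precision = correct / system_total if system_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    denominator = system_total + gold_total
    # Algebraically equal to 2 * P * R / (P + R).
    f1 = 2 * correct / denominator if denominator else 0.0
    return precision, recall, f1


precision, recall, f1 = lemma_f1(correct=6, system_total=8, gold_total=12)
print(precision, recall, f1)
```

When tokenization is perfect, the system and gold totals coincide, and precision, recall, and F1 all reduce to plain accuracy.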
Intended use
This model can be used for:
- lemmatization of Old Church Slavonic;
- lemmatization of Old East Slavic;
- preprocessing historical Slavic corpora;
- corpus search and lexical normalization;
- philological and linguistic annotation;
- Universal Dependencies-based NLP experiments;
- research on low-resource and historical NLP.
Limitations
OldSlavicLemma is a lemmatizer, not a full NLP pipeline. For raw text, tokenization must be performed separately before lemma prediction.
Performance may vary across languages, scripts, corpus sizes, and annotation conventions. The model is expected to be most useful in settings where inflected forms exhibit reusable character-level stem and suffix transformations.
Although the model is portable across Universal Dependencies treebanks, it should not be interpreted as uniformly optimal for all languages. Dictionary-based, edit-tree-based, or morphologically supervised systems may remain competitive in high-coverage, highly standardized, or typologically different settings.
License
The pretrained model weights in this repository are released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).
The source code for training and inference is available on GitHub and is released under the MIT License.
Commercial use of the pretrained model weights is not permitted without prior permission.
License details: https://creativecommons.org/licenses/by-nc/4.0/
Citation
A formal citation for OldSlavicLemma will be added soon.