language:
  - cu
  - orv
tags:
  - old-church-slavonic
  - old-east-slavic
  - early-slavic
  - historical-nlp
  - lemmatization
  - universal-dependencies
  - sequence-to-sequence
  - character-level-modeling
  - low-resource-languages
  - pytorch
license: cc-by-nc-4.0
library_name: pytorch

OldSlavicLemma

OldSlavicLemma is a neural sequence-to-sequence lemmatizer for historical and low-resource languages, with a primary focus on Old Church Slavonic (OCS) and Old East Slavic (OES).

The model is designed for texts with rich inflectional morphology, orthographic variation, dialectal variation, manuscript variation, and limited annotated resources.

Model description

OldSlavicLemma is a dictionary-free character-level encoder-decoder lemmatizer. It predicts the canonical lemma of a target token using the token together with its local left and right context.

Unlike many pipeline-based lemmatizers, the model does not require POS tags, morphological features, external dictionaries, or hand-written rules as input. It operates directly over character sequences and generates lemmas character by character.

The architecture is based on recurrent sequence modeling with attention mechanisms, making it suitable for historical forms where spelling variation and unseen inflected tokens are common.
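
As an illustration of the input described above, the sketch below turns a target token and its neighbours into a single character sequence. The window size, the `<t>` and `</t>` markers, and the function name are assumptions made for this example; the actual encoding is defined in the GitHub repository and may differ.

```python
def build_char_input(tokens, index, window=2):
    """Return a character list for tokens[index] with local context.

    Hypothetical encoding: the marker symbols and the window size are
    illustrative and are not taken from the OldSlavicLemma source code.
    """
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]

    chars = []
    for tok in left:
        chars.extend(tok)      # one entry per Unicode code point
        chars.append(" ")
    chars.append("<t>")        # start of the target token
    chars.extend(tokens[index])
    chars.append("</t>")       # end of the target token
    for tok in right:
        chars.append(" ")
        chars.extend(tok)
    return chars


tokens = ["ꙇ", "мѫченіка", "чьсті"]
print(build_char_input(tokens, 1, window=1))
```

Note that iterating over a Python string yields code points, so combining marks such as the titlo become separate symbols in this sketch.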

Online demo

An online demo is available for interactive lemmatization:

https://usmannawaz01.github.io/OldSlavicLemma/

The demo can be used to lemmatize individual words, tokens, or short texts.

GitHub repository

Source code and training instructions are available here:

https://github.com/usmannawaz01/OldSlavicLemma

The GitHub repository contains the training package, command-line interface, and instructions for retraining models on Universal Dependencies treebanks.

Pretrained models

Pretrained OldSlavicLemma models are available in this Hugging Face repository.

These models are provided to support reuse without retraining. The repository includes pretrained checkpoints for Early Slavic and additional Universal Dependencies treebanks.

A typical exported model folder contains:

model.pt
vocab.json
config.json

The .pt file stores the trained PyTorch weights. The JSON files store the vocabulary and the model configuration.
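
Before loading a locally exported model, it can help to confirm that the folder is complete. The following is a minimal sketch that relies only on the three file names listed above; the helper name `missing_model_files` and the example path are hypothetical, and the internal layout of the JSON files is deliberately not assumed.

```python
from pathlib import Path

# File names as listed in this README.
EXPECTED_FILES = ("model.pt", "vocab.json", "config.json")


def missing_model_files(model_dir):
    """Return the expected files that are absent from model_dir."""
    folder = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


# Hypothetical local path; replace it with the folder that holds the export.
missing = missing_model_files("modelstore/cu_proiel")
print("missing files:", missing or "none")
```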

Available Early Slavic model IDs

cu_proiel       Old Church Slavonic — PROIEL
orv_birchbark   Old East Slavic — Birchbark
orv_rnc         Old East Slavic — RNC
orv_torot       Old East Slavic — TOROT

Specialization

OldSlavicLemma is specialized for Early Slavic lemmatization and covers the following treebanks:

Old Church Slavonic — PROIEL
Old East Slavic — Birchbark
Old East Slavic — RNC
Old East Slavic — TOROT

Early Slavic languages are challenging for standard NLP pipelines because they exhibit complex inflectional morphology, irregular orthography, manuscript variation, dialectal variation, and limited annotated corpora.

How to use pretrained models in Python

The easiest way to use the pretrained models programmatically is through the inference code from the Hugging Face Space.

Clone the Space repository:

git clone https://huggingface.co/spaces/usmannawaz/oldslaviclemma
cd oldslaviclemma

Install the required packages:

pip install -r requirements.txt

Example: sentence-level input

from inference import load_lemmatizer

lemmatizer = load_lemmatizer("cu_proiel")

sentence = "вѣчьнъі б҃же Ч]ьстьнаго климента законьніка ꙇ мѫченіка чьсті чьстѧце ꙇже ѹтѧже бъиті блаженѹмѹ"

tokens = sentence.split()

lemmas = lemmatizer.lemmatize_sentence(tokens)

for token, lemma in zip(tokens, lemmas):
    print(token, "->", lemma)

Example: Old East Slavic — TOROT

from inference import load_lemmatizer

lemmatizer = load_lemmatizer("orv_torot")

sentence = "мачимъ да чимъ ѿ бедерѧ д҃ мсца моремъ итьти"

tokens = sentence.split()

lemmas = lemmatizer.lemmatize_sentence(tokens)

for token, lemma in zip(tokens, lemmas):
    print(token, "->", lemma)

The inference code downloads the model weights, vocabulary, and configuration automatically from the Hugging Face model repository.
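
To process several sentences, the call pattern above can be wrapped in a small helper. The sketch below is illustrative only: `lemmatize_lines` and the tab-separated output layout are assumptions, and a stand-in function replaces the real lemmatizer so that the snippet runs without downloading a model. With the Space code installed, `lemmatizer.lemmatize_sentence` would be passed instead of `echo_lemmatizer`.

```python
def lemmatize_lines(lines, lemmatize_sentence):
    """Yield 'token<TAB>lemma' rows, with an empty row after each sentence.

    lemmatize_sentence is any callable that maps a list of tokens to a
    list of lemmas of the same length.
    """
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # skip blank lines
        lemmas = lemmatize_sentence(tokens)
        if len(lemmas) != len(tokens):
            raise ValueError("expected one lemma per token")
        for token, lemma in zip(tokens, lemmas):
            yield f"{token}\t{lemma}"
        yield ""  # sentence separator


def echo_lemmatizer(tokens):
    """Stand-in for a real model: returns each token unchanged."""
    return list(tokens)


for row in lemmatize_lines(["мачимъ да чимъ", "моремъ итьти"], echo_lemmatizer):
    print(row)
```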

Training and retraining

For training or retraining from Universal Dependencies data, use the GitHub repository:

https://github.com/usmannawaz01/OldSlavicLemma

Example installation from the GitHub repository:

git clone https://github.com/usmannawaz01/OldSlavicLemma.git
cd OldSlavicLemma

conda create -n oldslaviclemma python=3.10
conda activate oldslaviclemma

pip install -r requirements.txt
pip install -e .

Example training command for Old Church Slavonic — PROIEL:

oldslaviclemma-train fit \
  --train UD_Old_Church_Slavonic-PROIEL/cu_proiel-ud-train.conllu \
  --dev UD_Old_Church_Slavonic-PROIEL/cu_proiel-ud-dev.conllu \
  --test UD_Old_Church_Slavonic-PROIEL/cu_proiel-ud-test.conllu \
  --output modelstore/cu_proiel \
  --model-id cu_proiel

The training code uses a GPU automatically when CUDA is available through PyTorch.
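
When several treebanks are retrained, the command shown above can be assembled programmatically. This sketch uses only the flags that appear in the example command; the helper name `build_train_command`, the default `modelstore` output root, and the assumption that every treebank follows the `<model_id>-ud-{train,dev,test}.conllu` naming pattern are illustrative.

```python
import shlex
from pathlib import Path


def build_train_command(treebank_dir, model_id, output_root="modelstore"):
    """Build the oldslaviclemma-train argument list for one UD treebank.

    Assumes the files are named <model_id>-ud-{train,dev,test}.conllu,
    as in the PROIEL example.
    """
    base = Path(treebank_dir)
    cmd = ["oldslaviclemma-train", "fit"]
    for split in ("train", "dev", "test"):
        cmd += [f"--{split}", str(base / f"{model_id}-ud-{split}.conllu")]
    cmd += ["--output", str(Path(output_root) / model_id), "--model-id", model_id]
    return cmd


# The directory name below is assumed to follow the UD naming convention.
cmd = build_train_command("UD_Old_East_Slavic-TOROT", "orv_torot")
print(shlex.join(cmd))
# To launch training: subprocess.run(cmd, check=True)
```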

Data

The pretrained models are trained and evaluated using Universal Dependencies treebanks.

Universal Dependencies data does not need to be stored inside this repository. Users who want to retrain models can download the required .conllu files separately and pass their paths with the --train, --dev, and --test arguments.

The treebank data can be obtained from the LINDAT repository:

https://lindat.mff.cuni.cz/repository/items/9142eb95-44f7-442a-923f-0b39da4264fc
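
For readers unfamiliar with the .conllu files mentioned above, the sketch below extracts (form, lemma) pairs from CoNLL-U lines. It is a minimal reader written for illustration, not the loader used by the training package; the function name and the English placeholder sample are assumptions.

```python
def read_conllu_pairs(lines):
    """Return a list of sentences, each a list of (form, lemma) pairs."""
    sentences, current = [], []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comments such as "# text = ..."
        cols = line.split("\t")
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # multiword token ranges (1-2) and empty nodes (1.1)
        current.append((cols[1], cols[2]))  # FORM and LEMMA columns
    if current:
        sentences.append(current)
    return sentences


sample = [
    "# sent_id = example-1\n",
    "1\tforms\tform\tNOUN\t_\t_\t0\troot\t_\t_\n",
    "\n",
]
print(read_conllu_pairs(sample))  # [[('forms', 'form')]]
```

A real file can be passed as an open handle, for example `read_conllu_pairs(open(path, encoding="utf-8"))`.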

Training and evaluation setting

The main Early Slavic experiments are based on Universal Dependencies treebanks. The model is evaluated under both gold-tokenization and raw-tokenization settings.

In addition to Early Slavic, the approach has been evaluated on a broader Universal Dependencies setup covering:

64 languages
115 Universal Dependencies treebanks
historical and contemporary language datasets

These multilingual experiments are intended to assess portability beyond the main Early Slavic setting.

Evaluated languages

Afrikaans
Ancient Greek
Ancient Hebrew
Arabic
Armenian
Basque
Belarusian
Bulgarian
Catalan
Chinese
Classical Chinese
Coptic
Croatian
Czech
Danish
Dutch
English
Estonian
Finnish
French
Galician
German
Gothic
Greek
Hebrew
Hindi
Hungarian
Icelandic
Indonesian
Irish
Italian
Japanese
Korean
Latin
Latvian
Lithuanian
Maghrebi Arabic
Marathi
Naija
Norwegian
Old Church Slavonic
Old East Slavic
Persian
Polish
Pomak
Portuguese
Romanian
Russian
Scottish Gaelic
Serbian
Slovak
Slovenian
Spanish
Swedish
Tamil
Turkish
Turkish German
Ukrainian
Urdu
Uyghur
Vietnamese
Welsh
Western Armenian
Wolof

Evaluation settings

Gold-tokenization setting

Official UD tokenization and sentence boundaries are preserved. The model predicts lemmas for already-tokenized input.

Metric: accuracy
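
With gold tokenization, predictions and references line up one to one, so accuracy reduces to the share of exact matches. The sketch below assumes plain case-sensitive string comparison; the evaluation code in the repository may apply additional normalization.

```python
def lemma_accuracy(predicted, gold):
    """Fraction of tokens whose predicted lemma equals the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


print(lemma_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"]))  # 0.75
```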

Raw-tokenization setting

Raw input is first tokenized using an external tokenizer, and OldSlavicLemma is then used for lemma prediction.

Metric: Lemmas F1 score from the official CoNLL-2018 evaluation script
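
When the tokenizer output differs from the gold tokens, accuracy is no longer well defined, which is why an F1 score is reported. The following is a simplified sketch of the idea and not the official script: words are aligned only when they cover the same character span, and the official script's handling of multiword tokens and other details is omitted. All function names are hypothetical.

```python
def char_spans(words):
    """Map each (form, lemma) pair to its (start, end) character span."""
    spans, pos = {}, 0
    for form, lemma in words:
        spans[(pos, pos + len(form))] = lemma
        pos += len(form)
    return spans


def lemmas_f1(system, gold):
    """Simplified span-based lemma F1 over two tokenizations of one text."""
    if "".join(f for f, _ in system) != "".join(f for f, _ in gold):
        raise ValueError("system and gold must cover the same characters")
    sys_spans, gold_spans = char_spans(system), char_spans(gold)
    # A system word counts as correct only if a gold word has the same
    # span and the same lemma.
    correct = sum(
        1 for span, lemma in sys_spans.items()
        if gold_spans.get(span) == lemma
    )
    if correct == 0:
        return 0.0
    precision = correct / len(sys_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


gold = [("ab", "a"), ("cd", "c")]
system = [("ab", "a"), ("c", "c"), ("d", "d")]  # second word split in two
print(round(lemmas_f1(system, gold), 3))  # 0.4
```

In the example, one of three system words matches one of two gold words, giving precision 1/3, recall 1/2, and F1 0.4.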

Intended use

This model can be used for:

lemmatization of Old Church Slavonic;
lemmatization of Old East Slavic;
preprocessing historical Slavic corpora;
corpus search and lexical normalization;
philological and linguistic annotation;
Universal Dependencies-based NLP experiments;
research on low-resource and historical NLP.

Limitations

OldSlavicLemma is a lemmatizer, not a full NLP pipeline. For raw text, tokenization must be performed separately before lemma prediction.

Performance may vary across languages, scripts, corpus sizes, and annotation conventions. The model is expected to be most useful in settings where inflected forms exhibit reusable character-level stem and suffix transformations.

Although the model is portable across Universal Dependencies treebanks, it should not be interpreted as uniformly optimal for all languages. Dictionary-based, edit-tree-based, or morphologically supervised systems may remain competitive in high-coverage, highly standardized, or typologically different settings.

License

The pretrained model weights in this repository are released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

The source code for training and inference is available on GitHub and is released under the MIT License.

Commercial use of the pretrained model weights is not permitted without prior permission.

License details: https://creativecommons.org/licenses/by-nc/4.0/

Citation

A formal citation for OldSlavicLemma will be added soon.