---
language:
- cu
- orv
tags:
- old-church-slavonic
- old-east-slavic
- early-slavic
- historical-nlp
- lemmatization
- universal-dependencies
- sequence-to-sequence
- character-level-modeling
- low-resource-languages
- pytorch
license: cc-by-nc-4.0
library_name: pytorch
---

# OldSlavicLemma

`OldSlavicLemma` is a neural sequence-to-sequence lemmatizer for historical and low-resource languages, with a primary focus on **Old Church Slavonic (OCS)** and **Old East Slavic (OES)**.

The model is designed for texts with rich inflectional morphology, orthographic variation, dialectal variation, manuscript variation, and limited annotated resources.

## Model description

`OldSlavicLemma` is a dictionary-free character-level encoder-decoder lemmatizer. It predicts the canonical lemma of a target token using the token together with its local left and right context.

Unlike many pipeline-based lemmatizers, the model does **not** require POS tags, morphological features, external dictionaries, or hand-written rules as input. It operates directly over character sequences and generates lemmas character by character.

The architecture is based on recurrent sequence modeling with attention mechanisms, making it suitable for historical forms where spelling variation and unseen inflected tokens are common.

## Online demo

An online demo is available for interactive lemmatization:

https://usmannawaz01.github.io/OldSlavicLemma/

The demo can be used to lemmatize individual words, tokens, or short text.

## GitHub repository

Source code and training instructions are available here:

https://github.com/usmannawaz01/OldSlavicLemma

The GitHub repository contains the training package, command-line interface, and instructions for retraining models on Universal Dependencies treebanks.

## Pretrained models

Pretrained `OldSlavicLemma` models are available in this Hugging Face repository. These models are provided to support reuse without retraining.
The repository includes pretrained checkpoints for Early Slavic and additional Universal Dependencies treebanks.

A typical exported model folder contains:

```text
model.pt
vocab.json
config.json
```

The `.pt` file stores the trained PyTorch weights. The JSON files store the vocabulary and the model configuration.

## Available Early Slavic model IDs

```text
cu_proiel       Old Church Slavonic — PROIEL
orv_birchbark   Old East Slavic — Birchbark
orv_rnc         Old East Slavic — RNC
orv_torot       Old East Slavic — TOROT
```

## Specialization

`OldSlavicLemma` is specialized for Early Slavic lemmatization, including:

- Old Church Slavonic — PROIEL
- Old East Slavic — Birchbark
- Old East Slavic — RNC
- Old East Slavic — TOROT

Early Slavic languages are challenging for standard NLP pipelines because they exhibit complex inflectional morphology, irregular orthography, manuscript variation, dialectal variation, and limited annotated corpora.

## How to use pretrained models in Python

The easiest way to use the pretrained models programmatically is to use the inference code from the Hugging Face Space.

Clone the Space repository:

```bash
git clone https://huggingface.co/spaces/usmannawaz/oldslaviclemma
cd oldslaviclemma
```

Install the required packages:

```bash
pip install -r requirements.txt
```

### Example: sentence-level input

```python
from inference import load_lemmatizer

# Load the pretrained model by its model ID (here: Old Church Slavonic, PROIEL).
lemmatizer = load_lemmatizer("cu_proiel")

sentence = "вѣчьнъі б҃же Ч]ьстьнаго климента законьніка ꙇ мѫченіка чьсті чьстѧце ꙇже ѹтѧже бъиті блаженѹмѹ"

# The lemmatizer takes a list of tokens, so split the sentence first.
tokens = sentence.split()
lemmas = lemmatizer.lemmatize_sentence(tokens)

# Print each token next to its predicted lemma.
for token, lemma in zip(tokens, lemmas):
    print(token, "->", lemma)
```

### Example: Old East Slavic — TOROT

```python
from inference import load_lemmatizer

# Load the pretrained Old East Slavic (TOROT) model.
lemmatizer = load_lemmatizer("orv_torot")

sentence = "мачимъ да чимъ ѿ бедерѧ д҃ мсца моремъ итьти"

# The lemmatizer takes a list of tokens, so split the sentence first.
tokens = sentence.split()
lemmas = lemmatizer.lemmatize_sentence(tokens)

# Print each token next to its predicted lemma.
for token, lemma in zip(tokens, lemmas):
    print(token, "->", lemma)
```

The inference code downloads the model weights, vocabulary, and configuration automatically from the Hugging Face model repository.

## Training and retraining

For training or retraining from Universal Dependencies data, use the GitHub repository:

https://github.com/usmannawaz01/OldSlavicLemma

Example installation from the GitHub repository:

```bash
git clone https://github.com/usmannawaz01/OldSlavicLemma.git
cd OldSlavicLemma

conda create -n oldslaviclemma python=3.10
conda activate oldslaviclemma

pip install -r requirements.txt
pip install -e .
```

Example training command for **Old Church Slavonic — PROIEL**:

```bash
oldslaviclemma-train fit \
  --train UD_Old_Church_Slavonic-PROIEL/cu_proiel-ud-train.conllu \
  --dev UD_Old_Church_Slavonic-PROIEL/cu_proiel-ud-dev.conllu \
  --test UD_Old_Church_Slavonic-PROIEL/cu_proiel-ud-test.conllu \
  --output modelstore/cu_proiel \
  --model-id cu_proiel
```

The training code automatically uses a GPU when CUDA is available through PyTorch.

## Data

The pretrained models are trained and evaluated using Universal Dependencies treebanks.
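
As a rough illustration of the supervision these treebanks provide, the sketch below pulls (form, lemma) pairs out of CoNLL-U text. It relies only on the published CoNLL-U layout (ten tab-separated columns, with FORM and LEMMA in the second and third positions). The helper name `read_form_lemma_pairs` and the sample rows are hypothetical and are not part of the `OldSlavicLemma` package, whose own data loading may differ.

```python
from typing import Iterable, List, Tuple


def read_form_lemma_pairs(lines: Iterable[str]) -> List[List[Tuple[str, str]]]:
    """Group (FORM, LEMMA) pairs by sentence from CoNLL-U lines.

    Hypothetical helper for illustration; not part of OldSlavicLemma.
    """
    sentences: List[List[Tuple[str, str]]] = []
    current: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (sent_id, text, ...) carry no tokens.
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            # Skip rows that do not follow the ten-column layout.
            continue
        token_id, form, lemma = cols[0], cols[1], cols[2]
        if "-" in token_id or "." in token_id:
            # Skip multiword-token ranges (e.g. 1-2) and empty nodes (e.g. 3.1).
            continue
        current.append((form, lemma))
    if current:
        sentences.append(current)
    return sentences


# Made-up sample rows in CoNLL-U layout (not taken from any treebank).
SAMPLE = (
    "# sent_id = demo-1\n"
    "1\tcats\tcat\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tsleep\tsleep\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
    "1-2\tdon't\t_\t_\t_\t_\t_\t_\t_\t_\n"
    "1\tdo\tdo\tAUX\t_\t_\t0\troot\t_\t_\n"
    "2\tn't\tnot\tPART\t_\t_\t1\tadvmod\t_\t_\n"
)

pairs = read_form_lemma_pairs(SAMPLE.splitlines())
for sentence in pairs:
    print(sentence)
```

To read a real treebank file, pass an open file handle instead of the sample, for example `read_form_lemma_pairs(open(path, encoding="utf-8"))`, since iterating over a text file yields its lines.
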
Universal Dependencies data does not need to be stored inside this repository. Users who want to retrain models can download the required `.conllu` files separately and provide their paths with the `--train`, `--dev`, and `--test` arguments.

The treebanks can be downloaded from the LINDAT repository:

https://lindat.mff.cuni.cz/repository/items/9142eb95-44f7-442a-923f-0b39da4264fc

## Training and evaluation setting

The main Early Slavic experiments are based on Universal Dependencies treebanks. The model is evaluated under both gold-tokenization and raw-tokenization settings.

In addition to Early Slavic, the approach has been evaluated on a broader Universal Dependencies setup covering:

- **64 languages**
- **115 Universal Dependencies treebanks**
- historical and contemporary language datasets

These multilingual experiments are intended to assess portability beyond the main Early Slavic setting.

## Evaluated languages

- Afrikaans
- Ancient Greek
- Ancient Hebrew
- Arabic
- Armenian
- Basque
- Belarusian
- Bulgarian
- Catalan
- Chinese
- Classical Chinese
- Coptic
- Croatian
- Czech
- Danish
- Dutch
- English
- Estonian
- Finnish
- French
- Galician
- German
- Gothic
- Greek
- Hebrew
- Hindi
- Hungarian
- Icelandic
- Indonesian
- Irish
- Italian
- Japanese
- Korean
- Latin
- Latvian
- Lithuanian
- Maghrebi Arabic
- Marathi
- Naija
- Norwegian
- Old Church Slavonic
- Old East Slavic
- Persian
- Polish
- Pomak
- Portuguese
- Romanian
- Russian
- Scottish Gaelic
- Serbian
- Slovak
- Slovenian
- Spanish
- Swedish
- Tamil
- Turkish
- Turkish German
- Ukrainian
- Urdu
- Uyghur
- Vietnamese
- Welsh
- Western Armenian
- Wolof

## Evaluation settings

### Gold-tokenization setting

Official UD tokenization and sentence boundaries are preserved. The model predicts lemmas for already-tokenized input.

**Metric:** accuracy

### Raw-tokenization setting

Raw input is first tokenized using an external tokenizer, and `OldSlavicLemma` is then used for lemma prediction.

**Metric:** Lemmas F1 score from the official CoNLL-2018 evaluation script

## Intended use

This model can be used for:

- lemmatization of Old Church Slavonic;
- lemmatization of Old East Slavic;
- preprocessing historical Slavic corpora;
- corpus search and lexical normalization;
- philological and linguistic annotation;
- Universal Dependencies-based NLP experiments;
- research on low-resource and historical NLP.

## Limitations

`OldSlavicLemma` is a lemmatizer, not a full NLP pipeline. For raw text, tokenization must be performed separately before lemma prediction.

Performance may vary across languages, scripts, corpus sizes, and annotation conventions. The model is expected to be most useful in settings where inflected forms exhibit reusable character-level stem and suffix transformations.

Although the model is portable across Universal Dependencies treebanks, it should not be interpreted as uniformly optimal for all languages. Dictionary-based, edit-tree-based, or morphologically supervised systems may remain competitive in high-coverage, highly standardized, or typologically different settings.

## License

The pretrained model weights in this repository are released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

The source code for training and inference is available on GitHub and is released under the MIT License.

Commercial use of the pretrained model weights is not permitted without prior permission.

License details:

https://creativecommons.org/licenses/by-nc/4.0/

## Citation

A formal citation for `OldSlavicLemma` will be added soon.