OldSlavicLemma

OldSlavicLemma is a neural lemmatizer designed for Early Slavic languages, especially Old Church Slavonic (OCS) and Old East Slavic (OES).

The model is intended for historical NLP, philological research, corpus annotation, lexical normalization, and the processing of morphologically rich ancient languages.

Model description

OldSlavicLemma is a dictionary-free character-level sequence-to-sequence lemmatizer.
It maps an inflected token to its canonical lemma by using the target word together with its surrounding context.

Unlike many traditional lemmatizers, the model does not require POS tags, morphological features, external dictionaries, or hand-written rules. It operates directly on character sequences.
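
To make the idea of a character-level input with context more concrete, here is a minimal sketch of how a target token and its neighbours could be flattened into one character sequence. The function name `build_char_input`, the window size, the `_` space symbol, and the `<t>` / `</t>` target markers are all illustrative assumptions; the model's actual input encoding is not documented in this card.

```python
from typing import List


def build_char_input(tokens: List[str], index: int, window: int = 2) -> List[str]:
    """Build a character sequence for tokens[index] plus nearby context.

    Hypothetical layout (an assumption, not the model's documented format):
    left context characters, the target wrapped in <t> ... </t>, then right
    context characters, with "_" standing in for the space between tokens.
    """
    if not 0 <= index < len(tokens):
        raise IndexError("index is outside the sentence")

    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]

    chars: List[str] = []
    for tok in left:
        chars.extend(tok)      # one symbol per character
        chars.append("_")      # token boundary
    chars.append("<t>")        # start of the token to lemmatize
    chars.extend(tokens[index])
    chars.append("</t>")       # end of the token to lemmatize
    for tok in right:
        chars.append("_")
        chars.extend(tok)
    return chars
```

A sequence-to-sequence decoder would then be expected to emit the lemma one character at a time; the markers only tell it which span of the input to lemmatize.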

Specialization

This model is specialized for Early Slavic languages, including:

  • Old Church Slavonic
  • Old East Slavic

Early Slavic languages are challenging for standard NLP pipelines because they contain rich inflectional morphology, irregular orthography, dialectal variation, manuscript variation, and limited annotated corpora.

Training data and models

The main Early Slavic model was trained on Universal Dependencies (UD) treebanks.
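
Universal Dependencies treebanks are distributed in the CoNLL-U format, where each token line has ten tab-separated columns and the second and third columns hold the surface form and the lemma. The sketch below shows one way to pull (form, lemma) pairs out of such a file; the function name `read_form_lemma_pairs` is hypothetical, and this is not necessarily how the training data for this model was prepared.

```python
from typing import Iterable, Iterator, List, Tuple


def read_form_lemma_pairs(lines: Iterable[str]) -> Iterator[List[Tuple[str, str]]]:
    """Yield one list of (FORM, LEMMA) pairs per sentence from CoNLL-U lines."""
    sentence: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment such as "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip malformed lines in this sketch
        token_id = cols[0]
        if "-" in token_id or "." in token_id:
            continue  # multiword token ranges (1-2) and empty nodes (1.1)
        sentence.append((cols[1], cols[2]))  # FORM, LEMMA
    if sentence:
        yield sentence  # file without a trailing blank line
```

It can be applied to an open file handle, for example `with open(path, encoding="utf-8") as f: pairs = list(read_form_lemma_pairs(f))`, where `path` points to a local `.conllu` file.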

In addition to the specialized Early Slavic model, we evaluated the approach on a large multilingual Universal Dependencies setup covering:

  • 115 Universal Dependencies treebanks
  • 60+ languages
  • historical and modern language datasets

We also provide pretrained models for multiple UD treebanks and languages, allowing users to apply the lemmatizer beyond Early Slavic.

Evaluation

OldSlavicLemma achieves strong results on Early Slavic UD treebanks.

On UD v2.12 Early Slavic datasets, the model outperforms the compared baselines on the following treebanks:

  • Old Church Slavonic PROIEL
  • Old East Slavic Birchbark
  • Old East Slavic RNC
  • Old East Slavic TOROT

The model also shows strong performance on UD v2.15 Early Slavic treebanks, including:

  • PROIEL
  • TOROT
  • Birchbark
  • RNC
  • Ruthenian
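
This card does not state which metric the results refer to. As a rough sketch, assuming token-level exact-match accuracy (a common choice for lemmatization), the score on a treebank could be computed as follows. The function name `lemma_accuracy` is hypothetical.

```python
from typing import Sequence


def lemma_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of tokens whose predicted lemma exactly matches the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0  # convention chosen for this sketch: empty input scores 0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)
```

The comparison here is case-sensitive and applies no Unicode normalization; for manuscripts with variable orthography, whether to normalize before comparing is a design choice that changes the reported number.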

Intended use

This model can be used for:

  • lemmatization of Old Church Slavonic
  • lemmatization of Old East Slavic
  • preprocessing historical Slavic corpora
  • corpus search and lexical normalization
  • philological and linguistic annotation
  • Universal Dependencies-based NLP pipelines
  • experiments on low-resource and historical languages

Example use

# Example usage will be added soon.
