Old-Serb-XLM-RoBERTa: Using XLM-RoBERTa for Named Entity Recognition on Historical Serbian Texts

This model is a fine-tuned version of XLM-RoBERTa-base for Named Entity Recognition (NER) on 16th-century Serbian historical documents. It identifies entities such as persons (PER), locations (LOC), and ethnicities (DEMO), tagged in BIO format.

Model Details

  • Base Model: xlm-roberta-base
  • Task: Token Classification (NER)
  • Labels: BIO-tagged entities including PER, LOC, and DEMO (see label_config.json for the full list)
  • Training Data: Annotated historical Serbian charters with synthetic data augmentation
  • Max Sequence Length: 512 tokens (standard for XLM-RoBERTa)
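
To make the BIO scheme concrete, the sketch below builds the tag set implied by the three entity types named above. The helper name build_bio_labels and the tag ordering are assumptions for illustration only; the authoritative mapping is the one shipped with the model (label_config.json, or model.config.id2label once the model is loaded).

```python
from typing import Dict, List

def build_bio_labels(entity_types: List[str]) -> List[str]:
    """Return ["O", "B-X", "I-X", ...] for the given entity types (hypothetical helper)."""
    labels = ["O"]
    for ent in entity_types:
        # B- marks the first word of an entity, I- marks any following word.
        labels.extend([f"B-{ent}", f"I-{ent}"])
    return labels

# Entity types named in this model card; the real label list may contain more.
labels = build_bio_labels(["PER", "LOC", "DEMO"])
# Assumed ordering, for illustration only; the model's own id2label is authoritative.
id2label: Dict[int, str] = dict(enumerate(labels))
print(labels)
```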

Installation

Install the required libraries:

pip install transformers torch

Usage

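Long texts are split into overlapping word chunks that each fit within the model's 512-token limit, so no input is truncated. Words in the overlap are tagged in both chunks and the first non-O prediction is kept, which reduces the chance of missing entities near chunk boundaries.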
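
As a simplified illustration of the overlap logic alone, the sketch below slides a fixed-size window over word indices. It assumes a fixed number of words per window, whereas the full example that follows sizes each chunk by token count. The function name word_windows and its parameters are hypothetical.

```python
from typing import List, Tuple

def word_windows(n_words: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) word ranges; consecutive ranges share `overlap` words."""
    if window <= 0:
        raise ValueError("window must be positive")
    spans = []
    start = 0
    while start < n_words:
        end = min(start + window, n_words)
        spans.append((start, end))
        if end >= n_words:
            break
        # Step back by `overlap` words, but always advance by at least one word.
        start = max(start + 1, end - overlap)
    return spans

print(word_windows(10, window=4, overlap=1))  # [(0, 4), (3, 7), (6, 10)]
```

Words 3 and 6 each appear in two windows, so each gets two predictions that must be merged.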

import unicodedata
from typing import List, Dict, Tuple
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "MarijaMaja/xlm-roberta-base-finetuned-old-serbian-ner" 
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

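def normalize_text_consistently(text: str) -> str:
    """Normalize to NFC, then drop combining marks and format characters."""
    s = unicodedata.normalize("NFC", text)
    cleaned = []
    for ch in s:
        cat = unicodedata.category(ch)
        # Dropped explicitly, in addition to the category filter below.
        if ch == "ⸯ":
            continue
        # Drop nonspacing combining marks (Mn, e.g. superscript letters) and format characters (Cf).
        if cat in ("Mn", "Cf"):
            continue
        cleaned.append(ch)
    return "".join(cleaned)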

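def build_word_chunks(
    norm_words: List[str],
    tokenizer,
    max_length: int = 512,
    overlap_words: int = 50,
) -> List[Tuple[int, int]]:
    """Split words into overlapping (start, end) ranges that each fit in max_length tokens."""
    chunks = []
    n = len(norm_words)
    start = 0
    while start < n:
        # Binary-search the largest end index whose token count (special tokens included) fits.
        low = start + 1
        high = n
        # Always take at least one word, even if it alone exceeds max_length.
        best_end = start + 1
        while low <= high:
            mid = (low + high) // 2
            piece_len = len(tokenizer(norm_words[start:mid], is_split_into_words=True, truncation=False)["input_ids"])
            if piece_len <= max_length:
                best_end = mid
                low = mid + 1
            else:
                high = mid - 1
        if best_end <= start:
            best_end = start + 1
        chunks.append((start, best_end))
        if best_end >= n:
            break
        # Start the next chunk overlap_words before this one ended, always moving forward.
        next_start = max(start + 1, best_end - overlap_words)
        start = next_start
    return chunks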

def predict_ner_long_text(
    text: str,
    tokenizer,
    model,
    max_length: int = 512,
    overlap_words: int = 50,
) -> List[Dict]:
    """Tag every whitespace-separated word of text and return one {"word", "label"} dict per word."""
    raw_words = text.split()
    # Predict on normalized words, but report the original spelling.
    norm_words = [normalize_text_consistently(w) for w in raw_words]
    chunks = build_word_chunks(norm_words, tokenizer, max_length, overlap_words)
    global_labels = {i: "O" for i in range(len(raw_words))}
    for chunk_start, chunk_end in chunks:
        chunk_words = norm_words[chunk_start:chunk_end]
        inputs = tokenizer(chunk_words, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=max_length)
        word_ids = inputs.word_ids(batch_index=0)
        with torch.no_grad():
            outputs = model(**inputs)
            predictions = outputs.logits.argmax(dim=-1)[0]
        # Label each word by its first subword token; special tokens have word_idx None.
        local_preds = {}
        for token_idx, word_idx in enumerate(word_ids):
            if word_idx is not None and word_idx not in local_preds:
                local_preds[word_idx] = model.config.id2label[predictions[token_idx].item()]
        # Merge into global labels: in overlaps, the first non-O prediction wins.
        for local_idx, pred_label in local_preds.items():
            global_idx = chunk_start + local_idx
            if global_labels[global_idx] == "O" and pred_label != "O":
                global_labels[global_idx] = pred_label
    results = [{"word": raw_words[i], "label": global_labels[i]} for i in range(len(raw_words))]
    return results

# Example usage
long_text = "ꙋгарске велике гоⷭ҇пде наⷨ даде краⷧ҇ матеꙗⷲ҇ за нашꙋ слꙋжбꙋ ꙋ тамиⷲваркⷭои мегѥ моишꙋ и познаⷣ ꙋ бащинꙋ и потоле наⷨ даде краⷧ҇ матеꙗⷲ҇"  # Your long historical text

results = predict_ner_long_text(long_text, tokenizer, model)
for item in results:
    if item["label"] != "O":
        print(f"[{item['label']}] {item['word']}")

Example Output

For the sample text above, the model might output:

[B-DEMO] ꙋгарске
[B-LOC] тамиⷲваркⷭои
[B-LOC] моишꙋ
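
Because the output is one BIO tag per word, multi-word entities have to be reassembled downstream. The sketch below shows one way to do that, assuming the list-of-dicts format returned by predict_ner_long_text. The helper name group_entities is hypothetical, the sample input is hand-written with placeholder words (it is not model output), and a stray I- tag with no matching preceding tag is treated as the start of a new entity.

```python
from typing import Dict, List

def group_entities(results: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge consecutive B-/I- tagged words of the same type into entity spans."""
    entities: List[Dict[str, str]] = []
    current = None  # the entity currently being extended, if any
    for item in results:
        label = item["label"]
        if label == "O":
            current = None
            continue
        prefix, _, ent_type = label.partition("-")
        if prefix == "I" and current is not None and current["type"] == ent_type:
            # Continuation of the open entity.
            current["text"] += " " + item["word"]
        else:
            # A B- tag, a type change, or a stray I- tag opens a new entity.
            current = {"type": ent_type, "text": item["word"]}
            entities.append(current)
    return entities

# Hand-written placeholder input, not actual model output.
sample = [
    {"word": "king", "label": "O"},
    {"word": "Stefan", "label": "B-PER"},
    {"word": "Uroš", "label": "I-PER"},
    {"word": "in", "label": "O"},
    {"word": "Ras", "label": "B-LOC"},
]
entities = group_entities(sample)
print(entities)
```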

Training Data

The training corpus consists of manually annotated pre-modern Serbian Cyrillic texts, including medieval charters and archival material (the Banjska Chrysobull, the third example of the Dečanska Chrysobull, and a collection of 13th-century charters and letters from the Dubrovnik archive). Entities were annotated in BIO format with the labels PER, LOC, and DEMO.

Synthetic data augmentation was used to increase variation and improve robustness to historical orthography and domain shift.

Limitations

  • Designed for historical Serbian Cyrillic texts; may not perform well on modern Serbian or other languages.
  • Overlapping chunks help with long texts, but entities near chunk boundaries may still be missed or tagged inconsistently in rare cases.
  • Input should be normalized (combining marks and special characters removed, as in normalize_text_consistently above) for best results.
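
To show what the normalization requirement means in practice, the sketch below strips combining marks from a single word using only the standard library. It mirrors the category filter in normalize_text_consistently but is a simplified stand-in: the name strip_marks is hypothetical, and it omits the explicit special-character rule. The sample word is written with Unicode escapes: Cyrillic к, р, а followed by a combining Cyrillic letter el (U+2DE7) and a combining pokrytie (U+0487).

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Drop nonspacing marks (Mn) and format characters (Cf) after NFC normalization."""
    nfc = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in nfc if unicodedata.category(ch) not in ("Mn", "Cf"))

# к + р + а + combining el (U+2DE7) + combining pokrytie (U+0487)
word = "\u043a\u0440\u0430\u2de7\u0487"
cleaned = strip_marks(word)
print(len(word), len(cleaned))  # 5 3
```

Only the three base letters survive, which is the form the model sees in the usage example.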

Authors

For more details, see the model card on Hugging Face.
