Old-Serb-XLM-RoBERTa: Using XLM-RoBERTa for Named Entity Recognition on Historical Serbian Texts
This model is a fine-tuned version of XLM-RoBERTa-base for Named Entity Recognition (NER) on 16th-century Serbian historical documents. It identifies persons (PER), locations (LOC), and ethnicity terms (DEMO), tagged in BIO format.
Model Details
- Base Model: xlm-roberta-base
- Task: Token Classification (NER)
- Labels: BIO-tagged entities including PER, LOC, and DEMO (see label_config.json for the full list)
- Training Data: Annotated historical Serbian charters with synthetic data augmentation
- Max Sequence Length: 512 tokens (standard for XLM-RoBERTa)
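To make the BIO scheme concrete, the sketch below expands the three entity types named above into a tag set. The helper `build_bio_labels` and the ordering of the tags are illustrative assumptions; the authoritative mapping is the one in label_config.json (exposed at runtime as `model.config.id2label`), which may contain further labels.

```python
from typing import Dict, List


def build_bio_labels(entity_types: List[str]) -> List[str]:
    """Expand entity types into BIO tags: "O", then B-X and I-X per type."""
    labels = ["O"]  # "O" marks words outside any entity
    for ent in entity_types:
        labels.append(f"B-{ent}")  # first word of an entity
        labels.append(f"I-{ent}")  # continuation word of the same entity
    return labels


# Hypothetical ordering, for illustration only; the real mapping ships with the model.
labels = build_bio_labels(["PER", "LOC", "DEMO"])
id2label: Dict[int, str] = dict(enumerate(labels))
print(labels)
```

With three entity types this yields seven tags: one "O" tag plus a B-/I- pair per type.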
Installation
Install the required libraries:
pip install transformers torch
Usage
Long texts are processed in overlapping chunks to avoid truncation. Because of the overlap, a word near a chunk boundary is also seen with more surrounding context in the neighboring chunk, which helps with entities that span a boundary.
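The windowing idea can be seen in isolation with a toy sketch. It is a simplification under stated assumptions: `fixed_word_windows` is a hypothetical helper that uses a fixed number of words per window, whereas the usage code below sizes each window by subword-token count.

```python
from typing import List, Tuple


def fixed_word_windows(n_words: int, window: int, overlap: int) -> List[Tuple[int, int]]:
    """Return [start, end) word spans of at most `window` words,
    where consecutive spans share `overlap` words."""
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    spans = []
    start = 0
    while start < n_words:
        end = min(start + window, n_words)
        spans.append((start, end))
        if end == n_words:
            break
        start = end - overlap  # step back so the next span re-reads the tail
    return spans


print(fixed_word_windows(n_words=10, window=5, overlap=2))
# [(0, 5), (3, 8), (6, 10)]
```

Words 3-4 and 6-7 each appear in two windows, so they are labeled once near the end of one window and once near the start of the next.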
import unicodedata
from typing import List, Dict, Tuple
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
model_name = "MarijaMaja/xlm-roberta-base-finetuned-old-serbian-ner"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
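def normalize_text_consistently(text: str) -> str:
    """Compose to NFC, then drop combining marks, format characters, and 'ⸯ'."""
    s = unicodedata.normalize("NFC", text)
    cleaned = []
    for ch in s:
        cat = unicodedata.category(ch)
        if ch == "ⸯ":
            continue
        # Mn = nonspacing combining marks, Cf = invisible format characters
        if cat in ("Mn", "Cf"):
            continue
        cleaned.append(ch)
    return "".join(cleaned)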
def build_word_chunks(
    norm_words: List[str],
    tokenizer,
    max_length: int = 512,
    overlap_words: int = 50,
) -> List[Tuple[int, int]]:
    """Split the word list into overlapping [start, end) spans that each fit in max_length tokens."""
    chunks = []
    n = len(norm_words)
    start = 0
    while start < n:
        # Binary search for the largest end whose tokenization still fits.
        low = start + 1
        high = n
        best_end = start + 1
        while low <= high:
            mid = (low + high) // 2
            piece_len = len(tokenizer(norm_words[start:mid], is_split_into_words=True, truncation=False)["input_ids"])
            if piece_len <= max_length:
                best_end = mid
                low = mid + 1
            else:
                high = mid - 1
        # Defensive guard: a chunk always contains at least one word.
        if best_end <= start:
            best_end = start + 1
        chunks.append((start, best_end))
        if best_end >= n:
            break
        # Step back by overlap_words so the next chunk re-reads the tail of this one.
        next_start = max(start + 1, best_end - overlap_words)
        start = next_start
    return chunks
def predict_ner_long_text(
    text: str,
    tokenizer,
    model,
    max_length: int = 512,
    overlap_words: int = 50,
) -> List[Dict]:
    """Label every whitespace-separated word of text, chunk by chunk."""
    raw_words = text.split()
    norm_words = [normalize_text_consistently(w) for w in raw_words]
    chunks = build_word_chunks(norm_words, tokenizer, max_length, overlap_words)
    global_labels = {i: "O" for i in range(len(raw_words))}
    for chunk_start, chunk_end in chunks:
        chunk_words = norm_words[chunk_start:chunk_end]
        inputs = tokenizer(chunk_words, is_split_into_words=True, return_tensors="pt", truncation=True, max_length=max_length)
        word_ids = inputs.word_ids(batch_index=0)
        with torch.no_grad():
            outputs = model(**inputs)
        predictions = outputs.logits.argmax(dim=-1)[0]
        # Use the prediction of each word's first subword token.
        local_preds = {}
        for token_idx, word_idx in enumerate(word_ids):
            if word_idx is not None and word_idx not in local_preds:
                local_preds[word_idx] = model.config.id2label[predictions[token_idx].item()]
        # In overlaps, the first non-"O" label assigned to a word is kept.
        for local_idx, pred_label in local_preds.items():
            global_idx = chunk_start + local_idx
            if global_labels[global_idx] == "O" and pred_label != "O":
                global_labels[global_idx] = pred_label
    results = [{"word": raw_words[i], "label": global_labels[i]} for i in range(len(raw_words))]
    return results
# Example usage
long_text = "ꙋгарске велике гоⷭ҇пде наⷨ даде краⷧ҇ матеꙗⷲ҇ за нашꙋ слꙋжбꙋ ꙋ тамиⷲваркⷭои мегѥ моишꙋ и познаⷣ ꙋ бащинꙋ и потоле наⷨ даде краⷧ҇ матеꙗⷲ҇" # Your long historical text
results = predict_ner_long_text(long_text, tokenizer, model)
for item in results:
    if item["label"] != "O":
        print(f"[{item['label']}] {item['word']}")
Example Output
For a sample text, the model might output:
[B-DEMO] ꙋгарске
[B-LOC] тамиⷲваркⷭои
[B-LOC] моишꙋ
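Word-level BIO tags like these are often easier to consume once merged into entity spans. The sketch below shows one way to do that; `group_entities` is a hypothetical helper, not part of the model's code, and the sample list is hand-written for illustration rather than actual model output.

```python
from typing import Dict, List, Optional


def group_entities(results: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Merge consecutive B-/I- tagged words into entity spans."""
    entities: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None
    for item in results:
        prefix, _, ent_type = item["label"].partition("-")
        starts_new = prefix == "B" or (
            prefix == "I" and (current is None or current["type"] != ent_type)
        )
        if starts_new:
            # A stray I- tag with no matching open entity is treated as a start.
            current = {"type": ent_type, "text": item["word"]}
            entities.append(current)
        elif prefix == "I":
            current["text"] += " " + item["word"]
        else:
            current = None  # "O" closes any open entity
    return entities


# Hand-written sample in the same shape as the list returned by predict_ner_long_text.
sample = [
    {"word": "краль", "label": "O"},
    {"word": "Стефан", "label": "B-PER"},
    {"word": "Урош", "label": "I-PER"},
    {"word": "у", "label": "O"},
    {"word": "Призрену", "label": "B-LOC"},
]
entities = group_entities(sample)
for ent in entities:
    print(ent["type"], ent["text"])
```

Treating a stray I- tag as the start of a new entity is a design choice made here so that no tagged word is silently dropped.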
Training Data
The training corpus consists of manually annotated pre-modern Serbian Cyrillic texts, including medieval charters and archival material (the Banjska Chrysobull, the third example of the Dečanska Chrysobull, and a collection of 13th-century charters and letters from the Dubrovnik archive). Entities were annotated in BIO format with the labels PER, LOC, and DEMO.
Synthetic data augmentation was used to increase variation and improve robustness to historical orthography and domain shift.
Limitations
- Designed for historical Serbian Cyrillic texts; may not perform well on modern Serbian or other languages.
- Chunking with overlap helps with long texts, but entities near a chunk boundary may still be missed or mislabeled in rare cases.
- For best results, normalize input the same way as the usage code does (NFC composition, then removal of combining marks and format characters).
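The effect of that normalization is easy to check on a single word. The sketch below is a compact restatement of the filtering rule from the usage code, written as a hypothetical helper `strip_marks`; it needs only the standard library.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """NFC-normalize, then drop combining marks (Mn), format characters (Cf), and 'ⸯ'."""
    composed = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in composed
        if ch != "ⸯ" and unicodedata.category(ch) not in ("Mn", "Cf")
    )


# "кра" + COMBINING CYRILLIC LETTER EL (U+2DE7) + COMBINING CYRILLIC POKRYTIE (U+0487)
abbreviated = "кра\u2de7\u0487"
print(strip_marks(abbreviated))  # кра

# NFC runs first, so и + COMBINING BREVE (U+0306) composes to й and survives the filter.
print(strip_marks("и\u0306") == "\u0439")  # True
```

Superscript letters are dropped rather than expanded, so an abbreviated form is reduced to its base letters; this is the form the usage code feeds to the model.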
Authors
For more details, see the model card on Hugging Face.