EuroVoc Domain Classifier (bert-base-uncased)

A fine-tuned BERT model that classifies EU legislation into the 21 top-level EuroVoc thematic domains.

What it does

Given the preamble of an EU regulation, decision, or directive, the model predicts which of the 21 EuroVoc domains apply (multi-label classification). For example, a regulation about carbon border adjustments might be classified under Energy, Trade, and Environment.
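
To make "multi-label" concrete: each document's target is a vector with one 0/1 slot per domain, and several slots can be set at once. The sketch below encodes the carbon border adjustment example that way. The helper name `to_multi_hot` is hypothetical, and the label order is assumed to match the `LABELS` list in the Usage section.

```python
# The 21 top-level EuroVoc domains, in the order used in the Usage section
# (assumed to match the model's output indices).
LABELS = [
    "AGRI-FOODSTUFFS", "AGRICULTURE, FORESTRY AND FISHERIES",
    "BUSINESS AND COMPETITION", "ECONOMICS", "EDUCATION AND COMMUNICATIONS",
    "EMPLOYMENT AND WORKING CONDITIONS", "ENERGY", "ENVIRONMENT",
    "EUROPEAN UNION", "FINANCE", "GEOGRAPHY", "INDUSTRY",
    "INTERNATIONAL ORGANISATIONS", "INTERNATIONAL RELATIONS", "LAW",
    "POLITICS", "PRODUCTION, TECHNOLOGY AND RESEARCH", "SCIENCE",
    "SOCIAL QUESTIONS", "TRADE", "TRANSPORT",
]


def to_multi_hot(domains, labels=LABELS):
    """Encode a set of domain names as a 0/1 vector with one slot per label."""
    unknown = set(domains) - set(labels)
    if unknown:
        raise ValueError(f"Unknown domains: {sorted(unknown)}")
    return [1 if name in domains else 0 for name in labels]


# Hypothetical target for the carbon border adjustment example.
target = to_multi_hot({"ENERGY", "TRADE", "ENVIRONMENT"})
print(target)
```

Unlike single-label classification, nothing forces the vector to sum to 1; a document can carry any number of domains.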

Performance

Metric             Score
F1 micro           0.900
F1 macro           0.800
Optimal threshold  0.40

Evaluated on 890 held-out EU regulations published between September 2025 and March 2026. Ground truth labels were assigned by professional librarians at the EU Publications Office.
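
For readers unfamiliar with the two averages: micro F1 pools true positives, false positives and false negatives across all labels before computing F1, while macro F1 computes F1 per label and takes the unweighted mean. The sketch below is a minimal pure-Python illustration on toy data, not the evaluation script used for this model. The function names are hypothetical, and treating an undefined F1 as 0.0 is an assumed convention.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when undefined (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """Return (micro F1, macro F1) for lists of 0/1 label vectors."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    # Macro: one F1 per label, averaged with equal weight.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    return micro, macro


# Toy data: label 0 is frequent and always correct; labels 1 and 2 are rare
# and always missed.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 0]]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

In the toy data, missing the rare labels lowers macro F1 far more than micro F1. A gap between the two scores typically indicates that less frequent labels are predicted less accurately than frequent ones.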

Context: Series Results

Method                        F1 (micro)  Cost
bert-base (this model)        0.900       €10
Llama 3.1 8B (QLoRA)          0.892       €83
EUBERT (fine-tuned)           0.891       free
TF-IDF + Logistic Regression  0.799       free
DeepSeek-R1-70B (zero-shot)   0.562       ~€12

Full write-up: Pimp My LM: A Fine-Tuning Tale of Bling and Basic

The 21 Labels

AGRI-FOODSTUFFS; AGRICULTURE, FORESTRY AND FISHERIES;
BUSINESS AND COMPETITION; ECONOMICS; EDUCATION AND COMMUNICATIONS;
EMPLOYMENT AND WORKING CONDITIONS; ENERGY; ENVIRONMENT;
EUROPEAN UNION; FINANCE; GEOGRAPHY; INDUSTRY;
INTERNATIONAL ORGANISATIONS; INTERNATIONAL RELATIONS; LAW;
POLITICS; PRODUCTION, TECHNOLOGY AND RESEARCH; SCIENCE;
SOCIAL QUESTIONS; TRADE; TRANSPORT

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "jngb-labs/eurovoc-bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

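# Input is the preamble text; anything beyond 512 tokens is truncated.
text = "Regulation establishing a carbon border adjustment mechanism..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Multi-label output: apply an independent sigmoid per label, not a softmax.
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.sigmoid(logits)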

LABELS = [
    "AGRI-FOODSTUFFS", "AGRICULTURE, FORESTRY AND FISHERIES",
    "BUSINESS AND COMPETITION", "ECONOMICS", "EDUCATION AND COMMUNICATIONS",
    "EMPLOYMENT AND WORKING CONDITIONS", "ENERGY", "ENVIRONMENT",
    "EUROPEAN UNION", "FINANCE", "GEOGRAPHY", "INDUSTRY",
    "INTERNATIONAL ORGANISATIONS", "INTERNATIONAL RELATIONS", "LAW",
    "POLITICS", "PRODUCTION, TECHNOLOGY AND RESEARCH", "SCIENCE",
    "SOCIAL QUESTIONS", "TRADE", "TRANSPORT"
]

# Keep every label whose probability exceeds the reported optimal threshold.
threshold = 0.40
predictions = [LABELS[i] for i, p in enumerate(probs[0]) if p > threshold]
print(predictions)

Training Details

  • Base model: bert-base-uncased (110M parameters)
  • Training data: 63,918 EU regulations (preambles), sourced from EUR-Lex via CELLAR API
  • Test data: 890 held-out regulations, labels assigned by EU Publications Office librarians
  • Architecture: BERT pooler → Dropout(0.1) → Linear(768, 21) → Sigmoid (applied at inference; the loss below takes raw logits)
  • Loss: BCEWithLogitsLoss
  • Epochs: 3
  • Hardware: Nvidia T4 (Google Colab)
  • Training time: 162 minutes
  • Cost: €10 (Colab compute units)
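
A note on the loss listed above: PyTorch's BCEWithLogitsLoss combines the sigmoid and binary cross-entropy in one numerically stable step, so it is fed the raw outputs of the linear layer. The sketch below reproduces that computation in plain Python for a single example, averaged over labels. It illustrates the formula and is not the training code; the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)), which equals
    -[y*log(sigmoid(x)) + (1-y)*log(1-sigmoid(x))] without overflow.
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# Toy example with 3 labels instead of 21.
logits = [2.0, -1.0, 0.0]
targets = [1, 0, 1]

loss = bce_with_logits(logits, targets)
probs = [sigmoid(x) for x in logits]  # what inference would threshold
print(f"loss={loss:.4f}", [round(p, 3) for p in probs])
```

Each label contributes its own independent binary term, which is why several domains can be predicted for one document.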

Limitations

  • Trained on English-language preambles only (EU legislation is published in 24 languages)
  • Multi-label threshold (0.40) was optimised on the test set, so the reported F1 scores may be slightly optimistic; the threshold may need adjustment for other corpora
  • Classification granularity is limited to 21 top-level domains; finer EuroVoc concepts are not predicted

License

Apache 2.0

Author

Jakob Neugebauer
