EuroVoc Domain Classifier (bert-base-uncased)

A fine-tuned BERT model that classifies EU legislation into the 21 top-level EuroVoc thematic domains.

What it does

Given the preamble of an EU regulation, decision, or directive, the model predicts which of the 21 EuroVoc domains apply (multi-label classification). For example, a regulation about carbon border adjustments might be classified under Energy, Trade, and Environment.

Performance

Metric	Score
F1 micro	0.900
F1 macro	0.800
Optimal threshold	0.40

Evaluated on 890 held-out EU regulations published between September 2025 and March 2026. Ground truth labels were assigned by professional librarians at the EU Publications Office.
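The two F1 scores differ because they average differently: micro F1 pools true positives, false positives, and false negatives across all 21 labels, while macro F1 averages the per-label F1 scores, so rare labels weigh as much as frequent ones. The following is a minimal pure-Python sketch of both averages on toy data. The function names and the zero-division convention (F1 = 0 when a label has no true or predicted positives) are assumptions for illustration; the scores above were not necessarily computed with this code.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """y_true and y_pred are lists of 0/1 rows, one row per document."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1_from_counts(*c) for c in counts) / n_labels
    return micro, macro


# Toy data with 3 labels: the frequent label 0 is always right,
# the rare labels 1 and 2 are always missed.
y_true = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 0]]
micro, macro = micro_macro_f1(y_true, y_pred)
print(round(micro, 3), round(macro, 3))  # 0.75 0.333
```

In the toy data the missed rare labels pull macro F1 well below micro F1, which is the usual reading of a gap like 0.900 versus 0.800.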

Context: Series Results

Method	F1 (micro)	Cost
bert-base (this model)	0.900	€10
Llama 3.1 8B (QLoRA)	0.892	€83
EUBERT (fine-tuned)	0.891	free
TF-IDF + Logistic Regression	0.799	free
DeepSeek-R1-70B (zero-shot)	0.562	~€12

Full write-up: Pimp My LM: A Fine-Tuning Tale of Bling and Basic

The 21 Labels

AGRI-FOODSTUFFS; AGRICULTURE, FORESTRY AND FISHERIES;
BUSINESS AND COMPETITION; ECONOMICS; EDUCATION AND COMMUNICATIONS;
EMPLOYMENT AND WORKING CONDITIONS; ENERGY; ENVIRONMENT;
EUROPEAN UNION; FINANCE; GEOGRAPHY; INDUSTRY;
INTERNATIONAL ORGANISATIONS; INTERNATIONAL RELATIONS; LAW;
POLITICS; PRODUCTION, TECHNOLOGY AND RESEARCH; SCIENCE;
SOCIAL QUESTIONS; TRADE; TRANSPORT
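As a rough illustration of what a multi-label target looks like, the sketch below turns a set of domain names into a 21-dimensional multi-hot vector. The label order is copied from the Usage snippet and is assumed to match the model's output indices; `to_multi_hot` is a hypothetical helper, not part of the released code.

```python
LABELS = [
    "AGRI-FOODSTUFFS", "AGRICULTURE, FORESTRY AND FISHERIES",
    "BUSINESS AND COMPETITION", "ECONOMICS", "EDUCATION AND COMMUNICATIONS",
    "EMPLOYMENT AND WORKING CONDITIONS", "ENERGY", "ENVIRONMENT",
    "EUROPEAN UNION", "FINANCE", "GEOGRAPHY", "INDUSTRY",
    "INTERNATIONAL ORGANISATIONS", "INTERNATIONAL RELATIONS", "LAW",
    "POLITICS", "PRODUCTION, TECHNOLOGY AND RESEARCH", "SCIENCE",
    "SOCIAL QUESTIONS", "TRADE", "TRANSPORT",
]
LABEL_TO_INDEX = {name: i for i, name in enumerate(LABELS)}


def to_multi_hot(domains):
    """Return a 0/1 vector with a 1 at the index of every given domain."""
    vector = [0] * len(LABELS)
    for name in domains:
        vector[LABEL_TO_INDEX[name]] = 1  # raises KeyError on an unknown name
    return vector


# The carbon border adjustment example: three domains apply at once.
target = to_multi_hot(["ENERGY", "TRADE", "ENVIRONMENT"])
active = [i for i, v in enumerate(target) if v == 1]
print(active)  # [6, 7, 19]
```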

Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "jngb-labs/eurovoc-bert-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

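text = "Regulation establishing a carbon border adjustment mechanism..."
# BERT accepts at most 512 tokens, so longer preambles are truncated.
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Inference only: skip gradient tracking, then map logits to per-label probabilities.
with torch.no_grad():
    logits = model(**inputs).logits
    probs = torch.sigmoid(logits)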

LABELS = [
    "AGRI-FOODSTUFFS", "AGRICULTURE, FORESTRY AND FISHERIES",
    "BUSINESS AND COMPETITION", "ECONOMICS", "EDUCATION AND COMMUNICATIONS",
    "EMPLOYMENT AND WORKING CONDITIONS", "ENERGY", "ENVIRONMENT",
    "EUROPEAN UNION", "FINANCE", "GEOGRAPHY", "INDUSTRY",
    "INTERNATIONAL ORGANISATIONS", "INTERNATIONAL RELATIONS", "LAW",
    "POLITICS", "PRODUCTION, TECHNOLOGY AND RESEARCH", "SCIENCE",
    "SOCIAL QUESTIONS", "TRADE", "TRANSPORT"
]

# Keep every label whose probability exceeds the threshold (zero or more labels).
threshold = 0.40
predictions = [LABELS[i] for i, p in enumerate(probs[0]) if p > threshold]
print(predictions)
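If the scores are wanted alongside the label names, a small helper along these lines may be convenient. It is a sketch only: `decode_predictions` is a hypothetical name, and it works on plain Python floats, so with the snippet above it would be called as `decode_predictions(probs[0].tolist(), LABELS)`.

```python
def decode_predictions(probs, labels, threshold=0.40):
    """Return (label, probability) pairs above the threshold, highest first."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    selected = [(label, float(p)) for label, p in zip(labels, probs) if p > threshold]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Toy example with three labels instead of the full 21.
toy_labels = ["ENERGY", "ENVIRONMENT", "TRADE"]
toy_probs = [0.91, 0.35, 0.62]
print(decode_predictions(toy_probs, toy_labels))
# [('ENERGY', 0.91), ('TRADE', 0.62)]
```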

Training Details

Base model: bert-base-uncased (110M parameters)
Training data: 63,918 EU regulations (preambles), sourced from EUR-Lex via the CELLAR API
Test data: 890 held-out regulations, labels assigned by EU Publications Office librarians
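Architecture: BERT pooler → Dropout(0.1) → Linear(768, 21); a sigmoid converts the 21 logits into per-label probabilities at inference
Loss: BCEWithLogitsLoss (applies the sigmoid internally, so the model outputs raw logits during training)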
Epochs: 3
Hardware: NVIDIA T4 (Google Colab)
Training time: 162 minutes
Cost: €10 (Colab compute units)
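The loss listed above, BCEWithLogitsLoss, takes raw logits and folds the sigmoid into the loss for numerical stability, treating each of the 21 labels as an independent yes/no decision. The sketch below writes out the standard per-element formula in plain Python, averaged over all elements as in PyTorch's default `mean` reduction. It illustrates the general formulation and is not the author's training code; `bce_with_logits` is a hypothetical name.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over (logit, target) pairs.

    Uses the stable form max(x, 0) - x*y + log(1 + exp(-|x|)),
    which equals -[y*log(sigmoid(x)) + (1-y)*log(1 - sigmoid(x))]
    without overflowing for large |x|.
    """
    total = 0.0
    for x, y in zip(logits, targets):
        total += max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
    return total / len(logits)


# A logit of 0 means probability 0.5, so the loss for a positive label is ln 2.
print(round(bce_with_logits([0.0], [1.0]), 3))  # 0.693

# A confident, correct prediction costs little; a confident, wrong one costs a lot.
print(bce_with_logits([4.0], [1.0]) < bce_with_logits([4.0], [0.0]))  # True
```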

Limitations

Trained on English-language preambles only (EU legislation is published in 24 languages)
The multi-label threshold (0.40) was optimised on the test set, so the reported F1 scores may be slightly optimistic; the threshold may need adjustment for other corpora
Classification granularity is limited to 21 top-level domains; finer EuroVoc concepts are not predicted
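For a different corpus, one way to re-tune the threshold is to sweep candidate values on a labelled validation set and keep the one with the best micro F1. The sketch below assumes such a validation set exists and that the model's probabilities have already been collected as plain lists; the function names, the candidate grid, and the toy data are all illustrative assumptions.

```python
def micro_f1(y_true, y_prob, threshold):
    """Micro-averaged F1 after binarising probabilities at the threshold."""
    tp = fp = fn = 0
    for true_row, prob_row in zip(y_true, y_prob):
        for t, p in zip(true_row, prob_row):
            predicted = p > threshold
            if predicted and t:
                tp += 1
            elif predicted:
                fp += 1
            elif t:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, y_prob, candidates):
    """Return the candidate with the highest micro F1 (lowest one on ties)."""
    return max(candidates, key=lambda th: micro_f1(y_true, y_prob, th))


# Toy validation set: 3 documents, 2 labels.
y_true = [[1, 0], [0, 1], [1, 1]]
y_prob = [[0.81, 0.28], [0.47, 0.72], [0.63, 0.36]]
candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

th = best_threshold(y_true, y_prob, candidates)
print(th, round(micro_f1(y_true, y_prob, th), 3))
```

Tuning on a validation split rather than the test split keeps the test score an unbiased estimate of performance on unseen documents.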

License

Apache 2.0

Author

Jakob Neugebauer
