---
language:
- fr
license: apache-2.0
library_name: transformers
pipeline_tag: fill-mask
tags:
- medical
- healthcare
- biomedical
- clinical
- french
- modernbert
- long-context
- fill-mask
- doctobert
datasets:
- doctolib-lab/finemed-fr
- doctolib-lab/finemed-rephrased-fr
---

# DoctoModernBERT-fr-base

🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT

## 📚 Introduction

**DoctoModernBERT-fr-base** is a French medical encoder for biomedical and clinical NLP. It uses the **ModernBERT** architecture (149M parameters, up to 8192-token context) and is pretrained from scratch on [FineMed-fr](https://huggingface.co/datasets/doctolib-lab/finemed-fr) and [FineMed-rephrased-fr](https://huggingface.co/datasets/doctolib-lab/finemed-rephrased-fr). DoctoModernBERT performs best on a real-world proprietary clinical NER task and ranks among the top encoders on the academic DrBenchmark.

Its training data is curated from heterogeneous open web corpora (FineWeb-2, FinePDFs, FineWiki), which bring scale, source and stylistic diversity, and inherited quality control. Each document is annotated along three axes: subdomain, educational quality, and medical-term density. DoctoModernBERT then trains on a mix of two parts:

- the filtered high-quality, entity-rich documents;
- LLM-rephrased variants that raise medical-term density and vary the contexts around medical terms across many genres, audiences, and registers.

This breadth and density build robustness to real-world clinical text, which is often noisy and inconsistent in style.

For a classic RoBERTa encoder, see its sibling [DoctoBERT-fr-base](https://huggingface.co/doctolib-lab/doctobert-fr-base).

## 🚀 How to Use

Requires a recent version of `transformers` with ModernBERT support (4.48.0 or later).

### Fill-mask

Using `AutoModelForMaskedLM`:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "doctolib-lab/doctomodernbert-fr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = f"Le patient souffre d'une {tokenizer.mask_token} aiguë."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits

# Locate the mask position and decode the highest-scoring token.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
print(tokenizer.decode(logits[0, masked_index].argmax(-1)))
```

Using a `pipeline`:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="doctolib-lab/doctomodernbert-fr-base")
# Returns the top candidate tokens for the masked position, with scores.
print(fill(f"Le patient souffre d'une {fill.tokenizer.mask_token} aiguë."))
```

For long inputs, load with `attn_implementation="flash_attention_2"` for faster, more memory-efficient attention (this requires the `flash-attn` package and a compatible GPU).

### Fine-tuning

DoctoModernBERT fine-tunes like any BERT/ModernBERT encoder, with the appropriate task head or framework:

- For **sequence classification**, load it with `AutoModelForSequenceClassification` (see the [text-classification guide](https://huggingface.co/docs/transformers/tasks/sequence_classification)).
- For **token classification (NER)**, use `AutoModelForTokenClassification` (see the [token-classification guide](https://huggingface.co/docs/transformers/tasks/token_classification)).
- For **embeddings / retrieval**, use [Sentence Transformers](https://www.sbert.net/) or [PyLate](https://github.com/lightonai/pylate).

## 📐 Model Overview

| Property | Value |
| --- | --- |
| Architecture | ModernBERT |
| Parameters | 149M total (110M backbone, 39M embeddings) |
| Layers | 22 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP | GeGLU |
| Intermediate size | 1152 |
| Context window | 8192 tokens |
| Vocabulary size | 50,368 |
| Language | French |

## 🔧 Training

The tokenizer is a SentencePiece BPE model with a 50,368-token vocabulary, trained on the entity-rich FineMed-filtered subset (educational quality >= 4 and medical-term density >= 0.1) so that dense medical vocabulary is tokenized efficiently. It splits digits into single tokens and uses byte fallback for unseen characters.

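
The FineMed-filtered criterion above can be written as a simple predicate. This is a minimal sketch, not the pipeline used to build the dataset: the field names `educational_quality` and `medical_term_density` are hypothetical placeholders (the actual column names in FineMed-fr may differ), and the sample records are made up for illustration. Only the two thresholds come from this card.

```python
# Thresholds stated in this model card.
MIN_EDUCATIONAL_QUALITY = 4
MIN_MEDICAL_TERM_DENSITY = 0.1


def keep_document(doc: dict) -> bool:
    """Return True if a document passes the FineMed-filtered criterion.

    The dictionary keys are hypothetical; adapt them to the real schema.
    """
    return (
        doc["educational_quality"] >= MIN_EDUCATIONAL_QUALITY
        and doc["medical_term_density"] >= MIN_MEDICAL_TERM_DENSITY
    )


# Made-up records, for illustration only.
sample_docs = [
    {"id": "a", "educational_quality": 5, "medical_term_density": 0.25},
    {"id": "b", "educational_quality": 3, "medical_term_density": 0.40},
    {"id": "c", "educational_quality": 4, "medical_term_density": 0.05},
    {"id": "d", "educational_quality": 4, "medical_term_density": 0.10},
]

kept_ids = [doc["id"] for doc in sample_docs if keep_document(doc)]
print(kept_ids)  # documents "a" and "d" satisfy both thresholds
```

With the `datasets` library, the same predicate could be passed to `Dataset.filter` once the keys match the actual schema.
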
DoctoModernBERT is pretrained from scratch on a mix of FineMed-filtered (FineMed-fr under the same educational-quality and medical-term-density filter) and FineMed-rephrased-fr, over three phases totaling 240B tokens:

1. **Pretraining (200B tokens).** Masked-language modeling at 1024-token context on the full mix, building broad medical-language representations. Long documents are hierarchically chunked to fit the context window rather than truncated.
2. **Context extension (20B tokens).** Extends the context window from 1024 to 8192 tokens, training on a subset upsampled toward long documents.
3. **Annealing (20B tokens).** Continued training on the biomedical and clinical subdomains of the mix, focusing the final updates on content closest to downstream medical use.

## 📊 Evaluation

Across 11 encoders from the English medical, French generalist, and French medical families, DoctoModernBERT-fr-base achieves the best F1 on the real-world clinical NER task and ranks among the top encoders on the academic DrBenchmark.

### DrBenchmark

We use an adapted version of [DrBenchmark](https://github.com/doctolib-lab/DrBenchmark), filtered to 7 high-quality tasks: QUAERO (NER on EMEA drug leaflets and on MEDLINE abstracts), E3C (clinical NER and temporal NER), DEFT-2021 (clinical NER), MorFITT (biomedical specialty classification), and DiaMed (clinical diagnostic classification). For each model and task, hyperparameters are tuned on the validation split; we then report the mean test-split F1 over 5 seeds.

We also report two per-model scores aggregated across all tasks:

- **Min-Max** rescales each task's scores onto a 0–100 scale before averaging. It captures the magnitude of the gaps: a model that lags on a few tasks is penalized heavily.
- **WP** (Win Probability) is the average percentage of tasks on which a model outscores each other model. It captures rank consistency: it is robust to outliers but blind to effect size.

| Model | EMEA | MEDLINE | E3C-Clin | E3C-Temp | MORFITT | DEFT2021 | DIAMED | Min-Max | WP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **English medical** | | | | | | | | | |
| BioBERT | 58.77 | 50.29 | 55.02 | 78.29 | 66.99 | 56.72 | 59.26 | 29.97 | 15.71 |
| BioClinical-ModernBERT | 44.74 | 44.44 | 49.53 | 76.11 | 67.42 | 53.97 | 52.07 | 0.88 | 1.43 |
| ModernBERT-bio | 56.84 | 46.60 | 53.76 | 78.85 | 68.57 | 56.43 | 61.06 | 29.35 | 17.14 |
| **French generalist** | | | | | | | | | |
| CamemBERT | 65.43 | 56.18 | 59.82 | 83.81 | 71.54 | 62.40 | 60.26 | 69.37 | 57.14 |
| ModernCamemBERT | 61.98 | 55.46 | 57.62 | 83.11 | 70.01 | 60.01 | 53.26 | 52.69 | 28.57 |
| **French medical** | | | | | | | | | |
| DrBERT | 64.37 | 57.18 | 58.01 | 82.44 | 70.42 | 61.08 | 64.87 | 65.08 | 44.29 |
| CamemBERT-bio | 64.98 | 59.03 | 61.40 | 84.88 | 71.48 | 64.73 | 64.63 | 80.83 | 70.00 |
| TransBERT-bio-fr | 67.37 | 59.96 | 62.36 | 84.48 | 74.04 | 65.48 | 70.91 | 93.88 | 88.57 |
| ModernCamemBERT-bio | 65.35 | 56.81 | 58.63 | 83.31 | 71.21 | 61.35 | 67.77 | 71.37 | 54.29 |
| **Ours** | | | | | | | | | |
| DoctoBERT-fr | 68.39 | 62.54 | 62.75 | 84.60 | 73.36 | 66.41 | 72.56 | 98.17 | 97.14 |
| DoctoModernBERT-fr | 65.71 | 59.65 | 59.62 | 84.06 | 71.87 | 63.81 | 71.60 | 83.15 | 75.71 |
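
The two aggregate scores can be sketched in a few lines of Python. This follows one plausible reading of the definitions above and works from the rounded per-task scores in the table, so it is an illustration rather than the exact evaluation code. The function names are hypothetical, and counting ties as losses is an assumption.

```python
TASKS = ["EMEA", "MEDLINE", "E3C-Clin", "E3C-Temp", "MORFITT", "DEFT2021", "DIAMED"]

# Per-task mean F1, copied from the table (already rounded to 2 decimals).
SCORES = {
    "BioBERT": [58.77, 50.29, 55.02, 78.29, 66.99, 56.72, 59.26],
    "BioClinical-ModernBERT": [44.74, 44.44, 49.53, 76.11, 67.42, 53.97, 52.07],
    "ModernBERT-bio": [56.84, 46.60, 53.76, 78.85, 68.57, 56.43, 61.06],
    "CamemBERT": [65.43, 56.18, 59.82, 83.81, 71.54, 62.40, 60.26],
    "ModernCamemBERT": [61.98, 55.46, 57.62, 83.11, 70.01, 60.01, 53.26],
    "DrBERT": [64.37, 57.18, 58.01, 82.44, 70.42, 61.08, 64.87],
    "CamemBERT-bio": [64.98, 59.03, 61.40, 84.88, 71.48, 64.73, 64.63],
    "TransBERT-bio-fr": [67.37, 59.96, 62.36, 84.48, 74.04, 65.48, 70.91],
    "ModernCamemBERT-bio": [65.35, 56.81, 58.63, 83.31, 71.21, 61.35, 67.77],
    "DoctoBERT-fr": [68.39, 62.54, 62.75, 84.60, 73.36, 66.41, 72.56],
    "DoctoModernBERT-fr": [65.71, 59.65, 59.62, 84.06, 71.87, 63.81, 71.60],
}

models = list(SCORES)
n_tasks = len(TASKS)


def min_max_score(model: str) -> float:
    """Rescale each task to 0-100 across models, then average over tasks."""
    total = 0.0
    for t in range(n_tasks):
        column = [SCORES[m][t] for m in models]
        lo, hi = min(column), max(column)
        total += 100 * (SCORES[model][t] - lo) / (hi - lo)
    return total / n_tasks


def win_probability(model: str) -> float:
    """Percentage of (other model, task) pairs that `model` wins.

    Ties count as losses here, which is an assumption.
    """
    wins = sum(
        SCORES[model][t] > SCORES[other][t]
        for other in models
        if other != model
        for t in range(n_tasks)
    )
    return 100 * wins / ((len(models) - 1) * n_tasks)


for name in models:
    print(f"{name:24s} Min-Max={min_max_score(name):6.2f}  WP={win_probability(name):6.2f}")
```

Because WP depends only on pairwise rankings, it is insensitive to rounding; for the two "Ours" rows this sketch gives 97.14 and 75.71, as in the table. Min-Max depends on the exact score values, so results computed from the rounded table may deviate from the Min-Max column.
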

### Real-world Clinical NER

A proprietary French clinical NER task from a real-world production setting, annotated with 12 entity types (e.g., pathology, drug, exam, biometry) and 9 qualifiers (e.g., negation, family relationship, date). Scores are means over 3 seeds on the test split.

| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| **English medical** | | | |
| BioBERT | 77.54 | 78.42 | 77.97 |
| BioClinical-ModernBERT | 78.79 | 78.69 | 78.74 |
| ModernBERT-bio | 78.06 | 79.30 | 78.67 |
| **French generalist** | | | |
| CamemBERT | 77.19 | 79.58 | 78.36 |
| ModernCamemBERT | 78.53 | 78.71 | 78.62 |
| **French medical** | | | |
| DrBERT | 76.77 | 77.81 | 77.28 |
| CamemBERT-bio | 77.51 | 78.90 | 78.19 |
| TransBERT-bio-fr | 76.85 | 78.66 | 77.74 |
| ModernCamemBERT-bio | 78.17 | 79.76 | 78.95 |
| **Ours** | | | |
| DoctoBERT-fr | 77.29 | 79.68 | 78.47 |
| DoctoModernBERT-fr | 79.12 | 79.71 | 79.40 |
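
Since each column is a mean over seeds, the F1 column need not equal the harmonic mean of the Precision and Recall columns exactly. The sketch below is a consistency check on the rounded table values, not part of the evaluation code. It recomputes F1 from the averaged precision and recall and prints the gap to the reported F1; the helper name `f1_from` is hypothetical.

```python
# (precision, recall, reported F1), copied from the table.
ROWS = {
    "BioBERT": (77.54, 78.42, 77.97),
    "BioClinical-ModernBERT": (78.79, 78.69, 78.74),
    "ModernBERT-bio": (78.06, 79.30, 78.67),
    "CamemBERT": (77.19, 79.58, 78.36),
    "ModernCamemBERT": (78.53, 78.71, 78.62),
    "DrBERT": (76.77, 77.81, 77.28),
    "CamemBERT-bio": (77.51, 78.90, 78.19),
    "TransBERT-bio-fr": (76.85, 78.66, 77.74),
    "ModernCamemBERT-bio": (78.17, 79.76, 78.95),
    "DoctoBERT-fr": (77.29, 79.68, 78.47),
    "DoctoModernBERT-fr": (79.12, 79.71, 79.40),
}


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Gap between F1 recomputed from averaged P/R and the reported mean F1.
gaps = {name: f1_from(p, r) - f1 for name, (p, r, f1) in ROWS.items()}

for name, gap in gaps.items():
    print(f"{name:24s} recomputed - reported F1 = {gap:+.3f}")
```

On these values every gap is below 0.02 points, so the model ranking by F1 is unaffected by which of the two quantities is used.
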

## ⚠️ Intended Use & Limitations

DoctoModernBERT is an encoder for French biomedical and clinical NLP (NER, classification, retrieval, long-document tasks), used by fine-tuning on a downstream task. It is not generative and not a medical device; its outputs must not drive clinical decisions. It was pretrained on public web, PDF, and encyclopedic medical text and reflects the biases and gaps of those sources.

## ⚖️ License

Released under Apache-2.0. DoctoModernBERT was trained on FineMed-fr and FineMed-rephrased-fr, which derive from FineWeb-2 / FinePDFs (ODC-BY 1.0) and FineWiki (CC BY-SA 4.0); please attribute those upstream sources.

## 🏛️ Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.