---
language:
- fr
license: apache-2.0
library_name: transformers
pipeline_tag: fill-mask
tags:
- medical
- healthcare
- biomedical
- clinical
- french
- modernbert
- long-context
- fill-mask
- doctobert
datasets:
- doctolib-lab/finemed-fr
- doctolib-lab/finemed-rephrased-fr
---

# DoctoModernBERT-fr-base

## 📚 Introduction

**DoctoModernBERT-fr-base** is a French medical encoder for biomedical and clinical NLP. It uses the **ModernBERT** architecture (149M parameters, up to 8192-token context) and is pretrained from scratch on [FineMed-fr](https://huggingface.co/datasets/doctolib-lab/finemed-fr) and [FineMed-rephrased-fr](https://huggingface.co/datasets/doctolib-lab/finemed-rephrased-fr).

DoctoModernBERT performs best on a real-world proprietary clinical NER task and ranks among the top encoders on the academic DrBenchmark.

Its training data is curated from heterogeneous open web corpora (FineWeb-2, FinePDFs, FineWiki), which bring scale, source and stylistic diversity, and inherited quality control. Each document is annotated along three axes: subdomain, educational quality, and medical-term density. DoctoModernBERT then trains on a mix of two parts:

- the filtered high-quality, entity-rich documents;
- LLM-rephrased variants that raise medical-term density and vary the contexts around medical terms across many genres, audiences, and registers.

This breadth and density build robustness to real-world clinical text, which is often noisy and inconsistent in style.

For a classic RoBERTa encoder, see its sibling [DoctoBERT-fr-base](https://huggingface.co/doctolib-lab/doctobert-fr-base).

## 🚀 How to Use

Requires `transformers` 4.48.0 or later, the first release with ModernBERT support.

### Fill-mask

Using `AutoModelForMaskedLM`:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "doctolib-lab/doctomodernbert-fr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Build a sentence containing the tokenizer's own mask token.
text = f"Le patient souffre d'une {tokenizer.mask_token} aiguë."
inputs = tokenizer(text, return_tensors="pt")
logits = model(**inputs).logits

# Locate the masked position and decode the highest-scoring token there.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
print(tokenizer.decode(logits[0, masked_index].argmax(-1)))
```

Using a `pipeline`:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="doctolib-lab/doctomodernbert-fr-base")
# Returns the top candidate tokens for the masked position, with scores.
print(fill(f"Le patient souffre d'une {fill.tokenizer.mask_token} aiguë."))
```

For long inputs, load with `attn_implementation="flash_attention_2"` for faster, more memory-efficient attention (this requires the `flash-attn` package and a supported GPU).

### Fine-tuning

DoctoModernBERT fine-tunes like any BERT/ModernBERT encoder, with the appropriate task head or framework:

- For **sequence classification**, load it with `AutoModelForSequenceClassification` (see the [text-classification guide](https://huggingface.co/docs/transformers/tasks/sequence_classification)).
- For **token classification (NER)**, use `AutoModelForTokenClassification` (see the [token-classification guide](https://huggingface.co/docs/transformers/tasks/token_classification)).
- For **embeddings / retrieval**, use [Sentence Transformers](https://www.sbert.net/) or [PyLate](https://github.com/lightonai/pylate).

## 📐 Model Overview

| Property | Value |
| --- | --- |
| Architecture | ModernBERT |
| Parameters | 149M total (110M backbone, 39M embeddings) |
| Layers | 22 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP | GeGLU |
| Intermediate size | 1152 |
| Context window | 8192 tokens |
| Vocabulary size | 50,368 |
| Language | French |

## 🔧 Training

The tokenizer is a SentencePiece BPE model of 50,368 tokens, trained on the entity-rich FineMed-filtered subset (educational quality >= 4 and medical-term density >= 0.1) for efficient tokenization of dense medical vocabulary. It splits digits into single tokens and uses byte fallback for unseen characters.

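The FineMed-filtered criterion described above (educational quality >= 4 and medical-term density >= 0.1) can be pictured as a simple per-document predicate. The sketch below is only an illustration of those two thresholds: the `Doc` class and its field names `edu_quality` and `med_term_density` are hypothetical, and the actual dataset schema and filtering pipeline may differ.

```python
from dataclasses import dataclass

# Thresholds quoted in the Training section.
MIN_EDU_QUALITY = 4
MIN_TERM_DENSITY = 0.1


@dataclass
class Doc:
    text: str
    edu_quality: int          # hypothetical field name for the educational-quality score
    med_term_density: float   # hypothetical field name for the medical-term density


def is_finemed_filtered(doc: Doc) -> bool:
    """Keep a document only if it passes both thresholds."""
    return (
        doc.edu_quality >= MIN_EDU_QUALITY
        and doc.med_term_density >= MIN_TERM_DENSITY
    )


# Made-up example documents, for illustration only.
docs = [
    Doc("Compte rendu d'hospitalisation ...", edu_quality=5, med_term_density=0.22),
    Doc("Article de blog généraliste ...", edu_quality=4, med_term_density=0.03),
    Doc("Liste de termes sans contexte ...", edu_quality=2, med_term_density=0.40),
]
kept = [d for d in docs if is_finemed_filtered(d)]
print(len(kept), "of", len(docs), "documents kept")
```

Both conditions must hold, so a document that is dense in medical terms but scores low on educational quality is dropped, and vice versa.
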
DoctoModernBERT is pretrained from scratch on a mix of FineMed-filtered (FineMed-fr under the same educational-quality and medical-term-density filter) and FineMed-rephrased-fr, over three phases totaling 240B tokens:

1. **Pretraining (200B tokens).** Masked language modeling at 1024-token context on the full mix, building broad medical-language representations. Long documents are hierarchically chunked to fit the context window rather than truncated.
2. **Context extension (20B tokens).** Extends the context window from 1024 to 8192 tokens, training on a subset upsampled toward long documents.
3. **Annealing (20B tokens).** Continued training on the biomedical and clinical subdomains of the mix, focusing the final updates on content closest to downstream medical use.

## 📊 Evaluation

Across 11 encoders from English medical, French generalist, and French medical families, DoctoModernBERT-fr-base achieves the best F1 on the real-world clinical NER task and ranks among the top encoders on the academic DrBenchmark.

### DrBenchmark

We use an adapted version of [DrBenchmark](https://github.com/doctolib-lab/DrBenchmark), filtered to 7 high-quality tasks: QUAERO (NER on EMEA drug leaflets and MEDLINE abstracts, counted as two tasks), E3C (clinical and temporal NER, two tasks), DEFT-2021 (clinical NER), MorFITT (biomedical specialty classification), and DiaMed (clinical diagnostic classification). Per model and task, hyperparameters are tuned on the validation split, then the mean F1 over 5 test-split seeds is reported.

We also report two per-model scores aggregated across all tasks:

- **Min-Max** rescales each task's scores onto a 0–100 scale before averaging, capturing the magnitude of the gaps (a model that lags on a few tasks is penalized heavily).
- **WP** (Win Probability) is the average percentage of tasks on which a model outscores each other model, capturing rank consistency (robust to outliers but blind to effect size).

| Model | EMEA | MEDLINE | E3C-Clin | E3C-Temp | MorFITT | DEFT-2021 | DiaMed | Min-Max | WP |
|---|---|---|---|---|---|---|---|---|---|
| **English medical** | | | | | | | | | |
| BioBERT | 58.77 | 50.29 | 55.02 | 78.29 | 66.99 | 56.72 | 59.26 | 29.97 | 15.71 |
| BioClinical-ModernBERT | 44.74 | 44.44 | 49.53 | 76.11 | 67.42 | 53.97 | 52.07 | 0.88 | 1.43 |
| ModernBERT-bio | 56.84 | 46.60 | 53.76 | 78.85 | 68.57 | 56.43 | 61.06 | 29.35 | 17.14 |
| **French generalist** | | | | | | | | | |
| CamemBERT | 65.43 | 56.18 | 59.82 | 83.81 | 71.54 | 62.40 | 60.26 | 69.37 | 57.14 |
| ModernCamemBERT | 61.98 | 55.46 | 57.62 | 83.11 | 70.01 | 60.01 | 53.26 | 52.69 | 28.57 |
| **French medical** | | | | | | | | | |
| DrBERT | 64.37 | 57.18 | 58.01 | 82.44 | 70.42 | 61.08 | 64.87 | 65.08 | 44.29 |
| CamemBERT-bio | 64.98 | 59.03 | 61.40 | 84.88 | 71.48 | 64.73 | 64.63 | 80.83 | 70.00 |
| TransBERT-bio-fr | 67.37 | 59.96 | 62.36 | 84.48 | 74.04 | 65.48 | 70.91 | 93.88 | 88.57 |
| ModernCamemBERT-bio | 65.35 | 56.81 | 58.63 | 83.31 | 71.21 | 61.35 | 67.77 | 71.37 | 54.29 |
| **Ours** | | | | | | | | | |
| DoctoBERT-fr | 68.39 | 62.54 | 62.75 | 84.60 | 73.36 | 66.41 | 72.56 | 98.17 | 97.14 |
| DoctoModernBERT-fr | 65.71 | 59.65 | 59.62 | 84.06 | 71.87 | 63.81 | 71.60 | 83.15 | 75.71 |
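
The two aggregate columns can be recomputed from the per-task scores. The sketch below is one reading of the prose definitions, not the authors' evaluation code: it assumes Min-Max is `100 * (score - min) / (max - min)` per task, averaged over tasks, and that WP is the share of (task, opponent) pairs on which a model scores strictly higher. The scores are copied from the DrBenchmark table.

```python
# Mean F1 per task, in the order:
# EMEA, MEDLINE, E3C-Clin, E3C-Temp, MorFITT, DEFT-2021, DiaMed
SCORES = {
    "BioBERT":                [58.77, 50.29, 55.02, 78.29, 66.99, 56.72, 59.26],
    "BioClinical-ModernBERT": [44.74, 44.44, 49.53, 76.11, 67.42, 53.97, 52.07],
    "ModernBERT-bio":         [56.84, 46.60, 53.76, 78.85, 68.57, 56.43, 61.06],
    "CamemBERT":              [65.43, 56.18, 59.82, 83.81, 71.54, 62.40, 60.26],
    "ModernCamemBERT":        [61.98, 55.46, 57.62, 83.11, 70.01, 60.01, 53.26],
    "DrBERT":                 [64.37, 57.18, 58.01, 82.44, 70.42, 61.08, 64.87],
    "CamemBERT-bio":          [64.98, 59.03, 61.40, 84.88, 71.48, 64.73, 64.63],
    "TransBERT-bio-fr":       [67.37, 59.96, 62.36, 84.48, 74.04, 65.48, 70.91],
    "ModernCamemBERT-bio":    [65.35, 56.81, 58.63, 83.31, 71.21, 61.35, 67.77],
    "DoctoBERT-fr":           [68.39, 62.54, 62.75, 84.60, 73.36, 66.41, 72.56],
    "DoctoModernBERT-fr":     [65.71, 59.65, 59.62, 84.06, 71.87, 63.81, 71.60],
}


def min_max(scores):
    """Rescale each task to 0-100 across models, then average per model.

    Assumes every task has at least two distinct scores (max > min).
    """
    columns = list(zip(*scores.values()))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return {
        name: sum(
            100 * (v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs)
        ) / len(row)
        for name, row in scores.items()
    }


def win_probability(scores):
    """Percentage of (task, opponent) pairs on which a model scores strictly higher."""
    result = {}
    for name, row in scores.items():
        wins = sum(
            v > other[i]
            for other_name, other in scores.items()
            if other_name != name
            for i, v in enumerate(row)
        )
        result[name] = 100 * wins / (len(row) * (len(scores) - 1))
    return result


mm = min_max(SCORES)
wp = win_probability(SCORES)
for name in SCORES:
    print(f"{name:24s} Min-Max={mm[name]:6.2f}  WP={wp[name]:6.2f}")
```

Under this reading, DoctoBERT-fr comes out at 98.17 (Min-Max) and 97.14 (WP), matching the table. Other rows can differ in the last digit, presumably because the published aggregates were computed from unrounded per-task scores.
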

### Clinical NER

Precision, recall, and F1 on the real-world proprietary clinical NER task:

| Model | Precision | Recall | F1 |
|---|---|---|---|
| **English medical** | | | |
| BioBERT | 77.54 | 78.42 | 77.97 |
| BioClinical-ModernBERT | 78.79 | 78.69 | 78.74 |
| ModernBERT-bio | 78.06 | 79.30 | 78.67 |
| **French generalist** | | | |
| CamemBERT | 77.19 | 79.58 | 78.36 |
| ModernCamemBERT | 78.53 | 78.71 | 78.62 |
| **French medical** | | | |
| DrBERT | 76.77 | 77.81 | 77.28 |
| CamemBERT-bio | 77.51 | 78.90 | 78.19 |
| TransBERT-bio-fr | 76.85 | 78.66 | 77.74 |
| ModernCamemBERT-bio | 78.17 | 79.76 | 78.95 |
| **Ours** | | | |
| DoctoBERT-fr | 77.29 | 79.68 | 78.47 |
| DoctoModernBERT-fr | 79.12 | 79.71 | 79.40 |