---
language:
- fr
license: apache-2.0
library_name: transformers
pipeline_tag: fill-mask
tags:
- medical
- healthcare
- biomedical
- clinical
- french
- modernbert
- long-context
- fill-mask
- doctobert
datasets:
- doctolib-lab/finemed-fr
- doctolib-lab/finemed-rephrased-fr
---

# DoctoModernBERT-fr-base


## 📚 Introduction

**DoctoModernBERT-fr-base** is a French medical encoder for biomedical and clinical NLP. It uses the **ModernBERT** architecture (149M parameters, up to 8192-token context) and is pretrained from scratch on [FineMed-fr](https://huggingface.co/datasets/doctolib-lab/finemed-fr) and [FineMed-rephrased-fr](https://huggingface.co/datasets/doctolib-lab/finemed-rephrased-fr). DoctoModernBERT performs best on a real-world proprietary clinical NER task and ranks among the top encoders on the academic DrBenchmark.

Its training data is curated from heterogeneous open web corpora (FineWeb-2, FinePDFs, FineWiki), which bring scale, source and stylistic diversity, and inherited quality control. Each document is annotated along three axes: subdomain, educational quality, and medical-term density. DoctoModernBERT then trains on a mix of two parts: the filtered high-quality, entity-rich documents, and LLM-rephrased variants that raise medical-term density and vary the contexts around medical terms across many genres, audiences, and registers. This breadth and density build robustness to real-world clinical text, which is often noisy and inconsistent in style.

For a classic RoBERTa encoder, see its sibling [DoctoBERT-fr-base](https://huggingface.co/doctolib-lab/doctobert-fr-base).

## 🚀 How to Use

Requires `transformers` >= 4.48.0, the first release with ModernBERT support.

### Fill-mask

Using `AutoModelForMaskedLM`:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "doctolib-lab/doctomodernbert-fr-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Build a sentence with one masked token to predict.
text = f"Le patient souffre d'une {tokenizer.mask_token} aiguë."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so skip gradient tracking.
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the mask position and decode the highest-scoring token.
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
print(tokenizer.decode(logits[0, masked_index].argmax(-1)))
```

Using a `pipeline`:

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="doctolib-lab/doctomodernbert-fr-base")
print(fill(f"Le patient souffre d'une {fill.tokenizer.mask_token} aiguë."))
```

For long inputs, load with `attn_implementation="flash_attention_2"` for faster, more memory-efficient attention (this requires the `flash-attn` package and a supported GPU).

### Fine-tuning

DoctoModernBERT fine-tunes like any BERT/ModernBERT encoder, with the appropriate task head or framework:

- For **sequence classification**, load it with `AutoModelForSequenceClassification` (see the [text-classification guide](https://huggingface.co/docs/transformers/tasks/sequence_classification)).
- For **token classification (NER)**, use `AutoModelForTokenClassification` (see the [token-classification guide](https://huggingface.co/docs/transformers/tasks/token_classification)).
- For **embeddings / retrieval**, use [Sentence Transformers](https://www.sbert.net/) or [PyLate](https://github.com/lightonai/pylate).

## 📐 Model Overview

| Property | Value |
| --- | --- |
| Architecture | ModernBERT |
| Parameters | 149M total (110M backbone, 39M embeddings) |
| Layers | 22 |
| Hidden size | 768 |
| Attention heads | 12 |
| MLP | GeGLU |
| Intermediate size | 1152 |
| Context window | 8192 tokens |
| Vocabulary size | 50,368 |
| Language | French |

## 🔧 Training

The tokenizer is a SentencePiece BPE model of 50,368 tokens, trained on the entity-rich FineMed-filtered subset (educational quality >= 4 and medical-term density >= 0.1) for efficient tokenization of dense medical vocabulary. It splits digits into single tokens and uses byte fallback for unseen characters.

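
The 50,368-token vocabulary also accounts for the embedding share of the parameter count in the Model Overview table. The arithmetic below is a rough consistency check, not an official breakdown: it assumes the "39M embeddings" figure is the token-embedding matrix alone (vocabulary size × hidden size) and that it is counted once.

```python
# Figures taken from the Model Overview table.
vocab_size = 50_368
hidden_size = 768
total_params_millions = 149  # rounded total reported in the card

# Assumption: one token-embedding matrix of shape (vocab_size, hidden_size).
embedding_params = vocab_size * hidden_size
embedding_millions = embedding_params / 1e6

# The remainder is attributed to the encoder backbone.
backbone_millions = total_params_millions - round(embedding_millions)

print(f"embeddings: {embedding_params:,} parameters (~{embedding_millions:.1f}M)")
print(f"backbone:   ~{backbone_millions}M parameters")
```

Under these assumptions the embedding matrix holds about 38.7M parameters, which rounds to the reported 39M and leaves about 110M for the backbone.
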
DoctoModernBERT is pretrained from scratch on a mix of FineMed-filtered (FineMed-fr under the same educational-quality and medical-term-density filter) and FineMed-rephrased-fr, over three phases totaling 240B tokens:

1. **Pretraining (200B tokens).** Masked language modeling at 1024-token context on the full mix, building broad medical-language representations. Long documents are hierarchically chunked to fit the context window rather than truncated.
2. **Context extension (20B tokens).** Extends the context window from 1024 to 8192 tokens, training on a subset upsampled toward long documents.
3. **Annealing (20B tokens).** Continued training on the biomedical and clinical subdomains of the mix, focusing the final updates on content closest to downstream medical use.

## 📊 Evaluation

Across 11 encoders from English medical, French generalist, and French medical families, DoctoModernBERT-fr-base achieves the best F1 on the real-world clinical NER task and ranks among the top encoders on the academic DrBenchmark.

### DrBenchmark

We evaluate on an adapted version of [DrBenchmark](https://github.com/doctolib-lab/DrBenchmark), filtered to 7 high-quality tasks: QUAERO (NER on EMEA drug leaflets and MEDLINE abstracts, scored as two tasks), E3C (clinical and temporal NER, scored as two tasks), DEFT-2021 (clinical NER), MorFITT (biomedical specialty classification), and DiaMed (clinical diagnostic classification). For each model and task, hyperparameters are tuned on the validation split, and the reported score is the mean test-split F1 over 5 seeds.

We also report two per-model scores aggregated across all tasks:

- **Min-Max** rescales each task's scores onto a 0–100 scale before averaging, capturing the magnitude of the gaps (a model that lags on a few tasks is penalized heavily).
- **WP** (Win Probability) is the average percentage of tasks on which a model outscores each other model, capturing rank consistency (robust to outliers but blind to effect size).

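
| Model | EMEA | MEDLINE | E3C-Clin | E3C-Temp | MORFITT | DEFT2021 | DIAMED | Min-Max | WP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **English medical** | | | | | | | | | |
| BioBERT | 58.77 | 50.29 | 55.02 | 78.29 | 66.99 | 56.72 | 59.26 | 29.97 | 15.71 |
| BioClinical-ModernBERT | 44.74 | 44.44 | 49.53 | 76.11 | 67.42 | 53.97 | 52.07 | 0.88 | 1.43 |
| ModernBERT-bio | 56.84 | 46.60 | 53.76 | 78.85 | 68.57 | 56.43 | 61.06 | 29.35 | 17.14 |
| **French generalist** | | | | | | | | | |
| CamemBERT | 65.43 | 56.18 | 59.82 | 83.81 | 71.54 | 62.40 | 60.26 | 69.37 | 57.14 |
| ModernCamemBERT | 61.98 | 55.46 | 57.62 | 83.11 | 70.01 | 60.01 | 53.26 | 52.69 | 28.57 |
| **French medical** | | | | | | | | | |
| DrBERT | 64.37 | 57.18 | 58.01 | 82.44 | 70.42 | 61.08 | 64.87 | 65.08 | 44.29 |
| CamemBERT-bio | 64.98 | 59.03 | 61.40 | 84.88 | 71.48 | 64.73 | 64.63 | 80.83 | 70.00 |
| TransBERT-bio-fr | 67.37 | 59.96 | 62.36 | 84.48 | 74.04 | 65.48 | 70.91 | 93.88 | 88.57 |
| ModernCamemBERT-bio | 65.35 | 56.81 | 58.63 | 83.31 | 71.21 | 61.35 | 67.77 | 71.37 | 54.29 |
| **Ours** | | | | | | | | | |
| DoctoBERT-fr | 68.39 | 62.54 | 62.75 | 84.60 | 73.36 | 66.41 | 72.56 | 98.17 | 97.14 |
| DoctoModernBERT-fr | 65.71 | 59.65 | 59.62 | 84.06 | 71.87 | 63.81 | 71.60 | 83.15 | 75.71 |
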
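
The two aggregate scores, Min-Max and WP, can be sketched as follows. This is a minimal illustration of one reading of their definitions, not the evaluation code behind the table: the scores are hypothetical, the function names are invented for the example, and counting ties as non-wins is an assumption.

```python
from statistics import mean

# Hypothetical per-task F1 scores (NOT taken from the table above).
scores = {
    "model_a": {"task_1": 60.0, "task_2": 80.0},
    "model_b": {"task_1": 70.0, "task_2": 72.0},
    "model_c": {"task_1": 50.0, "task_2": 70.0},
}


def min_max_scores(scores):
    """Rescale each task to 0-100 across models, then average per model."""
    tasks = list(next(iter(scores.values())))
    result = {}
    for model, per_task in scores.items():
        rescaled = []
        for task in tasks:
            column = [s[task] for s in scores.values()]
            lo, hi = min(column), max(column)
            # Guard against a task where every model has the same score.
            rescaled.append(100.0 * (per_task[task] - lo) / (hi - lo) if hi > lo else 0.0)
        result[model] = mean(rescaled)
    return result


def win_probability(scores):
    """Percentage of (opponent, task) comparisons a model wins outright."""
    tasks = list(next(iter(scores.values())))
    result = {}
    for model, per_task in scores.items():
        wins = total = 0
        for other, other_scores in scores.items():
            if other == model:
                continue
            for task in tasks:
                wins += per_task[task] > other_scores[task]  # assumption: a tie is not a win
                total += 1
        result[model] = 100.0 * wins / total
    return result


print(min_max_scores(scores))
print(win_probability(scores))
```

In this toy example `model_a` and `model_b` get the same WP because each loses exactly one comparison, but `model_b` gets a lower Min-Max because it trails by a wide margin on `task_2`. The same reading is consistent with the table: DoctoBERT-fr is outscored in only 2 of the 7 × 10 = 70 comparisons (E3C-Temp by CamemBERT-bio, MORFITT by TransBERT-bio-fr), and 68/70 ≈ 97.14.
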
### Real-world Clinical NER

A proprietary French clinical NER task from a real-world production setting, annotated with 12 entity types (e.g., pathology, drug, exam, biometry) and 9 qualifiers (e.g., negation, family relationship, date). Scores are the mean over 3 test-split seeds.

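
| Model | Precision | Recall | F1 |
| --- | --- | --- | --- |
| **English medical** | | | |
| BioBERT | 77.54 | 78.42 | 77.97 |
| BioClinical-ModernBERT | 78.79 | 78.69 | 78.74 |
| ModernBERT-bio | 78.06 | 79.30 | 78.67 |
| **French generalist** | | | |
| CamemBERT | 77.19 | 79.58 | 78.36 |
| ModernCamemBERT | 78.53 | 78.71 | 78.62 |
| **French medical** | | | |
| DrBERT | 76.77 | 77.81 | 77.28 |
| CamemBERT-bio | 77.51 | 78.90 | 78.19 |
| TransBERT-bio-fr | 76.85 | 78.66 | 77.74 |
| ModernCamemBERT-bio | 78.17 | 79.76 | 78.95 |
| **Ours** | | | |
| DoctoBERT-fr | 77.29 | 79.68 | 78.47 |
| DoctoModernBERT-fr | 79.12 | 79.71 | 79.40 |
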
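
For readers reproducing this kind of table on their own data, the sketch below shows how precision, recall, and F1 can be computed per seed and then averaged. It assumes exact-match, entity-level scoring over `(start, end, label)` spans, which this card does not specify; the spans and the `prf1` helper are hypothetical.

```python
from statistics import mean


def prf1(gold, pred):
    """Exact-match precision, recall, and F1 over sets of (start, end, label) spans."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold annotations for one document.
gold = {(0, 2, "pathology"), (5, 6, "drug"), (9, 10, "exam"), (12, 13, "biometry")}

# Hypothetical predictions from three fine-tuning seeds.
runs = [
    {(0, 2, "pathology"), (5, 6, "drug"), (9, 10, "exam")},  # misses one entity
    {(0, 2, "pathology"), (5, 6, "exam"), (9, 10, "exam"), (12, 13, "biometry")},  # one wrong label
    set(gold),  # all entities correct
]

per_seed = [prf1(gold, pred) for pred in runs]

# Average each metric separately across seeds.
mean_p, mean_r, mean_f1 = (mean(values) for values in zip(*per_seed))
print(f"P={100 * mean_p:.2f} R={100 * mean_r:.2f} F1={100 * mean_f1:.2f}")
```

Because each metric is averaged across seeds independently, a reported mean F1 need not equal the harmonic mean of the reported mean precision and recall.
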
## ⚠️ Intended Use & Limitations

DoctoModernBERT is an encoder for French biomedical and clinical NLP (NER, classification, retrieval, long-document tasks), used by fine-tuning on a downstream task. It is not generative and not a medical device; its outputs must not drive clinical decisions. It was pretrained on public web, PDF, and encyclopedic medical text and reflects the biases and gaps of those sources.

## ⚖️ License

Released under Apache-2.0. DoctoModernBERT was trained on FineMed-fr and FineMed-rephrased-fr, which derive from FineWeb-2 / FinePDFs (ODC-BY 1.0) and FineWiki (CC BY-SA 4.0); please attribute those upstream sources.

## 🏛️ Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.