---
language:
- fr
license: apache-2.0
library_name: gliner2
pipeline_tag: token-classification
base_model: fastino/gliner2-multi-v1
tags:
- medical
- healthcare
- french
- finemed
- ner
- gliner
- medical-entity-recognition
---

# FineMed Medical-Entity Extractor (FR)

🤗 Blog | 📄 Paper | 💻 Code | 🌐 FineMed | 🩺 DoctoBERT

## 📚 Introduction

This is the **medical-entity extractor** used to compute the **medical-term density** axis of [FineMed-fr](https://huggingface.co/datasets/doctolib-lab/finemed-fr). Given a French medical document, it extracts medical-term spans under an **8-class** UMLS-adapted taxonomy (`disease`, `drug`, `body_part`, …). The density is then the ratio of characters inside the extracted spans to the document's total characters.

It is a [GLiNER2](https://huggingface.co/fastino/gliner2-multi-v1) model (mDeBERTa-v3 backbone, 512-token context) fine-tuned on LLM annotations, one of the three lightweight annotators behind FineMed-fr (subdomain, educational quality, medical-term density).

## 🚀 How to Use

```python
from gliner2 import GLiNER2

extractor = GLiNER2.from_pretrained("doctolib-lab/finemed-entity-extractor-fr")

# 8-class taxonomy; passing descriptions (not just the keys) improves extraction
labels = {
    "disease": "Pathological condition: disease, syndrome, infection, cancer, injury, symptom, clinical finding, mental disorder",
    "drug": "Chemical substance for therapy: prescription medication, vaccine, therapeutic compound, drug class, contrast agent",
    "body_part": "Anatomical structure: organ, tissue, bone, muscle, blood vessel, nerve, cell, body fluid, anatomical region",
    "medical_procedure": "Clinical action with methodology: surgery, diagnostic test, medical examination, laboratory test, imaging procedure",
    "molecular_marker": "Molecular entity or biochemical substance: gene, protein, enzyme, receptor, genetic variant, biochemical analyte",
    "clinical_device": "Manufactured medical object: surgical tool, implant, prosthetic, diagnostic scanner, monitoring equipment",
    "vital_function": "Physiological parameter name: heart rate, blood pressure, respiratory rate, temperature, oxygen saturation",
    "living_beings": "Non-human organism in biomedical context: bacterium, virus, fungus, parasite, pathogen, model organism",
}

text = "Le patient présente une pneumonie traitée par amoxicilline ..."

results = extractor.batch_extract_entities([text], labels, threshold=0.5)
print(results[0]["entities"])
# {"disease": ["pneumonie"], "drug": ["amoxicilline"], ...}
```

To reproduce FineMed's `medical_entity_density`, run extraction over the middle 512 tokens of each document, then divide the characters covered by the extracted spans by the document's total character count. Taking the middle window skips boilerplate at the document boundaries and keeps corpus-scale inference tractable.

## 🏷️ Entity Taxonomy

8 classes adapted from UMLS, keeping the medical-term-rich groups:

| class | covers |
| ----- | ------ |
| `disease` | disease, syndrome, infection, cancer, injury, symptom, clinical finding, mental disorder |
| `drug` | prescription medication, vaccine, therapeutic compound, drug class, contrast agent |
| `body_part` | organ, tissue, bone, muscle, blood vessel, nerve, cell, body fluid, anatomical region |
| `medical_procedure` | surgery, diagnostic test, medical examination, laboratory test, imaging procedure |
| `molecular_marker` | gene, protein, enzyme, receptor, genetic variant, biochemical analyte |
| `clinical_device` | surgical tool, implant, prosthetic, diagnostic scanner, monitoring equipment |
| `vital_function` | heart rate, blood pressure, respiratory rate, temperature, oxygen saturation |
| `living_beings` | bacterium, virus, fungus, parasite, pathogen, model organism |

## 🔧 Training

Fine-tuned from GLiNER2 on entity annotations produced by [Qwen3-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct) via a two-pass self-review (Pass 1 extracts entities, Pass 2 reviews and corrects them) over roughly 300k documents. Best configuration: training prompts without per-class descriptions, inference prompts with descriptions. Entity-group order is shuffled during annotation to mitigate position bias.
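
The shuffling of entity-group order mentioned above can be illustrated with a minimal sketch. The helper name `shuffled_groups`, the `seed` parameter, and the idea of drawing a fresh order per document are assumptions made for illustration; the actual annotation pipeline may build its prompts differently.

```python
import random

# The eight entity groups from the taxonomy table.
ENTITY_GROUPS = [
    "disease", "drug", "body_part", "medical_procedure",
    "molecular_marker", "clinical_device", "vital_function", "living_beings",
]

def shuffled_groups(groups, seed=None):
    """Return a shuffled copy of `groups` without mutating the input.

    Hypothetical helper: a seeded RNG keeps the order reproducible per
    document, which is an assumption rather than a documented detail.
    """
    rng = random.Random(seed)
    order = list(groups)
    rng.shuffle(order)
    return order

# One possible use: a different group order for each annotated document.
for doc_id in range(3):
    print(doc_id, shuffled_groups(ENTITY_GROUPS, seed=doc_id))
```

Presenting the groups in a different order for each document keeps any single class from always appearing first or last in the annotation prompt.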
The two annotation prompts are in [`medical_entity_extract_prompt.txt`](https://huggingface.co/doctolib-lab/finemed-entity-extractor-fr/blob/main/assets/medical_entity_extract_prompt.txt) (Pass 1) and [`medical_entity_review_prompt.txt`](https://huggingface.co/doctolib-lab/finemed-entity-extractor-fr/blob/main/assets/medical_entity_review_prompt.txt) (Pass 2).

## ⚠️ Intended Use & Limitations

Built to annotate French medical web text at corpus scale (to build FineMed-fr), not for clinical decision-making. It is tuned for density estimation over a 512-token window, not exhaustive document-level entity recognition.

## ⚖️ License

Apache-2.0, inherited from the [GLiNER2](https://huggingface.co/fastino/gliner2-multi-v1) base model.

## 🏛️ Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.
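
## 🧮 Density Computation Sketch

The density recipe described under How to Use (middle 512-token window, then covered characters divided by total characters) can be sketched in plain Python. This is an illustration under stated assumptions, not the FineMed implementation: the helper names are hypothetical, spans are assumed to be half-open `(start, end)` character offsets into the document, overlapping spans are assumed to be counted once, and tokenization is left to the caller.

```python
def middle_window(tokens, size=512):
    """Return the centered slice of at most `size` tokens."""
    if len(tokens) <= size:
        return list(tokens)
    start = (len(tokens) - size) // 2
    return list(tokens[start:start + size])

def covered_chars(spans):
    """Count characters covered by half-open (start, end) spans.

    Overlapping or touching spans are merged so that no character is
    counted twice (an assumption of this sketch).
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(spans):
        if cur_end is None or start > cur_end:
            # Close the previous merged span before opening a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def entity_density(text, spans):
    """Ratio of covered characters to the total character count of `text`."""
    if not text:
        return 0.0
    return covered_chars(spans) / len(text)

# Usage: locate two extracted entity strings in the text to obtain offsets.
text = "Le patient présente une pneumonie traitée par amoxicilline."
spans = []
for term in ("pneumonie", "amoxicilline"):
    start = text.find(term)
    spans.append((start, start + len(term)))
print(round(entity_density(text, spans), 3))
```

Locating entity strings with `str.find` is only a convenience for the example; it matches the first occurrence and would undercount repeated mentions.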