---
language:
- fr
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
base_model: almanach/moderncamembert-base
tags:
- medical
- healthcare
- french
- finemed
- text-classification
---

# FineMed Subdomain Classifier (FR)

Blog | Paper | Code | FineMed | DoctoBERT

## Introduction

This is the **medical-subdomain classifier** used to annotate [FineMed-fr](https://huggingface.co/datasets/doctolib-lab/finemed-fr). Given a French medical document, it predicts one of **15 medical subdomains** (e.g. *Clinical guidelines & pathways*, *Patient education & lifestyle*, *Biomedical & mechanistic science*).

It is a [ModernCamemBERT-base](https://huggingface.co/almanach/moderncamembert-base) classifier distilled from LLM teachers, and one of the three lightweight annotators behind FineMed-fr (subdomain, educational quality, medical-term density).

## How to Use

The classifier reads the document text with its URL prepended (`url + "\n\n" + text`), truncated to 8192 tokens.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "doctolib-lab/finemed-subdomain-classifier-fr"
tok = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo).eval()

url = "https://www.example.fr/article"
text = "Le diabète de type 2 est une maladie chronique ..."

# The model expects the URL, a blank line, then the document text.
inputs = tok(url + "\n\n" + text, return_tensors="pt", truncation=True, max_length=8192)

with torch.inference_mode():
    probs = model(**inputs).logits.softmax(-1)[0]

# Report the most likely subdomain and its probability.
idx = probs.argmax().item()
print(model.config.id2label[idx], round(probs[idx].item(), 3))
```

## Subdomain Taxonomy

`best_class` is one of these 15 values:

| subdomain | description |
| --------- | ----------- |
| Clinical cases & vignettes | Single-patient narratives: presentation, evaluation, management, outcomes; case-based teaching. |
| Clinical guidelines & pathways | Non-patient-specific recommendations, algorithms, and standards; named guidelines or consensus statements. |
| Patient education & lifestyle | Consumer-facing explanations and how-to advice on prevention, self-care, symptoms, diet, fitness, mental well-being. |
| Wellness, supplements & CAM | Botanicals, vitamins, supplements, complementary or alternative therapies outside mainstream clinical guidance. |
| Public health, policy & programs | Population surveillance, epidemiology, screening, laws and regulation, financing and insurance, community guidance. |
| Commercial & promotional | Marketing or sales content: pricing, booking, calls-to-action, affiliate/SEO, comparative ads, testimonials. |
| Drugs, trials & regulation | Drug development and evaluation: clinical trials, approvals and labels, PK/PD, safety monitoring, pharmacovigilance. |
| Biomedical & mechanistic science | Experimental or preclinical research: labs, omics, pathways, cell/animal models, assays, mechanisms. |
| Medical devices, diagnostics & imaging | Device or modality descriptions and clinical use; diagnostics, wearables, sensors, imaging. |
| Health IT, telemedicine & operations | EHR/EMR, data standards, interoperability, analytics, telemedicine, workflow, staffing, procurement, logistics. |
| Occupational health & safety | Workplace hazards, exposures, PPE, training, and compliance with occupational regulations. |
| Health workforce education & training | Professional curricula, CME, certification, simulation, residency/fellowship information. |
| Health services & facilities | Neutral descriptions of care-delivery models, service lines, facility capabilities, long-term/residential care. |
| Other health | Health-related content that is unclear or insufficient to classify under the other subdomains. |
| Others | Not clearly health-related, too brief, or lacking detail (e.g. navigation/boilerplate). |

## Training

The classifier is distilled from LLM teachers under a two-stage schedule, fine-tuning ModernCamemBERT-base at 8192-token input (document content + URL):

- **Stage 1**: [Qwen3-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct) labels 1M documents (high-volume supervision).
- **Stage 2**: [Qwen3-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct) labels 490k documents (high-quality supervision).

The 15-class taxonomy was built through three rounds of LLM-driven iteration; class order is shuffled during annotation to mitigate position bias. The full annotation prompt is in [`subdomain_annotation_prompt.txt`](https://huggingface.co/doctolib-lab/finemed-subdomain-classifier-fr/blob/main/assets/subdomain_annotation_prompt.txt).

## Intended Use & Limitations

Built to annotate French medical web text at corpus scale (to build FineMed-fr), not for clinical decision-making. Predictions are noisier on short or boilerplate documents, which the *Others* / *Other health* classes are meant to absorb.

## License

Apache-2.0.

## Acknowledgments

This work was granted access to the HPC resources of IDRIS (Jean Zay) under the allocations 2025-AD011016291 and 2026-A0200617487 made by GENCI.
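
## Example: ranking the top-k subdomains

When annotating at corpus scale, it may be useful to keep a few ranked candidates rather than only the top-1 label, for instance to inspect short or boilerplate documents where predictions are noisier. The following is a minimal sketch in plain Python, not part of the released code: `build_input`, `softmax`, and `top_k_labels` are hypothetical helper names, and the three-entry `demo_id2label` mapping is illustrative only. In practice the mapping would come from `model.config.id2label`, and the logits from something like `model(**inputs).logits[0].tolist()`.

```python
import math


def build_input(url: str, text: str) -> str:
    """Prepend the URL to the document text, separated by a blank line."""
    return url + "\n\n" + text


def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Illustrative mapping and logits only: the real index order is defined by
# model.config.id2label, which covers all 15 subdomains.
demo_id2label = {
    0: "Clinical cases & vignettes",
    1: "Patient education & lifestyle",
    2: "Others",
}
demo_logits = [0.2, 2.5, -1.0]

for label, p in top_k_labels(demo_logits, demo_id2label, k=2):
    print(f"{label}: {p:.3f}")
```

The helpers operate on plain lists so they can sit after any inference backend; with PyTorch tensors, `probs.topk(k)` would give the same ranking directly.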