doctolib-lab/finemed-fr
A French medical pretraining corpus, its LLM-rephrased variant, and the annotators that built them.
Note Large-scale medical web corpus, annotated along multiple quality axes: medical subdomain, educational quality, and medical-term density.
Note Large-scale synthetic medical corpus, LLM-rephrased from web text to densify medical content and diversify its contexts.
Note Medical-subdomain classifier (15 classes).
Note Educational-quality scorer (0–5).
Note Medical entity extractor (8 classes).