Beyond Majority Voting — Evaluation Harness
Evaluation code and aggregated results for the paper “Beyond Majority Voting: A New Framework for Evaluating AI Diagnostic Systems Against Expert Consensus” (Kopanichuk et al., preprint submitted to Scientific Reports, 2026).
The harness implements the paper’s relative diagnostic-quality metrics —
RPAD (Relative Precision of Algorithmic Diagnostics) and RRAD (Relative
Recall of Algorithmic Diagnostics). Instead of scoring a model against a single
"gold" label, it compares the model against a panel of physicians and normalizes
that agreement by the agreement observed among the experts themselves. A value
of 1.0 or more means the model agrees with the panel at least as well as the
experts agree with one another.
⚠️ Data note (PHI). This repository contains only code and aggregated results. The raw patient↔physician dialogues contain personal health information and are not included here. The de-identified dataset — released under the written informed consent for open-access publication described in the paper — is available at https://huggingface.co/datasets/kopan/med-eval-360.
Method
- Setup. n = 360 Russian-language telemedicine dialogues. A panel of z = 7 resident-physician experts and each evaluated LLM independently produce a bag of up to k = 3 diagnoses per case (and, separately, a routing / specialty recommendation).
- Pairwise precision & recall (src/pair_metrics.py, eqs. 1–2): P_AE@k and R_AE@k between the algorithm A and an expert E, built from the multiplicity function μ (eq. 9) and the characteristic function χ (eq. 10) implemented in src/characteristic_functions.py.
- Diagnosis matching (src/matcher.py, Table 1, match function M): clinically equivalent diagnoses worded differently must count as a match. Here matching is done via Russian morphological normalization (pymorphy3) + transliteration + a precomputed supervised match table (pair-match.json) and a normalization map (preprocessor.json). The table encodes the decisions of the supervised meta-model described in the paper (reported match-function quality: P = 0.91, R = 0.90, F1 = 0.91, accuracy = 0.98).
- Relative metrics (src/scores.py):

| code ScoreNames | paper metric | definition |
| --- | --- | --- |
| optimistic | P_opt@k, R_opt@k (eqs. 3–4) | max over experts ÷ min over expert–expert pairs |
| averaged | P_avg@k, R_avg@k (eqs. 5–6) | mean over experts ÷ mean over expert–expert pairs |
| realistic | RPAD@k, RRAD@k (eqs. 7–8) | hardness-weighted blend of the two (H ∈ [0,1]; H = 0 → optimistic, H = 1 → averaged) |

main.py runs with K_MAX = 3 and HARDNESS = 0.5.
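The three score families above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in src/scores.py: the function name relative_scores and its input lists are hypothetical, the pairwise values are made up, and the realistic score is written as a linear interpolation only because that is the simplest blend matching the stated endpoints (H = 0 → optimistic, H = 1 → averaged). Eqs. 7–8 in the paper may define the blend differently.

```python
from statistics import mean


def relative_scores(model_vs_expert, expert_vs_expert, hardness=0.5):
    """Relative scores for one metric (precision or recall) at a fixed k.

    model_vs_expert  -- pairwise values between the model and each expert
    expert_vs_expert -- pairwise values for every expert-expert pair
    hardness         -- H in [0, 1]; 0 gives optimistic, 1 gives averaged
    """
    if not 0.0 <= hardness <= 1.0:
        raise ValueError("hardness must lie in [0, 1]")
    if min(expert_vs_expert) <= 0.0:
        raise ValueError("expert-expert agreement must be positive")

    # Best model-expert agreement over the weakest expert-expert agreement.
    optimistic = max(model_vs_expert) / min(expert_vs_expert)
    # Mean model-expert agreement over mean expert-expert agreement.
    averaged = mean(model_vs_expert) / mean(expert_vs_expert)
    # ASSUMPTION: linear blend; only its two endpoints come from the README.
    realistic = (1.0 - hardness) * optimistic + hardness * averaged
    return {"optimistic": optimistic, "averaged": averaged, "realistic": realistic}


# Made-up pairwise precision values, for illustration only.
scores = relative_scores(
    model_vs_expert=[0.5, 0.25, 0.75],
    expert_vs_expert=[0.5, 0.25, 0.75, 0.5],
    hardness=0.5,
)
print(scores)
```

With these inputs the sketch gives optimistic = 3.0, averaged = 1.0 and realistic = 2.0. The gap between the first two shows why the optimistic score is an upper bound: it pairs the model's best expert with the least-agreeing pair of experts.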
Repository layout
service/                          # evaluation harness
  main.py                         # entry point: metrics for every model and k = 1..3
  src/
    config.py                     # model / metric / score enums + data paths
    matcher.py                    # diagnosis matching (morphology + supervised pair table)
    pair_metrics.py               # pairwise precision / recall / F1 (eqs. 1, 2)
    characteristic_functions.py   # μ and χ (eqs. 9, 10)
    scores.py                     # optimistic / averaged / realistic (eqs. 3–8)
    metric.py                     # final metric assembly
    text_processor.py             # top-k helper
data/output/                      # results (no PHI)
  metrics_1-360.json              # final relative metrics per model
  failures.txt                    # log of unmatched diagnosis pairs (terms only)
  preproc_failures.txt            # log of preprocessing mismatches (terms only)
Models evaluated
giga_max, giga_plus, giga_pro, qwen, deepseekr1, deepseekv3,
llama_405b, mistral, gpt4o, deepseekr1distqwen32b.
In the paper, DeepSeek-V3 is the most reliable candidate, with GigaChat-Max
and GPT-4o not significantly different.
Installation & running
cd service
uv sync
# Fetch the de-identified inputs into ../data/input/ (file names must match config.py)
hf download kopan/med-eval-360 --repo-type dataset --local-dir ../data/input
python main.py # writes ../data/output/metrics_1-360.json
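Because the downloaded file names must match config.py, a quick pre-flight check before running main.py can save a failed run. The sketch below is not part of the repository. The helper names are hypothetical, and the expected file names are an assumption pieced together from the names listed under "Input data format" and the identifiers under "Models evaluated". config.py remains the authoritative source.

```python
from pathlib import Path

# Model identifiers as listed under "Models evaluated".
MODELS = [
    "giga_max", "giga_plus", "giga_pro", "qwen", "deepseekr1", "deepseekv3",
    "llama_405b", "mistral", "gpt4o", "deepseekr1distqwen32b",
]


def expected_input_files(models=MODELS):
    """File names the harness is ASSUMED to read (check config.py to confirm)."""
    shared = ["targets_1-360.json", "pair-match.json", "preprocessor.json"]
    return shared + [f"predicts_1-360_{model}.json" for model in models]


def missing_inputs(input_dir, models=MODELS):
    """Return the expected file names that are absent from input_dir."""
    root = Path(input_dir)
    return [name for name in expected_input_files(models) if not (root / name).is_file()]
```

Calling missing_inputs("../data/input") from service/ returns an empty list when every assumed file is present. The dialogue file chats_1-360.json is left out of the check because this README does not say whether the metric computation reads it.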
Input data format (data/input/, not shipped here)
// targets_1-360.json — expert annotations (assessors "01", "02", ...)
{ "01": { "diag": { "<case>": ["diagnosis", ...] },
"doc": { "<case>": ["specialty", ...] } }, ... }
// predicts_1-360_<model>.json — model predictions
{ "diag": { "<case>": ["d1", "d2", "d3"] },
"doc": { "<case>": ["s1", "s2", "s3"] } }
// pair-match.json — precomputed match decisions for diagnosis pairs
{ "<diagnosis A>|<diagnosis B>": [proba, is_match] }
// preprocessor.json — raw → normalized diagnosis map
{ "<raw diagnosis>": "<normalized>" }
// chats_1-360.json — patient↔physician dialogues (CONTAINS PHI; de-identified version in the dataset repo)
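To make the two lookup files concrete, here is a minimal sketch of how a match decision could be read from them. It is not the logic in src/matcher.py, which also applies pymorphy3 normalization and transliteration. The function names are hypothetical, and several behaviours are assumptions: that keys in pair-match.json are built from normalized strings, that a pair may be stored in either order, that strings absent from preprocessor.json pass through unchanged, and that a pair missing from the table counts as a non-match.

```python
import json


def load_json(path):
    """Read a UTF-8 JSON file such as pair-match.json or preprocessor.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def is_match(diag_a, diag_b, pair_table, normalizer):
    """Return True if two diagnosis strings are treated as the same diagnosis."""
    # ASSUMPTION: unmapped strings are used as-is.
    norm_a = normalizer.get(diag_a, diag_a)
    norm_b = normalizer.get(diag_b, diag_b)
    if norm_a == norm_b:
        return True
    # ASSUMPTION: the table may hold the pair in either order.
    for key in (f"{norm_a}|{norm_b}", f"{norm_b}|{norm_a}"):
        if key in pair_table:
            _proba, flag = pair_table[key]  # value layout: [proba, is_match]
            return bool(flag)
    # ASSUMPTION: pairs not present in the table are non-matches.
    return False


# Tiny in-memory stand-ins for the two files (hypothetical entries).
pair_table = {
    "acute rhinitis|common cold": [0.93, 1],
    "acute rhinitis|migraine": [0.02, 0],
}
normalizer = {"Acute rhinitis.": "acute rhinitis"}

print(is_match("Acute rhinitis.", "common cold", pair_table, normalizer))
```

With the real files, pair_table and normalizer would come from load_json on pair-match.json and preprocessor.json in data/input/.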
Results (data/output/metrics_1-360.json)
For each model and k = 1..3: scores (optimistic / averaged / realistic ×
precision / recall / F1, separately for diag and doc) plus the per-pair
one_vs_one values.
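The exact nesting of metrics_1-360.json is not spelled out here, so a structure-agnostic way to inspect it is to flatten the file into dotted paths and filter by name. The sketch below assumes only that the file is ordinary nested JSON. The helper names, the sample keys and the sample values are hypothetical placeholders, not real results.

```python
import json
from pathlib import Path


def flatten(node, prefix=""):
    """Yield (dotted_path, value) for every numeric leaf in nested JSON data."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from flatten(child, f"{prefix}[{index}]")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        yield prefix, node


def select(metrics, needle):
    """Keep only the leaves whose path contains the given substring."""
    return {path: value for path, value in flatten(metrics) if needle in path}


def load_metrics(path="../data/output/metrics_1-360.json"):
    """Load the results file written by main.py (path relative to service/)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Placeholder structure and values; the real file may be nested differently.
sample = {"model_x": {"3": {"realistic": {"diag": {"precision": 1.0}},
                            "optimistic": {"diag": {"precision": 2.0}}}}}
print(select(sample, "realistic"))
```

On the real file, select(load_metrics(), "realistic") would list every leaf whose path mentions the realistic score, whatever the surrounding key order turns out to be.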
Ethics & consent
The study protocol was reviewed and approved by the Local Ethics Committee of the V.A. Almazov National Medical Research Centre, Ministry of Health of the Russian Federation (protocol No. 0310-24, 10 October 2024), and conducted in accordance with the Declaration of Helsinki. Written informed consent — including consent to publish de-identified study materials in an open-access publication — was obtained from all participants.
Citation
If you use this code, the metrics, or the dataset, please cite the paper:
@article{kopanichuk2026beyond,
  title  = {Beyond Majority Voting: A New Framework for Evaluating AI Diagnostic
            Systems Against Expert Consensus},
  author = {Kopanichuk, Ilia and Anokhin, Petr and Shaposhnikov, Vladimir and
            Makharev, Vladimir and Tsapieva, Ekaterina and Bespalov, Iaroslav and
            Gombolevskiy, Victor and Kurapeev, Dmitry and Dylov, Dmitry V. and
            Oseledets, Ivan},
  year   = {2026},
  note   = {Preprint submitted to Scientific Reports},
}
And the de-identified dataset:
@misc{medeval360,
  title        = {Med-Eval-360: De-identified Russian Telemedicine Diagnostic Dialogues},
  author       = {Kopanichuk, Ilia and Anokhin, Petr and Shaposhnikov, Vladimir and
                  Makharev, Vladimir and Tsapieva, Ekaterina and Bespalov, Iaroslav and
                  Gombolevskiy, Victor and Kurapeev, Dmitry and Dylov, Dmitry V. and
                  Oseledets, Ivan},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/datasets/kopan/med-eval-360}},
}
Contact
Corresponding author: Ilia Kopanichuk — kopanichuk@airi.net (@kopanichuk)