Beyond Majority Voting — Evaluation Harness

Evaluation code and aggregated results for the paper “Beyond Majority Voting: A New Framework for Evaluating AI Diagnostic Systems Against Expert Consensus” (Kopanichuk et al., preprint submitted to Scientific Reports, 2026).

The harness implements the paper’s relative diagnostic-quality metrics, RPAD (Relative Precision of Algorithmic Diagnostics) and RRAD (Relative Recall of Algorithmic Diagnostics). Instead of scoring a model against a single "gold" label, it compares the model against a panel of physicians and normalizes that agreement by the agreement observed among the experts themselves. A value of 1.0 or higher means the model agrees with the panel at least as well as the experts agree with one another.

⚠️ Data note (PHI). This repository contains only code and aggregated results. The raw patient↔physician dialogues contain personal health information and are not included here. The de-identified dataset — released under the written informed consent for open-access publication described in the paper — is available at https://huggingface.co/datasets/kopan/med-eval-360.

Method

  • Setup. n = 360 Russian-language telemedicine dialogues. A panel of z = 7 resident-physician experts and each evaluated LLM independently produce a bag of up to k = 3 diagnoses per case (and, separately, a routing / specialty recommendation).
  • Pairwise precision & recall (src/pair_metrics.py, eqs. 1–2): P_AE@k and R_AE@k between the algorithm A and an expert E, built from the multiplicity function μ (eq. 9) and the characteristic function χ (eq. 10) implemented in src/characteristic_functions.py.
  • Diagnosis matching (src/matcher.py, Table 1, match function M): clinically equivalent diagnoses worded differently must count as a match. Here matching is done via Russian morphological normalization (pymorphy3) + transliteration + a precomputed supervised match table (pair-match.json) and a normalization map (preprocessor.json). The table encodes the decisions of the supervised meta-model described in the paper (reported match-function quality: P = 0.91, R = 0.90, F1 = 0.91, accuracy = 0.98).
  • Relative metrics (src/scores.py):
    ScoreNames (code) | paper metric                | definition
    optimistic        | P_opt@k, R_opt@k (eqs. 3–4) | max over experts ÷ min over expert–expert pairs
    averaged          | P_avg@k, R_avg@k (eqs. 5–6) | mean over experts ÷ mean over expert–expert pairs
    realistic         | RPAD@k, RRAD@k (eqs. 7–8)   | hardness-weighted blend of the two (H ∈ [0,1]; H = 0 → optimistic, H = 1 → averaged)
    main.py runs with K_MAX = 3 and HARDNESS = 0.5.
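
The three score variants above can be illustrated with a minimal sketch on a single toy case. It is not the code in src/scores.py. Several things here are assumptions made for illustration: the hit rule in precision_recall_at_k is a simplified stand-in for μ and χ, the blend is taken to be linear in H, expert–expert pairs are treated as ordered, a zero denominator yields NaN, and every function name and all toy diagnoses are hypothetical.

```python
from itertools import permutations
from statistics import mean


def precision_recall_at_k(predicted, reference, k, is_match):
    """Pairwise precision and recall of `predicted` against `reference` at cutoff k.

    Simplified stand-in for eqs. 1-2: an item counts as a hit when it matches
    at least one item on the other side (an assumption, not the paper's mu/chi).
    """
    pred, ref = predicted[:k], reference[:k]
    if not pred or not ref:
        return 0.0, 0.0
    hits_pred = sum(any(is_match(p, r) for r in ref) for p in pred)
    hits_ref = sum(any(is_match(p, r) for p in pred) for r in ref)
    return hits_pred / len(pred), hits_ref / len(ref)


def _ratio(numerator, denominator):
    # Assumption: report NaN when the experts never agree (zero denominator).
    return numerator / denominator if denominator else float("nan")


def relative_scores(model_vs_expert, expert_vs_expert, hardness):
    """Return (optimistic, averaged, realistic) for one metric and one k."""
    optimistic = _ratio(max(model_vs_expert), min(expert_vs_expert))
    averaged = _ratio(mean(model_vs_expert), mean(expert_vs_expert))
    # Assumption: linear blend, so hardness 0 gives optimistic and 1 gives averaged.
    realistic = (1 - hardness) * optimistic + hardness * averaged
    return optimistic, averaged, realistic


def exact_match(a, b):
    # Toy match function; the harness uses the supervised matcher instead.
    return a == b


# One made-up case: three experts and one model, each with a bag of 3 diagnoses.
experts = {
    "01": ["flu", "cold", "angina"],
    "02": ["flu", "cold", "bronchitis"],
    "03": ["flu", "pneumonia", "otitis"],
}
model = ["flu", "cold", "angina"]
K, HARDNESS = 3, 0.5

# Precision of the model against each expert.
model_precision = [
    precision_recall_at_k(model, bag, K, exact_match)[0] for bag in experts.values()
]
# Precision of each expert against every other expert (ordered pairs).
expert_precision = [
    precision_recall_at_k(experts[a], experts[b], K, exact_match)[0]
    for a, b in permutations(experts, 2)
]

optimistic, averaged, realistic = relative_scores(
    model_precision, expert_precision, HARDNESS
)
print(f"optimistic={optimistic:.2f} averaged={averaged:.2f} realistic={realistic:.2f}")
```

The same call with recall values in place of precision values would give the recall-side scores. The real harness aggregates over all 360 cases rather than a single one.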

Repository layout

service/                  # evaluation harness
  main.py                 # entry point: metrics for every model and k = 1..3
  src/
    config.py             # model / metric / score enums + data paths
    matcher.py            # diagnosis matching (morphology + supervised pair table)
    pair_metrics.py       # pairwise precision / recall / F1  (eqs. 1, 2)
    characteristic_functions.py  # μ and χ                    (eqs. 9, 10)
    scores.py             # optimistic / averaged / realistic (eqs. 3–8)
    metric.py             # final metric assembly
    text_processor.py     # top-k helper
data/output/              # results (no PHI)
  metrics_1-360.json      # final relative metrics per model
  failures.txt            # log of unmatched diagnosis pairs (terms only)
  preproc_failures.txt    # log of preprocessing mismatches (terms only)

Models evaluated

giga_max, giga_plus, giga_pro, qwen, deepseekr1, deepseekv3, llama_405b, mistral, gpt4o, deepseekr1distqwen32b. In the paper, DeepSeek-V3 is the most reliable candidate, with GigaChat-Max and GPT-4o not significantly different.

Installation & running

cd service
uv sync

# Fetch the de-identified inputs into ../data/input/ (file names must match config.py)
hf download kopan/med-eval-360 --repo-type dataset --local-dir ../data/input

python main.py     # writes ../data/output/metrics_1-360.json

Input data format (data/input/, not shipped here)

// targets_1-360.json — expert annotations (assessors "01", "02", ...)
{ "01": { "diag": { "<case>": ["diagnosis", ...] },
          "doc":  { "<case>": ["specialty", ...] } }, ... }

// predicts_1-360_<model>.json — model predictions
{ "diag": { "<case>": ["d1", "d2", "d3"] },
  "doc":  { "<case>": ["s1", "s2", "s3"] } }

// pair-match.json — precomputed match decisions for diagnosis pairs
{ "<diagnosis A>|<diagnosis B>": [proba, is_match] }

// preprocessor.json — raw → normalized diagnosis map
{ "<raw diagnosis>": "<normalized>" }

// chats_1-360.json — patient↔physician dialogues (CONTAINS PHI; de-identified version in the dataset repo)
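
To show how the two lookup files could fit together, here is a minimal sketch of a match function built from a pair-match.json table and a preprocessor.json map. It is not the logic in src/matcher.py, which also applies pymorphy3 normalization and transliteration. The helper build_matcher, the fallback lowercasing, the check of both key orders, and all sample entries are assumptions for illustration only.

```python
import json

# Toy stand-ins for pair-match.json and preprocessor.json (made-up entries).
PAIR_MATCH_JSON = """
{
  "orvi|ostraya respiratornaya virusnaya infekciya": [0.97, true],
  "orvi|gastrit": [0.02, false]
}
"""
PREPROCESSOR_JSON = """
{
  "ОРВИ": "orvi",
  "Гастрит": "gastrit"
}
"""


def build_matcher(pair_table, normalization):
    """Return is_match(a, b) backed by a precomputed pair table."""

    def normalize(term):
        # Assumption: unseen terms fall back to a trimmed, lowercased form.
        return normalization.get(term, term.strip().lower())

    def is_match(a, b):
        a, b = normalize(a), normalize(b)
        if a == b:
            return True
        # Assumption: a pair may be stored in either order, so try both keys.
        for key in (f"{a}|{b}", f"{b}|{a}"):
            if key in pair_table:
                _proba, flag = pair_table[key]
                return bool(flag)
        # Pairs missing from the table are treated as non-matching here.
        return False

    return is_match


is_match = build_matcher(json.loads(PAIR_MATCH_JSON), json.loads(PREPROCESSOR_JSON))
print(is_match("ОРВИ", "ostraya respiratornaya virusnaya infekciya"))
print(is_match("ОРВИ", "Гастрит"))
```

With the real files, the two json.loads calls would be replaced by json.load on pair-match.json and preprocessor.json from data/input/.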

Results (data/output/metrics_1-360.json)

For each model and k = 1..3: scores (optimistic / averaged / realistic × precision / recall / F1, separately for diag and doc) plus the per-pair one_vs_one values.
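
Because the exact key layout of the results file is not spelled out here, a structure-agnostic way to inspect it is to flatten the nested JSON into dotted paths. The sketch below does that on a toy object. The key names and the value in the toy object are made up and should not be read as the file's real schema.

```python
import json


def flatten(node, prefix=""):
    """Yield (dotted_path, leaf_value) pairs for arbitrarily nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, node


# Hypothetical layout, for illustration only.
toy = json.loads(
    '{"some_model": {"1": {"scores": {"diag": {"realistic": {"precision": 1.0}}}}}}'
)
rows = list(flatten(toy))
for path, value in rows:
    print(path, value)
```

To inspect the real output, load ../data/output/metrics_1-360.json with json.load and pass the result to flatten in place of toy.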

Ethics & consent

The study protocol was reviewed and approved by the Local Ethics Committee of the V.A. Almazov National Medical Research Centre, Ministry of Health of the Russian Federation (protocol No. 0310-24, 10 October 2024), and conducted in accordance with the Declaration of Helsinki. Written informed consent — including consent to publish de-identified study materials in an open-access publication — was obtained from all participants.

Citation

If you use this code, the metrics, or the dataset, please cite the paper:

@article{kopanichuk2026beyond,
  title   = {Beyond Majority Voting: A New Framework for Evaluating AI Diagnostic
             Systems Against Expert Consensus},
  author  = {Kopanichuk, Ilia and Anokhin, Petr and Shaposhnikov, Vladimir and
             Makharev, Vladimir and Tsapieva, Ekaterina and Bespalov, Iaroslav and
             Gombolevskiy, Victor and Kurapeev, Dmitry and Dylov, Dmitry V. and
             Oseledets, Ivan},
  year    = {2026},
  note    = {Preprint submitted to Scientific Reports},
}

And the de-identified dataset:

@misc{medeval360,
  title        = {Med-Eval-360: De-identified Russian Telemedicine Diagnostic Dialogues},
  author       = {Kopanichuk, Ilia and Anokhin, Petr and Shaposhnikov, Vladimir and
                  Makharev, Vladimir and Tsapieva, Ekaterina and Bespalov, Iaroslav and
                  Gombolevskiy, Victor and Kurapeev, Dmitry and Dylov, Dmitry V. and
                  Oseledets, Ivan},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/datasets/kopan/med-eval-360}},
}

Contact

Corresponding author: Ilia Kopanichuk — kopanichuk@airi.net (@kopanichuk)
