BiomedBERT-Triage-ESI v42

A BiomedBERT-based clinical field extractor for Emergency Severity Index (ESI) triage classification. The model extracts structured clinical fields from free-text triage notes, which are then processed by a deterministic ESI v5 algorithm to produce ESI levels 1-5.

Key Results

Metric	Value
Algorithm-correct accuracy	91.7% (33/36)
Expert-labeled accuracy	86.1% (31/36)
Within-1 accuracy	97.2% (35/36)
High-risk recall (ESI 1-2)	92.0%
Under-triage rate	8.3%
Over-triage rate	5.6%
Inference speed	21ms/sample (MPS)

91.7% algorithm-correct: In 2 of the 5 disagreements with the expert labels, the model follows the ESI algorithm rules correctly, but the expert applied clinical judgment that overrides the algorithm (CHF-related chest pain scored ESI 3 by expert vs ESI 2 by algorithm; isolated pelvic pain with stable vitals scored ESI 3 by expert vs ESI 2 by algorithm). Counting those 2 cases as correct raises accuracy from 31/36 (86.1%) to 33/36 (91.7%).

Per-ESI Performance

ESI Level	Accuracy	Cases	Description
ESI 1 (Resuscitation)	92.9%	13/14	Cardiac arrest, respiratory failure, septic shock
ESI 2 (Emergent)	81.8%	9/11	Chest pain, stroke, active seizure, sepsis
ESI 3 (Urgent)	60.0%	3/5	2+ resources: labs, imaging, IV
ESI 4 (Less urgent)	100%	4/4	1 resource: X-ray or simple procedure
ESI 5 (Non-urgent)	100%	2/2	0 resources: med refill, suture removal

Architecture

Triage Note (free text)
    │
    ▼
┌─────────────────────────────────┐
│  BiomedBERT Encoder (110M)      │
│  [CLS] token → hidden state     │
└─────────┬───────────────────────┘
          │
    ┌─────┼─────┬─────┬─────┐
    ▼     ▼     ▼     ▼     ▼
 Symptom Flag  Pain  Arrival Resource
 Head   Head  Head   Head   Head
 (50)   (5)   (1)    (5)   (11)
    │     │     │     │     │
    ▼     ▼     ▼     ▼     ▼
┌─────────────────────────────────┐
│  Deterministic ESI v5 Engine    │
│  Step A → B1 → B2 → B3 → C → D │
│  (~250 lines pure Python)       │
└─────────────┬───────────────────┘
              ▼
         ESI 1-5 + reasoning

The model never predicts ESI directly. It extracts structured fields, and a transparent, auditable ESI algorithm makes the final decision. Every prediction comes with step-by-step reasoning.

Extraction Heads

Head	Output	Val Accuracy
Symptom	50 binary labels (chest_pain, fracture, sepsis_signs, ...)	99.7%
Resource	11 binary labels (labs, ecg, xray, iv_fluids, ...)	99.9%
Flag	5 binary flags (altered_mentation, needs_immediate_airway, ...)	—
Pain	Regression 0-10	MAE 0.09
Arrival	5-class (ambulance, walk-in, helicopter, wheelchair, unknown)	—
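
The table above can be read as five small output layers sharing one [CLS] vector. Below is a minimal sketch under stated assumptions: the class name ExtractionHeads is hypothetical, every head is taken to be a single linear layer (the Usage section only shows this for the symptom and flag heads), and clamping pain to 0-10 is an illustrative choice rather than the released implementation.

```python
import torch
import torch.nn as nn


class ExtractionHeads(nn.Module):
    """Hypothetical container for the five heads; sizes come from the table above."""

    def __init__(self, hidden_size: int = 768):  # 768 = BERT-base hidden size
        super().__init__()
        self.symptom = nn.Linear(hidden_size, 50)   # 50 binary symptom labels
        self.resource = nn.Linear(hidden_size, 11)  # 11 binary resource labels
        self.flag = nn.Linear(hidden_size, 5)       # 5 binary flags
        self.pain = nn.Linear(hidden_size, 1)       # regression, 0-10
        self.arrival = nn.Linear(hidden_size, 5)    # 5 arrival classes

    def forward(self, cls: torch.Tensor) -> dict:
        return {
            # Multi-label heads: independent sigmoid per label
            "symptom": torch.sigmoid(self.symptom(cls)),
            "resource": torch.sigmoid(self.resource(cls)),
            "flag": torch.sigmoid(self.flag(cls)),
            # Regression head: one scalar per note (clamp is an assumption)
            "pain": self.pain(cls).squeeze(-1).clamp(0, 10),
            # Multi-class head: probabilities over the arrival modes
            "arrival": torch.softmax(self.arrival(cls), dim=-1),
        }


# Random vectors stand in for the encoder's [CLS] hidden states
out = ExtractionHeads()(torch.randn(2, 768))
print({name: tuple(t.shape) for name, t in out.items()})
```

Because all heads read the same vector, one encoder pass per note is enough to produce every field the ESI engine needs from the text.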

ESI Algorithm (Deterministic)

The ESI v5 algorithm is implemented as ~250 lines of pure Python with no text extraction:

Step A: Immediate lifesaving intervention? (airway, IV resuscitation, GCS ≤ 8, SBP < 80, SpO2 < 85) → ESI 1
Step B1: High-risk symptoms? (chest_pain, sepsis_signs, stroke, seizure, GI bleed, ...) → ESI 2
Step B2: Altered mental status? → ESI 2
Step B3: Severe pain/distress? (pain 10/10, systemic pain ≥ 9) → ESI 2
Step C: Resource counting (2+ → ESI 3, 1 → ESI 4, 0 → ESI 5)
Step D: Vital sign danger zones → uptriage
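
The steps above can be sketched as a short rule cascade. This is an illustration, not the ~250-line engine itself. TriageFields and esi_level are hypothetical names. The high-risk set contains only the symptoms named in Step B1. The B3 rule is one possible reading of "pain 10/10, systemic pain ≥ 9". The Step D thresholds (HR > 100, RR > 20, SpO2 < 92) and the choice to up-triage only from ESI 3 are assumptions, since the card does not list them.

```python
from dataclasses import dataclass, field
from typing import Optional

# Only the high-risk symptoms named in Step B1; the engine's full list is longer.
HIGH_RISK = {"chest_pain", "sepsis_signs", "stroke_symptoms", "active_seizure", "gi_bleed"}
LIFESAVING_FLAGS = {"needs_immediate_airway", "needs_immediate_iv_resuscitation"}


@dataclass
class TriageFields:
    """Hypothetical input record: extracted fields plus optional vitals."""
    symptoms: set = field(default_factory=set)
    flags: set = field(default_factory=set)
    resources: set = field(default_factory=set)
    pain: float = 0.0
    gcs: Optional[int] = None
    sbp: Optional[int] = None
    spo2: Optional[int] = None
    hr: Optional[int] = None
    rr: Optional[int] = None


def below(value, limit):
    return value is not None and value < limit


def above(value, limit):
    return value is not None and value > limit


def esi_level(f: TriageFields):
    """Return (ESI level, reasoning string) following Steps A-D."""
    # Step A: lifesaving flags, GCS <= 8, SBP < 80, SpO2 < 85
    if (f.flags & LIFESAVING_FLAGS or below(f.gcs, 9)
            or below(f.sbp, 80) or below(f.spo2, 85)):
        return 1, "Step A: immediate lifesaving intervention"
    # Step B1: any high-risk symptom
    hits = f.symptoms & HIGH_RISK
    if hits:
        return 2, "Step B1: high-risk symptom (" + ", ".join(sorted(hits)) + ")"
    # Step B2: altered mental status
    if "altered_mentation" in f.flags:
        return 2, "Step B2: altered mental status"
    # Step B3: severe pain or distress (assumed reading of the rule)
    if f.pain >= 10 or ("severe_pain_distress" in f.flags and f.pain >= 9):
        return 2, "Step B3: severe pain/distress"
    # Step C: count expected resources
    n = len(f.resources)
    level = 3 if n >= 2 else 4 if n == 1 else 5
    # Step D: danger-zone vitals (assumed thresholds, applied to ESI 3 only)
    if level == 3 and (above(f.hr, 100) or above(f.rr, 20) or below(f.spo2, 92)):
        return 2, "Step D: danger-zone vitals, up-triaged from ESI 3"
    return level, f"Step C: {n} expected resource(s)"


print(esi_level(TriageFields(symptoms={"chest_pain"}, resources={"ecg", "labs"})))
print(esi_level(TriageFields(resources={"xray"})))
```

Returning the reasoning string alongside the level mirrors the card's point that every prediction comes with a traceable rule.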

Training

Dataset

113,801 records from multiple sources:

Source	Records	Format
MIMIC-IV-ED structured (gold ESI)	96,099	Compact
MIETIC narrative (from MIMIC-IV-ED)	9,629	Narrative
MIMIC-IV-ED generated narrative	5,000	Narrative
Targeted augmentation (critical care)	3,073	Narrative

Key insight: Training data must include both compact and narrative formats. A model trained only on compact text ("CC: Chest pain | HR 110 BP 130/80...") fails on narrative clinical notes ("A 63-year-old male presents to the ED via ambulance with palpitations and dizziness..."), which is the format seen in real-world use.
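
To make the format gap concrete, the sketch below renders one structured record in both styles. The record layout and the two template functions are hypothetical; the card does not describe how the generated narratives were actually produced.

```python
# Hypothetical structured record; field names are illustrative only
record = {
    "age": 63,
    "sex": "male",
    "arrival": "ambulance",
    "complaints": ["palpitations", "dizziness"],
    "hr": 110,
    "bp": "130/80",
}


def compact(r: dict) -> str:
    """Pipe-delimited style, like the structured MIMIC-IV-ED rows."""
    cc = ", ".join(r["complaints"]).capitalize()
    return f"CC: {cc} | HR {r['hr']} BP {r['bp']}"


def narrative(r: dict) -> str:
    """Sentence style, like a free-text triage note."""
    cc = " and ".join(r["complaints"])
    return (
        f"A {r['age']}-year-old {r['sex']} presents to the ED via {r['arrival']} "
        f"with {cc}. Heart rate {r['hr']}, blood pressure {r['bp']}."
    )


print(compact(record))
print(narrative(record))
```

Both strings carry the same clinical facts, yet they share few surface tokens, which is one plausible reason a model trained on only one style transfers poorly to the other.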

Training Configuration

Base model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
Epochs: 5
Batch size: 32
Learning rate: 2e-5 (with linear warmup)
Max sequence length: 256 tokens
Hardware: NVIDIA DGX Spark (GB10 GPU)
Training time: ~2.5 hours (114K records)
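
For a sense of scale, the configuration above implies the step counts below. This is a rough sketch that assumes all 113,801 records are used for training, a 10% warmup fraction, and linear decay to zero after warmup; none of these three details is stated in the card.

```python
import math

NUM_RECORDS = 113_801   # from the Dataset section
BATCH_SIZE = 32
EPOCHS = 5
PEAK_LR = 2e-5
WARMUP_FRACTION = 0.1   # assumption: not stated in the card

steps_per_epoch = math.ceil(NUM_RECORDS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(WARMUP_FRACTION * total_steps)


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then (assumed) linear decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps)
print(learning_rate(0), learning_rate(warmup_steps), learning_rate(total_steps))
```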

Training Approach

Multi-head extraction: Single BERT encoder with 5 heads trained jointly (4 classification heads plus 1 regression head for pain)
Gold labels from MIMIC-IV-ED: ESI labels from real nurse triage decisions (acuity field), not synthetic
Arrival mode from text: Labels derived from text cues in training data ("ambulance", "transferred", "walk-in")
Intervention flags from ICD codes: needs_immediate_airway labeled based on ICD codes for respiratory failure (J96), cardiac arrest (I46); needs_immediate_iv_resuscitation from sepsis (A41/R65), shock (R57)
Iterative surgical augmentation: 40+ experiments targeting specific extraction errors with MIMIC-IV-ED data
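
The ICD-based flag labeling described above could look roughly like this. The code prefixes are the ones listed in the card; the function name flags_from_icd and the prefix-matching rule are assumptions for illustration.

```python
# ICD-10 prefixes from the card: respiratory failure (J96), cardiac arrest (I46),
# sepsis (A41, R65), shock (R57)
ICD_FLAG_PREFIXES = {
    "needs_immediate_airway": ("J96", "I46"),
    "needs_immediate_iv_resuscitation": ("A41", "R65", "R57"),
}


def flags_from_icd(codes: list) -> dict:
    """Derive intervention-flag labels from a visit's ICD-10 codes (prefix match)."""
    normalized = [c.replace(".", "").upper() for c in codes]
    return {
        flag: any(code.startswith(prefixes) for code in normalized)
        for flag, prefixes in ICD_FLAG_PREFIXES.items()
    }


print(flags_from_icd(["J96.01", "I10"]))
print(flags_from_icd(["A41.9", "R65.21"]))
```

Labels built this way come from diagnosis codes rather than from the note text, so a positive flag does not guarantee that the triage note contains an explicit cue for it.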

Key Findings from 40+ Experiments

Data quality > model architecture: Label cleaning, format diversity, and targeted augmentation improved accuracy more than architectural changes (contrastive learning, fusion heads, ESI direct prediction all failed)
Multi-task dilution: Every head beyond symptom + resource + flag + pain + arrival hurts accuracy. Diagnosis, severity, ESI, and resource-count heads all degraded extraction
Narrative format gap: The single biggest improvement came from adding narrative-format training data (72.2% → 86.1%)
Rare label challenge: Intervention flags (airway, IV resuscitation) at ~4% of data are hard for BERT-base to learn reliably
Two-stage fine-tuning fails: Freezing extraction layers and training ESI head always corrupts extraction representations

Usage

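import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder from the Hub
model_dir = "vadimbelsky/biomedbert-triage-esi-v42"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
encoder = AutoModel.from_pretrained(model_dir)

# model_dir is a Hub repo ID, not a local path, so download the head weights first
heads_path = hf_hub_download(repo_id=model_dir, filename="classifier_heads.pt")
heads = torch.load(heads_path, map_location="cpu", weights_only=True)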

# Build heads
SYMPTOM_LABELS = [
    "chest_pain", "diaphoresis", "syncope", "stroke_symptoms",
    "altered_mental_status", "anaphylaxis", "sepsis_signs", "active_seizure",
    "suicidal_ideation", "homicidal", "psychotic",
    "abdominal_pain", "gi_bleed", "testicular_pain", "ovarian_torsion",
    "ectopic_pregnancy", "dka_signs", "toxic_ingestion", "post_ictal",
    "respiratory_distress", "shortness_of_breath", "wheezing",
    "headache", "nausea_vomiting", "back_pain",
    "laceration", "sprain", "fracture", "uri", "rash",
    "eye_pain", "ear_pain", "dental_pain", "wound", "burn",
    "fever", "hypothermia", "vaginal_bleeding", "urinary_symptoms",
    "active_hemorrhage",
    "dizziness", "palpitations", "cough", "diarrhea", "weakness",
    "anxiety", "dehydration", "allergic_reaction", "limb_pain", "constipation",
]
FLAG_LABELS = ["altered_mentation", "severe_pain_distress", "active_hemorrhage",
               "needs_immediate_airway", "needs_immediate_iv_resuscitation"]

h = encoder.config.hidden_size
symptom_head = nn.Linear(h, len(SYMPTOM_LABELS))
symptom_head.load_state_dict(heads["symptom_head"])
flag_head = nn.Linear(h, len(FLAG_LABELS))
flag_head.load_state_dict(heads["flag_head"])

# Extract
note = "A 52-year-old male was brought to the ED via ambulance with sepsis and hypotension. Critically hypotensive, requiring mechanical ventilation."
enc = tokenizer(note, max_length=256, padding="max_length", truncation=True, return_tensors="pt")

encoder.eval()
with torch.no_grad():
    cls = encoder(**enc).last_hidden_state[:, 0, :]
    sym_probs = torch.sigmoid(symptom_head(cls)).squeeze()
    flag_probs = torch.sigmoid(flag_head(cls)).squeeze()

symptoms = [SYMPTOM_LABELS[i] for i, p in enumerate(sym_probs) if p > 0.5]
flags = {FLAG_LABELS[i]: bool(p > 0.5) for i, p in enumerate(flag_probs)}

print(f"Symptoms: {symptoms}")
print(f"Flags: {flags}")
# → Symptoms: ['sepsis_signs', 'respiratory_distress', 'shortness_of_breath', 'fever']
# → Flags: {'needs_immediate_airway': True, 'needs_immediate_iv_resuscitation': True, ...}
# → Feed into ESI algorithm → Step A → ESI 1

Evaluation

Evaluated on 36 expert-labeled cases from MIETIC (narrative clinical cases derived from MIMIC-IV-ED, reviewed by 2-3 emergency medicine experts).

Confusion Matrix

GT\Pred   1    2   3   4   5
──────────────────────────────
  1       13   1   0   0   0
  2        0   9   1   1   0
  3        0   2   3   0   0
  4        0   0   0   4   0
  5        0   0   0   0   2

Error Analysis

Error	GT→Pred	Root Cause
Cardiac arrest	1→2	"Vital signs absent" — rare phrasing, model can't extract intervention flag
CHF chest pain	3→2	Chest pain genuinely in text. Algorithm-correct: ESI rules say chest_pain → B1 → ESI 2. Expert overrides with clinical context.
Pelvic pain	3→2	Pain 9/10 with stable vitals. Algorithm-correct: B3 fires on severe distress. Expert considers stable vitals → ESI 3.
Open fracture transfer	2→4	"Open fracture" compound term not understood. Model extracts fracture (1 resource) but misses hemorrhage severity.
Crohn's flare	2→3	Unstable extraction: sepsis_signs detection for IBD presentations is marginal at BERT-base capacity.
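
The headline metrics can be recomputed from the per-level case counts and the five errors listed above. This is a small sketch. The metric definitions are assumptions that happen to reproduce the reported numbers: under-triage means the prediction is less acute than the expert label, and high-risk recall means an expert ESI 1-2 case is still predicted as ESI 1 or 2.

```python
# Cases per expert ESI level (Per-ESI Performance table)
cases = {1: 14, 2: 11, 3: 5, 4: 4, 5: 2}

# (expert label, model prediction, algorithm-correct?) for the five listed errors
errors = [
    (1, 2, False),  # cardiac arrest
    (3, 2, True),   # CHF chest pain
    (3, 2, True),   # pelvic pain
    (2, 4, False),  # open fracture transfer
    (2, 3, False),  # Crohn's flare
]

total = sum(cases.values())
exact = total - len(errors)
algorithm_correct = exact + sum(ok for _, _, ok in errors)
within_1 = total - sum(abs(gt - pred) > 1 for gt, pred, _ in errors)
under_triage = sum(pred > gt for gt, pred, _ in errors)
over_triage = sum(pred < gt for gt, pred, _ in errors)
high_risk_total = cases[1] + cases[2]
high_risk_hit = high_risk_total - sum(gt <= 2 and pred > 2 for gt, pred, _ in errors)

print(f"Expert-labeled accuracy:    {exact}/{total} = {exact / total:.1%}")
print(f"Algorithm-correct accuracy: {algorithm_correct}/{total} = {algorithm_correct / total:.1%}")
print(f"Within-1 accuracy:          {within_1}/{total} = {within_1 / total:.1%}")
print(f"Under-triage rate:          {under_triage}/{total} = {under_triage / total:.1%}")
print(f"Over-triage rate:           {over_triage}/{total} = {over_triage / total:.1%}")
print(f"High-risk recall (ESI 1-2): {high_risk_hit}/{high_risk_total} = {high_risk_hit / high_risk_total:.1%}")
```

With 36 cases, each single error moves accuracy by about 2.8 percentage points, so the reported figures carry wide uncertainty.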

Limitations

Single-center data: Based on MIMIC-IV-ED (Beth Israel Deaconess Medical Center)
BERT-base capacity: 110M parameters limits rare pattern learning (intervention flags, compound medical terms)
Binary extraction: Can't distinguish primary vs secondary symptoms (e.g., CHF chest pain vs ACS chest pain)
English only: Trained on English clinical text
Not a medical device: Research use only. Not validated for clinical deployment.

Citation

@misc{belsky2026biomedbert-triage,
  title={BiomedBERT-Triage-ESI: Clinical Field Extraction for Emergency Triage},
  author={Belsky, Vadim},
  year={2026},
  url={https://huggingface.co/vadimbelsky/biomedbert-triage-esi-v42}
}
