BiomedBERT-Triage-ESI v42

A BiomedBERT-based clinical field extractor for Emergency Severity Index (ESI) triage classification. The model extracts structured clinical fields from free-text triage notes, which are then processed by a deterministic ESI v5 algorithm to produce ESI levels 1-5.

Key Results

Metric                       Value
Algorithm-correct accuracy   91.7% (33/36)
Expert-labeled accuracy      86.1% (31/36)
Within-1 accuracy            97.2% (35/36)
High-risk recall (ESI 1-2)   92.0%
Under-triage rate            8.3%
Over-triage rate             5.6%
Inference speed              21 ms/sample (MPS)

91.7% algorithm-correct: 2 of the 5 expert-label mismatches are cases where the model correctly follows the ESI algorithm rules, but the expert applied clinical judgment that overrides the algorithm (CHF-related chest pain scored ESI 3 by the expert vs ESI 2 by the algorithm; isolated pelvic pain with stable vitals scored ESI 3 by the expert vs ESI 2 by the algorithm). Counting those 2 cases as correct raises accuracy from 31/36 to 33/36.

Per-ESI Performance

ESI Level               Accuracy   Cases   Description
ESI 1 (Resuscitation)   92.9%      13/14   Cardiac arrest, respiratory failure, septic shock
ESI 2 (Emergent)        81.8%      9/11    Chest pain, stroke, active seizure, sepsis
ESI 3 (Urgent)          60.0%      3/5     2+ resources: labs, imaging, IV
ESI 4 (Less urgent)     100%       4/4     1 resource: X-ray or simple procedure
ESI 5 (Non-urgent)      100%       2/2     0 resources: med refill, suture removal

Architecture

Triage Note (free text)
                    │
                    ▼
┌───────────────────────────────────────┐
│  BiomedBERT Encoder (110M)            │
│  [CLS] token → hidden state           │
└───────────────────┬───────────────────┘
                    │
    ┌───────┬───────┼───────┬───────┐
    ▼       ▼       ▼       ▼       ▼
 Symptom   Flag    Pain  Arrival Resource
   Head    Head    Head    Head    Head
   (50)    (5)     (1)     (5)     (11)
    │       │       │       │       │
    ▼       ▼       ▼       ▼       ▼
┌───────────────────────────────────────┐
│  Deterministic ESI v5 Engine          │
│  Step A → B1 → B2 → B3 → C → D        │
│  (~250 lines pure Python)             │
└───────────────────┬───────────────────┘
                    ▼
           ESI 1-5 + reasoning

The model never predicts ESI directly. It extracts structured fields, and a transparent, auditable ESI algorithm makes the final decision. Every prediction comes with step-by-step reasoning.

Extraction Heads

Head       Output                                                            Validation metric
Symptom    50 binary labels (chest_pain, fracture, sepsis_signs, ...)        99.7% accuracy
Resource   11 binary labels (labs, ecg, xray, iv_fluids, ...)                99.9% accuracy
Flag       5 binary flags (altered_mentation, needs_immediate_airway, ...)   not reported
Pain       Regression 0-10                                                   MAE 0.09
Arrival    5-class (ambulance, walk-in, helicopter, wheelchair, unknown)     not reported
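How the raw head outputs become structured fields is not spelled out in the table. The following is a minimal sketch, assuming the binary heads are thresholded at 0.5 after a sigmoid (as the Usage example does), the arrival head is decoded by argmax, and the pain output is clamped to the 0-10 scale. The function `decode_heads`, its argument names, and the order of `ARRIVAL_CLASSES` are hypothetical.

```python
import math

# Class order is an assumption; the real index-to-class mapping is not documented here.
ARRIVAL_CLASSES = ["ambulance", "walk-in", "helicopter", "wheelchair", "unknown"]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def decode_heads(symptom_logits, symptom_names, arrival_logits, pain_raw, threshold=0.5):
    """Turn raw head outputs into structured fields (illustrative only)."""
    # Multi-label head: keep every label whose probability exceeds the threshold.
    symptoms = [
        name for name, z in zip(symptom_names, symptom_logits) if sigmoid(z) > threshold
    ]
    # Single-label head: pick the highest-scoring arrival class (assumed argmax decoding).
    best = max(range(len(arrival_logits)), key=arrival_logits.__getitem__)
    arrival = ARRIVAL_CLASSES[best]
    # Regression head: clamp to the valid 0-10 pain scale (assumption).
    pain = min(10.0, max(0.0, pain_raw))
    return {"symptoms": symptoms, "arrival": arrival, "pain": pain}


fields = decode_heads(
    symptom_logits=[2.1, -3.0, 0.4],
    symptom_names=["chest_pain", "fracture", "sepsis_signs"],
    arrival_logits=[1.5, 0.2, -1.0, -0.5, 0.0],
    pain_raw=10.7,
)
print(fields)
```

Only labels with positive logits survive the 0.5 threshold, so the example keeps `chest_pain` and `sepsis_signs` and drops `fracture`.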

ESI Algorithm (Deterministic)

The ESI v5 algorithm is implemented as ~250 lines of pure Python. It operates only on the extracted fields and performs no text extraction of its own:

  • Step A: Immediate lifesaving intervention? (airway, IV resuscitation, GCS ≤ 8, SBP < 80, SpO2 < 85) → ESI 1
  • Step B1: High-risk symptoms? (chest_pain, sepsis_signs, stroke, seizure, GI bleed, ...) → ESI 2
  • Step B2: Altered mental status? → ESI 2
  • Step B3: Severe pain/distress? (pain 10/10, systemic pain ≥ 9) → ESI 2
  • Step C: Resource counting (2+ → ESI 3, 1 → ESI 4, 0 → ESI 5)
  • Step D: Vital sign danger zones → uptriage
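The steps above can be sketched as a small rule cascade. This is an illustration under stated assumptions, not the actual ~250-line engine: the high-risk set contains only the symptoms named in Step B1, Step B3 is read as "pain 10/10, or the severe_pain_distress flag with pain ≥ 9", the danger-zone thresholds for Step D are not given and are passed in as a boolean, and Step D is assumed to move an ESI 3 candidate to ESI 2. `TriageFields` and `esi_level` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Partial set: only the symptoms named in Step B1 (the full list is elided above).
HIGH_RISK_SYMPTOMS = {"chest_pain", "sepsis_signs", "stroke_symptoms", "active_seizure", "gi_bleed"}
LIFESAVING_FLAGS = {"needs_immediate_airway", "needs_immediate_iv_resuscitation"}


@dataclass
class TriageFields:
    symptoms: Set[str] = field(default_factory=set)
    flags: Set[str] = field(default_factory=set)
    resources: Set[str] = field(default_factory=set)
    pain: float = 0.0
    gcs: Optional[int] = None
    sbp: Optional[int] = None
    spo2: Optional[int] = None
    vitals_in_danger_zone: bool = False  # Step D thresholds are not specified above


def esi_level(f: TriageFields) -> Tuple[int, List[str]]:
    """Return (ESI level, reasoning trace) for already-extracted fields."""
    # Step A: immediate lifesaving intervention.
    if (
        f.flags & LIFESAVING_FLAGS
        or (f.gcs is not None and f.gcs <= 8)
        or (f.sbp is not None and f.sbp < 80)
        or (f.spo2 is not None and f.spo2 < 85)
    ):
        return 1, ["Step A: immediate lifesaving intervention -> ESI 1"]
    # Step B1: any high-risk symptom.
    high_risk = sorted(f.symptoms & HIGH_RISK_SYMPTOMS)
    if high_risk:
        return 2, [f"Step B1: high-risk symptoms {high_risk} -> ESI 2"]
    # Step B2: altered mental status.
    if "altered_mentation" in f.flags:
        return 2, ["Step B2: altered mental status -> ESI 2"]
    # Step B3: severe pain or distress (assumed reading of the rule).
    if f.pain >= 10 or ("severe_pain_distress" in f.flags and f.pain >= 9):
        return 2, ["Step B3: severe pain/distress -> ESI 2"]
    # Step C: count expected resources.
    n = len(f.resources)
    level = 3 if n >= 2 else 4 if n == 1 else 5
    trace = [f"Step C: {n} expected resource(s) -> ESI {level}"]
    # Step D: danger-zone vitals uptriage an ESI 3 candidate (assumption).
    if level == 3 and f.vitals_in_danger_zone:
        level = 2
        trace.append("Step D: vital signs in danger zone -> ESI 2")
    return level, trace


level, trace = esi_level(TriageFields(resources={"labs", "xray"}, vitals_in_danger_zone=True))
print(level, trace)
```

Returning the trace alongside the level mirrors the "ESI 1-5 + reasoning" output described in the Architecture section.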

Training

Dataset

113,801 records from multiple sources:

Source                                  Records   Format
MIMIC-IV-ED structured (gold ESI)       96,099    Compact
MIETIC narrative (from MIMIC-IV-ED)     9,629     Narrative
MIMIC-IV-ED generated narrative         5,000     Narrative
Targeted augmentation (critical care)   3,073     Narrative

Key insight: Training data must include both compact and narrative formats. A model trained only on compact text ("CC: Chest pain | HR 110 BP 130/80...") fails on narrative clinical notes ("A 63-year-old male presents to the ED via ambulance with palpitations and dizziness..."), which are the real-world format.
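To make the two formats concrete, the sketch below renders one structured record both ways. The record keys and both templates are hypothetical, modeled loosely on the two quoted examples. The actual templates used to build the training data are not described here.

```python
def to_compact(rec: dict) -> str:
    """Compact rendering (hypothetical field layout)."""
    return f"CC: {rec['complaint']} | HR {rec['hr']} BP {rec['sbp']}/{rec['dbp']}"


def to_narrative(rec: dict) -> str:
    """Narrative rendering (hypothetical sentence template)."""
    return (
        f"A {rec['age']}-year-old {rec['sex']} presents to the ED "
        f"via {rec['arrival']} with {rec['complaint'].lower()}."
    )


# Illustrative record; the values are made up for the example.
record = {
    "age": 63, "sex": "male", "arrival": "ambulance",
    "complaint": "Chest pain", "hr": 110, "sbp": 130, "dbp": 80,
}

compact = to_compact(record)
narrative = to_narrative(record)
print(compact)
print(narrative)
```

Both strings carry the same chief complaint, but the surface forms share almost no tokens, which helps explain why a model trained on only one format may transfer poorly to the other.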

Training Configuration

  • Base model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
  • Epochs: 5
  • Batch size: 32
  • Learning rate: 2e-5 (with linear warmup)
  • Max sequence length: 256 tokens
  • Hardware: NVIDIA DGX Spark (GB10 GPU)
  • Training time: ~2.5 hours (114K records)
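The configuration implies a rough optimizer step count and learning-rate curve. The sketch below assumes all 113,801 records are used for training, no gradient accumulation, a warmup covering 10% of the steps, and linear decay to zero afterwards. Only "linear warmup" is stated above, so `WARMUP_FRACTION` and the decay shape are assumptions.

```python
import math

NUM_RECORDS = 113_801
BATCH_SIZE = 32
EPOCHS = 5
PEAK_LR = 2e-5
WARMUP_FRACTION = 0.1  # hypothetical; the warmup length is not stated

steps_per_epoch = math.ceil(NUM_RECORDS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = int(WARMUP_FRACTION * total_steps)


def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to zero (decay is an assumption)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps)
print(learning_rate(0), learning_rate(warmup_steps), learning_rate(total_steps))
```

Under these assumptions training runs for 17,785 optimizer steps (3,557 per epoch), with the peak rate of 2e-5 reached at step 1,778.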

Training Approach

  1. Multi-head extraction: Single BERT encoder with 5 task heads (4 classification, 1 pain regression) trained jointly
  2. Gold labels from MIMIC-IV-ED: ESI labels from real nurse triage decisions (acuity field), not synthetic
  3. Arrival mode from text: Labels derived from text cues in training data ("ambulance", "transferred", "walk-in")
  4. Intervention flags from ICD codes: needs_immediate_airway labeled based on ICD codes for respiratory failure (J96), cardiac arrest (I46); needs_immediate_iv_resuscitation from sepsis (A41/R65), shock (R57)
  5. Iterative surgical augmentation: 40+ experiments targeting specific extraction errors with MIMIC-IV-ED data
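The ICD-based flag labeling in item 4 can be sketched as a prefix lookup. The code prefixes come from the list above. The prefix-matching strategy, the dot-stripping normalization, and the name `intervention_flags` are assumptions for illustration.

```python
# ICD-10 prefixes taken from item 4 of the list above.
FLAG_ICD_PREFIXES = {
    "needs_immediate_airway": ("J96", "I46"),
    "needs_immediate_iv_resuscitation": ("A41", "R65", "R57"),
}


def intervention_flags(icd_codes):
    """Map a visit's ICD-10 codes to intervention-flag labels (illustrative)."""
    # Accept both dotted ("J96.01") and undotted ("J9601") code spellings.
    normalized = [code.replace(".", "").upper() for code in icd_codes]
    return {
        flag: any(code.startswith(prefixes) for code in normalized)
        for flag, prefixes in FLAG_ICD_PREFIXES.items()
    }


labels = intervention_flags(["J96.01", "I10"])
print(labels)
```

Labels derived this way are only as precise as the diagnosis coding, which may be one reason these rare flags are reported as hard to learn.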

Key Findings from 40+ Experiments

  • Data quality > model architecture: Label cleaning, format diversity, and targeted augmentation improved accuracy more than architectural changes (contrastive learning, fusion heads, ESI direct prediction all failed)
  • Multi-task dilution: Every head beyond symptom + resource + flag + pain + arrival hurts accuracy. Diagnosis, severity, ESI, and resource-count heads all degraded extraction
  • Narrative format gap: The single biggest improvement came from adding narrative-format training data (72.2% → 86.1%)
  • Rare label challenge: Intervention flags (airway, IV resuscitation) at ~4% of data are hard for BERT-base to learn reliably
  • Two-stage fine-tuning fails: Freezing extraction layers and training ESI head always corrupts extraction representations

Usage

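import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder from the Hugging Face Hub
model_dir = "vadimbelsky/biomedbert-triage-esi-v42"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
encoder = AutoModel.from_pretrained(model_dir)

# torch.load needs a local file path, so fetch the head weights from the repo first.
# With a local clone of the repo, point heads_path at the local file instead.
heads_path = hf_hub_download(repo_id=model_dir, filename="classifier_heads.pt")
heads = torch.load(heads_path, map_location="cpu", weights_only=True)
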
# Build heads
SYMPTOM_LABELS = [
    "chest_pain", "diaphoresis", "syncope", "stroke_symptoms",
    "altered_mental_status", "anaphylaxis", "sepsis_signs", "active_seizure",
    "suicidal_ideation", "homicidal", "psychotic",
    "abdominal_pain", "gi_bleed", "testicular_pain", "ovarian_torsion",
    "ectopic_pregnancy", "dka_signs", "toxic_ingestion", "post_ictal",
    "respiratory_distress", "shortness_of_breath", "wheezing",
    "headache", "nausea_vomiting", "back_pain",
    "laceration", "sprain", "fracture", "uri", "rash",
    "eye_pain", "ear_pain", "dental_pain", "wound", "burn",
    "fever", "hypothermia", "vaginal_bleeding", "urinary_symptoms",
    "active_hemorrhage",
    "dizziness", "palpitations", "cough", "diarrhea", "weakness",
    "anxiety", "dehydration", "allergic_reaction", "limb_pain", "constipation",
]
FLAG_LABELS = ["altered_mentation", "severe_pain_distress", "active_hemorrhage",
               "needs_immediate_airway", "needs_immediate_iv_resuscitation"]

h = encoder.config.hidden_size
symptom_head = nn.Linear(h, len(SYMPTOM_LABELS))
symptom_head.load_state_dict(heads["symptom_head"])
flag_head = nn.Linear(h, len(FLAG_LABELS))
flag_head.load_state_dict(heads["flag_head"])

# Extract
note = "A 52-year-old male was brought to the ED via ambulance with sepsis and hypotension. Critically hypotensive, requiring mechanical ventilation."
enc = tokenizer(note, max_length=256, padding="max_length", truncation=True, return_tensors="pt")

encoder.eval()
with torch.no_grad():
    cls = encoder(**enc).last_hidden_state[:, 0, :]
    sym_probs = torch.sigmoid(symptom_head(cls)).squeeze()
    flag_probs = torch.sigmoid(flag_head(cls)).squeeze()

symptoms = [SYMPTOM_LABELS[i] for i, p in enumerate(sym_probs) if p > 0.5]
flags = {FLAG_LABELS[i]: bool(p > 0.5) for i, p in enumerate(flag_probs)}

print(f"Symptoms: {symptoms}")
print(f"Flags: {flags}")
# → Symptoms: ['sepsis_signs', 'respiratory_distress', 'shortness_of_breath', 'fever']
# → Flags: {'needs_immediate_airway': True, 'needs_immediate_iv_resuscitation': True, ...}
# → Feed into ESI algorithm → Step A → ESI 1

Evaluation

Evaluated on 36 expert-labeled cases from MIETIC (narrative clinical cases derived from MIMIC-IV-ED, reviewed by 2-3 emergency medicine experts).

Confusion Matrix

GT\Pred   1    2   3   4   5
──────────────────────────────
  1       13   1   0   0   0
  2        0   9   1   1   0
  3        0   2   3   0   0
  4        0   0   0   4   0
  5        0   0   0   0   2

Error Analysis

Error                    GT→Pred   Root cause
Cardiac arrest           1→2       "Vital signs absent" is rare phrasing; the model fails to extract the intervention flag.
CHF chest pain           3→2       Chest pain is genuinely in the text. Algorithm-correct: ESI rules say chest_pain → B1 → ESI 2. The expert overrides with clinical context.
Pelvic pain              3→2       Pain 9/10 with stable vitals. Algorithm-correct: B3 fires on severe distress. The expert considers stable vitals → ESI 3.
Open fracture transfer   2→4       The compound term "open fracture" is not understood. The model extracts fracture (1 resource) but misses hemorrhage severity.
Crohn's flare            2→3       Unstable extraction: sepsis_signs detection for IBD presentations is marginal at BERT-base capacity.
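The headline metrics can be re-derived from the per-level correct counts in the Per-ESI Performance table and the five GT→Pred pairs listed above. The sketch assumes "under-triage" means a predicted level less urgent than the ground truth, and "high-risk recall" means the share of true ESI 1-2 cases predicted as ESI 1 or 2. Neither definition is stated explicitly in this card.

```python
# Correct predictions per ESI level (Per-ESI Performance table).
correct = {1: 13, 2: 9, 3: 3, 4: 4, 5: 2}
# (ground truth, prediction) pairs from the Error Analysis table.
errors = [(1, 2), (3, 2), (3, 2), (2, 4), (2, 3)]
algorithm_correct_errors = 2  # the CHF chest pain and pelvic pain cases

pairs = [(lvl, lvl) for lvl, n in correct.items() for _ in range(n)] + errors
total = len(pairs)

exact = sum(gt == pred for gt, pred in pairs)
within_one = sum(abs(gt - pred) <= 1 for gt, pred in pairs)
under_triage = sum(pred > gt for gt, pred in pairs)  # predicted less urgent than truth
over_triage = sum(pred < gt for gt, pred in pairs)   # predicted more urgent than truth
high_risk_preds = [pred for gt, pred in pairs if gt <= 2]
high_risk_recall = sum(pred <= 2 for pred in high_risk_preds) / len(high_risk_preds)

print(f"Expert-labeled accuracy: {exact}/{total} = {exact / total:.1%}")
print(f"Algorithm-correct accuracy: {(exact + algorithm_correct_errors) / total:.1%}")
print(f"Within-1 accuracy: {within_one}/{total} = {within_one / total:.1%}")
print(f"Under-triage: {under_triage / total:.1%}, over-triage: {over_triage / total:.1%}")
print(f"High-risk recall: {high_risk_recall:.1%}")
```

Under these definitions the script reproduces the Key Results figures: 86.1%, 91.7%, 97.2%, 8.3%, 5.6%, and 92.0%.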

Limitations

  • Single-center data: Based on MIMIC-IV-ED (Beth Israel Deaconess Medical Center)
  • BERT-base capacity: 110M parameters limits rare pattern learning (intervention flags, compound medical terms)
  • Binary extraction: Can't distinguish primary vs secondary symptoms (e.g., CHF chest pain vs ACS chest pain)
  • English only: Trained on English clinical text
  • Not a medical device: Research use only. Not validated for clinical deployment.

Citation

@misc{belsky2026biomedbert-triage,
  title={BiomedBERT-Triage-ESI: Clinical Field Extraction for Emergency Triage},
  author={Belsky, Vadim},
  year={2026},
  url={https://huggingface.co/vadimbelsky/biomedbert-triage-esi-v42}
}