---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- medical-llm
- clinical
- healthcare
- medicine
- medical-ai
- clinical-decision-support
- medical
- triage
- esi
- emergency-medicine
- clinical-nlp
- biomedbert
- bert
- multi-label-classification
pipeline_tag: text-classification
base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
metrics:
- accuracy
- f1
- recall
model-index:
- name: BiomedBERT-Triage-ESI-v42
  results:
  - task:
      type: text-classification
      name: ESI Triage Classification
    metrics:
    - name: Accuracy (algorithm-correct)
      type: accuracy
      value: 0.917
    - name: Accuracy (expert-labeled)
      type: accuracy
      value: 0.861
    - name: High-risk recall (ESI 1-2)
      type: recall
      value: 0.920
    - name: Within-1 accuracy
      type: accuracy
      value: 0.972
    - name: Under-triage rate
      type: accuracy
      value: 0.083
---

# BiomedBERT-Triage-ESI v42

A BiomedBERT-based clinical field extractor for Emergency Severity Index (ESI) triage classification. The model extracts structured clinical fields from free-text triage notes, which are then processed by a deterministic ESI v5 algorithm to produce ESI levels 1-5.

## Key Results

| Metric | Value |
|--------|-------|
| **Algorithm-correct accuracy** | **91.7%** (33/36) |
| Expert-labeled accuracy | 86.1% (31/36) |
| Within-1 accuracy | 97.2% (35/36) |
| High-risk recall (ESI 1-2) | 92.0% |
| Under-triage rate | 8.3% |
| Over-triage rate | 5.6% |
| Inference speed | 21 ms/sample (MPS) |

> **91.7% algorithm-correct**: 2 of the 5 "errors" are cases where the model correctly follows the ESI algorithm rules, but the expert applied clinical judgment that overrides the algorithm (CHF-related chest pain scored ESI 3 by expert vs ESI 2 by algorithm; isolated pelvic pain with stable vitals scored ESI 3 by expert vs ESI 2 by algorithm).

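
The headline figures above follow from simple counting over the 36 evaluation cases. As a rough sketch, the snippet below rebuilds (expert label, prediction) pairs from the per-level correct counts and the five misclassifications listed in the Error Analysis section, then recomputes each rate. The helper name `triage_metrics`, the way the pairs are assembled, and the reading of high-risk recall as "ESI 1-2 cases that are still predicted as ESI 1 or 2" are illustrative assumptions, not part of the released code.

```python
# Illustrative only: rebuild (expert_esi, predicted_esi) pairs from the counts
# reported in this card and recompute the headline metrics.
correct = {1: 13, 2: 9, 3: 3, 4: 4, 5: 2}          # correct predictions per ESI level
errors = [(1, 2), (3, 2), (3, 2), (2, 4), (2, 3)]  # (expert, predicted) mismatches
algorithm_correct_errors = 2                        # the two 3→2 cases

pairs = [(lvl, lvl) for lvl, n in correct.items() for _ in range(n)] + errors


def triage_metrics(pairs):
    """Compute triage rates from (expert, predicted) ESI pairs (hypothetical helper)."""
    n = len(pairs)
    high_risk = [(gt, pred) for gt, pred in pairs if gt <= 2]
    return {
        "accuracy": sum(gt == pred for gt, pred in pairs) / n,
        "within_1": sum(abs(gt - pred) <= 1 for gt, pred in pairs) / n,
        # Under-triage: predicted less urgent (higher ESI number) than the expert
        "under_triage": sum(pred > gt for gt, pred in pairs) / n,
        # Over-triage: predicted more urgent (lower ESI number) than the expert
        "over_triage": sum(pred < gt for gt, pred in pairs) / n,
        # Assumed definition: ESI 1-2 cases that remain in the ESI 1-2 band
        "high_risk_recall": sum(pred <= 2 for _, pred in high_risk) / len(high_risk),
    }


metrics = triage_metrics(pairs)
exact_matches = sum(gt == pred for gt, pred in pairs)
metrics["algorithm_correct"] = (exact_matches + algorithm_correct_errors) / len(pairs)

for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Counting this way makes the relationship between the two accuracy figures explicit: the algorithm-correct number is the expert-labeled number plus the two cases where the expert overrode the ESI rules.
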
### Per-ESI Performance

| ESI Level | Accuracy | Cases | Description |
|-----------|----------|-------|-------------|
| ESI 1 (Resuscitation) | **92.9%** | 13/14 | Cardiac arrest, respiratory failure, septic shock |
| ESI 2 (Emergent) | **81.8%** | 9/11 | Chest pain, stroke, active seizure, sepsis |
| ESI 3 (Urgent) | 60.0% | 3/5 | 2+ resources: labs, imaging, IV |
| ESI 4 (Less urgent) | **100%** | 4/4 | 1 resource: X-ray or simple procedure |
| ESI 5 (Non-urgent) | **100%** | 2/2 | 0 resources: med refill, suture removal |

## Architecture

```
           Triage Note (free text)
                      │
                      ▼
     ┌─────────────────────────────────┐
     │    BiomedBERT Encoder (110M)    │
     │   [CLS] token → hidden state    │
     └────────────────┬────────────────┘
                      │
    ┌────────┬────────┼────────┬────────┐
    ▼        ▼        ▼        ▼        ▼
 Symptom    Flag     Pain   Arrival Resource
   Head     Head     Head     Head     Head
   (50)     (5)      (1)      (5)      (11)
    │        │        │        │        │
    ▼        ▼        ▼        ▼        ▼
   ┌─────────────────────────────────────┐
   │     Deterministic ESI v5 Engine     │
   │    Step A → B1 → B2 → B3 → C → D    │
   │      (~250 lines pure Python)       │
   └──────────────────┬──────────────────┘
                      ▼
             ESI 1-5 + reasoning
```

The model **never predicts ESI directly**. It extracts structured fields, and a transparent, auditable ESI algorithm makes the final decision. Every prediction comes with step-by-step reasoning.

### Extraction Heads

| Head | Output | Validation Accuracy |
|------|--------|---------------------|
| **Symptom** | 50 binary labels (chest_pain, fracture, sepsis_signs, ...) | 99.7% |
| **Resource** | 11 binary labels (labs, ecg, xray, iv_fluids, ...) | 99.9% |
| **Flag** | 5 binary flags (altered_mentation, needs_immediate_airway, ...) | not reported |
| **Pain** | Regression 0-10 | MAE 0.09 |
| **Arrival** | 5-class (ambulance, walk-in, helicopter, wheelchair, unknown) | not reported |

### ESI Algorithm (Deterministic)

The ESI v5 algorithm is implemented as ~250 lines of pure Python. It performs no text extraction of its own and operates only on the fields produced by the heads:

- **Step A**: Immediate lifesaving intervention? (airway, IV resuscitation, GCS ≤ 8, SBP < 80 mmHg, SpO2 < 85%) → **ESI 1**
- **Step B1**: High-risk symptoms? (chest_pain, sepsis_signs, stroke, seizure, GI bleed, ...) → **ESI 2**
- **Step B2**: Altered mental status? → **ESI 2**
- **Step B3**: Severe pain/distress? (pain 10/10, systemic pain ≥ 9) → **ESI 2**
- **Step C**: Resource counting (2+ → ESI 3, 1 → ESI 4, 0 → ESI 5)
- **Step D**: Vital sign danger zones → uptriage

## Training

### Dataset

113,801 records from multiple sources:

| Source | Records | Format |
|--------|---------|--------|
| MIMIC-IV-ED structured (gold ESI) | 96,099 | Compact |
| MIETIC narrative (from MIMIC-IV-ED) | 9,629 | Narrative |
| MIMIC-IV-ED generated narrative | 5,000 | Narrative |
| Targeted augmentation (critical care) | 3,073 | Narrative |

**Key insight**: Training data must include both compact and narrative formats. A model trained only on compact text ("CC: Chest pain | HR 110 BP 130/80...") fails on narrative clinical notes ("A 63-year-old male presents to the ED via ambulance with palpitations and dizziness..."), which is the format encountered in real-world use.

### Training Configuration

- **Base model**: `microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext`
- **Epochs**: 5
- **Batch size**: 32
- **Learning rate**: 2e-5 (with linear warmup)
- **Max sequence length**: 256 tokens
- **Hardware**: NVIDIA DGX Spark (GB10 GPU)
- **Training time**: ~2.5 hours (114K records)

### Training Approach

1. **Multi-head extraction**: A single BERT encoder with five heads trained jointly (four classification heads plus the pain regression head)
2. **Gold labels from MIMIC-IV-ED**: ESI labels come from real nurse triage decisions (the acuity field), not synthetic labels
3. **Arrival mode from text**: Labels derived from text cues in the training data ("ambulance", "transferred", "walk-in")
4. **Intervention flags from ICD codes**: `needs_immediate_airway` is labeled from ICD codes for respiratory failure (J96) and cardiac arrest (I46); `needs_immediate_iv_resuscitation` from sepsis (A41/R65) and shock (R57)
5. **Iterative surgical augmentation**: 40+ experiments targeting specific extraction errors with MIMIC-IV-ED data

### Key Findings from 40+ Experiments

- **Data quality > model architecture**: Label cleaning, format diversity, and targeted augmentation improved accuracy more than architectural changes (contrastive learning, fusion heads, and direct ESI prediction all failed)
- **Multi-task dilution**: Every head added beyond symptom + resource + flag + pain + arrival hurt accuracy. Diagnosis, severity, ESI, and resource-count heads all degraded extraction
- **Narrative format gap**: The single biggest improvement came from adding narrative-format training data (72.2% → 86.1%)
- **Rare label challenge**: Intervention flags (airway, IV resuscitation) appear in only ~4% of the data and are hard for BERT-base to learn reliably
- **Two-stage fine-tuning fails**: Freezing the extraction layers and then training an ESI head corrupted the extraction representations in every attempt

## Usage

```python
import torch
import torch.nn as nn
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder from the Hub
model_dir = "vadimbelsky/biomedbert-triage-esi-v42"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
encoder = AutoModel.from_pretrained(model_dir)

# torch.load() needs a local file, so fetch classifier_heads.pt from the Hub first
# (if you cloned the repo, pass the local path to torch.load() instead)
heads_path = hf_hub_download(repo_id=model_dir, filename="classifier_heads.pt")
heads = torch.load(heads_path, map_location="cpu", weights_only=True)

# Build heads
SYMPTOM_LABELS = [
    "chest_pain", "diaphoresis", "syncope", "stroke_symptoms", "altered_mental_status",
    "anaphylaxis", "sepsis_signs", "active_seizure", "suicidal_ideation", "homicidal",
    "psychotic", "abdominal_pain", "gi_bleed", "testicular_pain", "ovarian_torsion",
    "ectopic_pregnancy", "dka_signs", "toxic_ingestion", "post_ictal", "respiratory_distress",
    "shortness_of_breath", "wheezing", "headache", "nausea_vomiting", "back_pain",
    "laceration", "sprain", "fracture", "uri", "rash",
    "eye_pain", "ear_pain", "dental_pain", "wound", "burn",
    "fever", "hypothermia", "vaginal_bleeding", "urinary_symptoms", "active_hemorrhage",
    "dizziness", "palpitations", "cough", "diarrhea", "weakness",
    "anxiety", "dehydration", "allergic_reaction", "limb_pain", "constipation",
]
FLAG_LABELS = [
    "altered_mentation", "severe_pain_distress", "active_hemorrhage",
    "needs_immediate_airway", "needs_immediate_iv_resuscitation",
]

h = encoder.config.hidden_size
symptom_head = nn.Linear(h, len(SYMPTOM_LABELS))
symptom_head.load_state_dict(heads["symptom_head"])
flag_head = nn.Linear(h, len(FLAG_LABELS))
flag_head.load_state_dict(heads["flag_head"])

# Extract
note = (
    "A 52-year-old male was brought to the ED via ambulance with sepsis and hypotension. "
    "Critically hypotensive, requiring mechanical ventilation."
)
enc = tokenizer(note, max_length=256, padding="max_length", truncation=True, return_tensors="pt")

encoder.eval()
with torch.no_grad():
    # Use the [CLS] token representation as input to every head
    cls = encoder(**enc).last_hidden_state[:, 0, :]
    sym_probs = torch.sigmoid(symptom_head(cls)).squeeze()
    flag_probs = torch.sigmoid(flag_head(cls)).squeeze()

# Threshold the sigmoid outputs at 0.5 to get binary labels
symptoms = [SYMPTOM_LABELS[i] for i, p in enumerate(sym_probs) if p > 0.5]
flags = {FLAG_LABELS[i]: bool(p > 0.5) for i, p in enumerate(flag_probs)}

print(f"Symptoms: {symptoms}")
print(f"Flags: {flags}")
# → Symptoms: ['sepsis_signs', 'respiratory_distress', 'shortness_of_breath', 'fever']
# → Flags: {'needs_immediate_airway': True, 'needs_immediate_iv_resuscitation': True, ...}
# → Feed into ESI algorithm → Step A → ESI 1
```

## Evaluation

Evaluated on 36 expert-labeled cases from MIETIC (narrative clinical cases derived from MIMIC-IV-ED, reviewed by 2-3 emergency medicine experts).

### Confusion Matrix

```
GT\Pred   1    2    3    4    5
───────────────────────────────
   1     13    1    0    0    0
   2      0    9    1    1    0
   3      0    2    3    0    0
   4      0    0    0    4    0
   5      0    0    0    0    2
```

### Error Analysis

| Error | GT→Pred | Root Cause |
|-------|---------|------------|
| Cardiac arrest | 1→2 | "Vital signs absent" is rare phrasing; the model fails to extract the intervention flag. |
| CHF chest pain | 3→2 | Chest pain is genuinely in the text. **Algorithm-correct**: ESI rules say chest_pain → B1 → ESI 2. Expert overrides with clinical context. |
| Pelvic pain | 3→2 | Pain 9/10 with stable vitals. **Algorithm-correct**: B3 fires on severe distress. Expert considers stable vitals → ESI 3. |
| Open fracture transfer | 2→4 | The compound term "open fracture" is not understood. The model extracts fracture (1 resource) but misses hemorrhage severity. |
| Crohn's flare | 2→3 | Unstable extraction: sepsis_signs detection for IBD presentations is marginal at BERT-base capacity. |

## Limitations

- **Single-center data**: Based on MIMIC-IV-ED (Beth Israel Deaconess Medical Center)
- **Small evaluation set**: All reported ESI metrics come from 36 expert-labeled cases
- **BERT-base capacity**: 110M parameters limit rare-pattern learning (intervention flags, compound medical terms)
- **Binary extraction**: Cannot distinguish primary from secondary symptoms (e.g., CHF chest pain vs ACS chest pain)
- **English only**: Trained on English clinical text
- **Not a medical device**: Research use only. Not validated for clinical deployment.

## Citation

```bibtex
@misc{belsky2026biomedbert-triage,
  title={BiomedBERT-Triage-ESI: Clinical Field Extraction for Emergency Triage},
  author={Belsky, Vadim},
  year={2026},
  url={https://huggingface.co/vadimbelsky/biomedbert-triage-esi-v42}
}
```
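
## Appendix: ESI Engine Sketch

The released ESI engine is not reproduced in this card, so the following is only a minimal sketch of how Steps A through C from the "ESI Algorithm (Deterministic)" section could be wired to the extracted fields. Everything here is an assumption made for illustration: the `ExtractedFields` container, the `esi_level` function, the short `HIGH_RISK_SYMPTOMS` subset, the way vital signs are passed in, and the simplified B3 rule (pain 10/10 or the `severe_pain_distress` flag). Step D is omitted because the card does not list its vital-sign thresholds.

```python
from dataclasses import dataclass, field
from typing import Optional, Set, Tuple

# Assumption: a small subset of the B1 high-risk symptoms named in this card
HIGH_RISK_SYMPTOMS = {"chest_pain", "sepsis_signs", "stroke_symptoms",
                      "active_seizure", "gi_bleed"}
INTERVENTION_FLAGS = {"needs_immediate_airway", "needs_immediate_iv_resuscitation"}


@dataclass
class ExtractedFields:
    """Hypothetical container for head outputs plus optional vital signs."""
    symptoms: Set[str] = field(default_factory=set)
    flags: Set[str] = field(default_factory=set)
    resources: Set[str] = field(default_factory=set)
    pain: float = 0.0
    gcs: Optional[int] = None
    sbp: Optional[int] = None   # systolic blood pressure, mmHg
    spo2: Optional[int] = None  # oxygen saturation, percent


def below(value: Optional[int], threshold: int) -> bool:
    """True only when the vital sign is present and under the threshold."""
    return value is not None and value < threshold


def esi_level(f: ExtractedFields) -> Tuple[int, str]:
    """Return (ESI level, reasoning string) for Steps A, B1, B2, B3 and C."""
    # Step A: immediate lifesaving intervention (GCS <= 8 is GCS < 9 for integers)
    if f.flags & INTERVENTION_FLAGS:
        return 1, "Step A: immediate intervention flag"
    if below(f.gcs, 9) or below(f.sbp, 80) or below(f.spo2, 85):
        return 1, "Step A: critical vital sign"
    # Step B1: any high-risk symptom
    hits = f.symptoms & HIGH_RISK_SYMPTOMS
    if hits:
        return 2, f"Step B1: high-risk symptom(s) {sorted(hits)}"
    # Step B2: altered mental status
    if "altered_mentation" in f.flags or "altered_mental_status" in f.symptoms:
        return 2, "Step B2: altered mental status"
    # Step B3 (simplified assumption): maximal pain or the distress flag
    if f.pain >= 10 or "severe_pain_distress" in f.flags:
        return 2, "Step B3: severe pain/distress"
    # Step C: count distinct expected resources
    n = len(f.resources)
    level = 3 if n >= 2 else 4 if n == 1 else 5
    return level, f"Step C: {n} resource(s)"


level, reason = esi_level(ExtractedFields(symptoms={"chest_pain"},
                                          resources={"ecg", "labs"}))
print(level, reason)
```

Returning the reasoning string alongside the level mirrors the card's point that every prediction is auditable: the step that fired is visible without inspecting the encoder.
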