Qwen3.5-9B GRPO v46 — Emergency Severity Index (ESI) Triage

A LoRA adapter for Qwen3.5-9B that classifies emergency department triage narratives into ESI levels 1–5, trained with Group Relative Policy Optimization (GRPO).

Current best result on this task in our LLM track: 77.8% exact accuracy (28/36) and 94.4% adjacent accuracy (34/36) on the 36-case MIETIC expert-annotated evaluation set.

What it does

Given a triage narrative (chief complaint, vitals, history, arrival mode), the model produces a structured response:

EXTRACTION:
- Chief complaint: ...
- Vital signs: ...
- Red flags: ...

ESI ALGORITHM:
- Step A (lifesaving): ...
- Step B (high-risk): ...
- Step C (resources): ...
- Step D (vitals): ...

ANSWER: ESI 2

The Emergency Severity Index is a 5-level triage system used in U.S. emergency departments to prioritize patients by acuity. ESI 1 = immediate lifesaving intervention; ESI 5 = no resources needed.

How we got here — the design journey

This adapter is the result of a multi-experiment iteration. The reward function in v46 was not the first attempt — it was the design that survived several failure modes. We document the failures because they shaped the final reward.

Starting point: v43 SFT baseline

We first ran supervised fine-tuning on 26K CoT-formatted triage examples generated from MIMIC-IV-ED gold labels. The SFT model could follow the EXTRACTION → ALGORITHM → ANSWER format and reached approximately 65% exact accuracy. v43 became the warm-start for all subsequent GRPO experiments.

Failure 1: v45c — reward hacking via ESI-1 collapse

Our first GRPO attempt used a naive reward: +1 for a correct prediction and 0 otherwise, with 20× oversampling of rare ESI-1 cases to address class imbalance. The model discovered an exploit: predict ESI-1 for everything. Because the reward carried no penalty for a false ESI-1, this maximized expected reward on the oversampled, ESI-1-dominated training stream. At evaluation, the model assigned the critical-emergency level even to stable patients.

Lesson: every action must have a real cost in some scenarios. A zero-penalty action becomes the dominant policy.
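The arithmetic behind this collapse can be sketched in a few lines. The class counts below are hypothetical numbers chosen only for illustration (they are not MIMIC-IV-ED statistics), and the helper names are ours. The sketch shows that under a +1/0 reward, the expected reward of a constant policy equals the share of its class in the training stream, so heavy oversampling of one class makes "always predict that class" the best constant policy.

```python
# Hypothetical class counts per 1,000 triage cases (illustration only,
# NOT taken from MIMIC-IV-ED or the v45c training data).
base_counts = {1: 60, 2: 330, 3: 530, 4: 70, 5: 10}
OVERSAMPLE = {1: 20}  # 20x oversampling of ESI-1, as described for v45c


def stream_weights(counts, oversample):
    """Return each class's share of the training stream after oversampling."""
    weighted = {k: n * oversample.get(k, 1) for k, n in counts.items()}
    total = sum(weighted.values())
    return {k: w / total for k, w in weighted.items()}


def constant_policy_reward(weights, pred):
    """Expected reward of always predicting `pred` under the +1/0 reward."""
    return sum(w * (1.0 if gold == pred else 0.0) for gold, w in weights.items())


weights = stream_weights(base_counts, OVERSAMPLE)
expected = {k: constant_policy_reward(weights, k) for k in weights}
best_constant = max(expected, key=expected.get)

for level, value in expected.items():
    print(f"always ESI-{level}: expected reward {value:.3f}")
print(f"best constant policy: always ESI-{best_constant}")
```

With these made-up counts the ESI-1 class makes up more than half of the oversampled stream, so a policy that ignores the input entirely still collects more than half of the available reward.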

Failure 2: extended training and the 0.0-zone exploit (v47)

After v46 (described below) reached 77.8% exact accuracy, we continued GRPO for another 750 steps to see whether more training would push accuracy higher. Instead, exact accuracy dropped to 66.7% while adjacent accuracy rose to 100%. The model converted v46's "free adjacent" zone into a deliberate hedging strategy: predict within ±1 of gold and collect the guaranteed format and length bonuses rather than commit to the exact answer.

Lesson: rewards that are safe in moderation become exploits at scale. Free reward zones get systematically mined.

Failure 3: closing the 0.0 zone backfired (v48)

Reacting to v47, we penalized every wrong prediction by distance: adjacent = -0.3, two levels off = -0.5, three or more levels off = -0.7. Performance crashed to 55.6% exact accuracy with a 25% no-parse rate. The model learned not to commit at all: with a no-parse penalty of -0.5 and a three-level penalty of -0.7, not answering became safer than risking a large miss. Combined with token-budget pressure during training, the model learned to keep "thinking" without ever reaching an ANSWER line.

Lesson: not answering must pay less, in expectation, than committing to a best guess. Otherwise the model opts out.

v46 (this model) — what worked

v46 sits between these failure modes. Its reward design:

Outcome	Reward
Correct, gold = ESI 1	+3.0
Correct, gold = ESI 2	+2.0
Correct, gold ≥ ESI 3	+1.0
Gold=1, pred=2 (under-triage critical)	-1.0
Gold=1, pred≥3 (severe under-triage)	-2.0
Gold=2, pred=1 (over-triage of urgent)	-0.5
Gold=3, pred=1 (over-triage 2-level)	-1.0
Gold=4, pred=1 (over-triage 3-level)	-1.5
Gold=5, pred=1 (over-triage 4-level)	-2.0
Any other wrong prediction	0.0 (safety valve)
No parseable answer	-0.5
Format bonus (EXTRACTION + ALGORITHM + ANSWER)	+0.1
Length bonus (≤300 tokens, scaled)	up to +0.3

The key insight is the 0.0 zone: a wrong prediction in which neither the gold label nor the prediction is ESI 1 (e.g., 3 when gold=2) earns no reward and no penalty. This was not a flaw. It was a safety valve that let the model commit to a near-correct answer when uncertain instead of refusing to answer. Combined with explicit asymmetric over- and under-triage penalties around ESI 1, the policy converged on accurate predictions with adjacent fallbacks rather than collapse or non-commitment.
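The outcome rows of the reward table can be written as a small function. This is a minimal sketch, not the training code: the name `outcome_reward` and the constants are ours, and the format and length bonuses are left out because the card does not specify how the length bonus is scaled.

```python
from typing import Optional

NO_PARSE_PENALTY = -0.5
CORRECT_REWARD = {1: 3.0, 2: 2.0, 3: 1.0, 4: 1.0, 5: 1.0}
# Penalty for a false ESI 1, keyed by the gold level.
FALSE_ESI1_PENALTY = {2: -0.5, 3: -1.0, 4: -1.5, 5: -2.0}


def outcome_reward(gold: int, pred: Optional[int]) -> float:
    """Outcome term of the v46 reward table (format/length bonuses omitted)."""
    if pred is None:                 # no parseable answer
        return NO_PARSE_PENALTY
    if pred == gold:                 # correct, weighted by acuity
        return CORRECT_REWARD[gold]
    if gold == 1:                    # under-triage of a critical patient
        return -1.0 if pred == 2 else -2.0
    if pred == 1:                    # over-triage to ESI 1
        return FALSE_ESI1_PENALTY[gold]
    return 0.0                       # 0.0 zone: any other wrong prediction


# The v45c exploit (always predict ESI 1) is now costly on every non-ESI-1 case.
always_esi1 = [outcome_reward(gold, 1) for gold in range(1, 6)]
print(always_esi1)
```

Under this table, a wrong answer that stays away from ESI 1 scores 0.0, which is still better than the -0.5 for no parseable answer, so committing to a nearby level is preferable to not answering.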

Training configuration

Base model: Qwen3.5-9B
Method: GRPO (Group Relative Policy Optimization) via TRL
Warm-start: v43 SFT adapter (continues from that policy, doesn't restart)
LoRA: r=32, α=32, target modules = all attention and MLP projections
Data: MIMIC-IV-ED gold (25K) + MIETIC narratives, ESI-1 oversampled 5×, ESI-2 oversampled 3×
Steps: 750
Generations per step: G = 8
Effective batch: 8 (per_device=1, grad_accum=8)
Learning rate: 8e-7 (cosine schedule)
Max completion length: 512 tokens during training
Optimizer: adamw_8bit, β1=0.9, β2=0.99, weight_decay=0.1
Gradient clip: max_grad_norm=0.1
Hardware: NVIDIA GB10, ~27 hours
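For readers who want to reproduce a similar setup, the list above maps roughly onto keyword arguments for `trl.GRPOConfig` and `peft.LoraConfig`. The sketch below keeps them as plain dictionaries so it runs without either library installed. The `target_modules` names are an assumption (the conventional Qwen projection names), since the card only says "all attention and MLP projections", and a single training device is assumed when computing the effective batch.

```python
# Keyword arguments one would pass to trl.GRPOConfig (values from the list above).
grpo_kwargs = dict(
    max_steps=750,
    num_generations=8,                 # G
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=8e-7,
    lr_scheduler_type="cosine",
    max_completion_length=512,
    optim="adamw_8bit",
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    max_grad_norm=0.1,
)

# Keyword arguments one would pass to peft.LoraConfig.
lora_kwargs = dict(
    r=32,
    lora_alpha=32,
    # ASSUMPTION: conventional projection names; verify against the actual
    # module names of the base model before use.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Effective batch on a single device (assumed): per-device batch x accumulation.
effective_batch = (grpo_kwargs["per_device_train_batch_size"]
                   * grpo_kwargs["gradient_accumulation_steps"])
# Standard LoRA scaling factor alpha / r.
lora_scale = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(effective_batch, lora_scale)
```

With these values the effective batch equals the number of generations per step, and the LoRA update is applied at unit scale (α/r = 1).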

Evaluation

On the 36-case MIETIC expert-annotated evaluation set:

Model	Exact	Adjacent (±1)	No-parse
v43 SFT (warm-start)	~65%	—	—
v46 GRPO (this model)	77.8%	94.4%	~0%
v47 GRPO (continued v46)	66.7%	100.0%	~0%
v48 GRPO (closed 0-zone)	55.6%	75.0%	25%

Eval protocol

Generated with max_new_tokens=1024, temperature=0.1, do_sample=True
Answer parsed via regex ANSWER:\s*ESI\s*([1-5]) with fallback ESI\s+([1-5])\b
Adjacent = |pred - gold| ≤ 1
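The protocol above can be sketched as a small parser and scorer. The two regular expressions are the ones listed. The function names are ours, and taking the first match of each pattern is an assumption, since the card does not say which match is used when several are present.

```python
import re
from typing import Optional, Sequence

PRIMARY = re.compile(r"ANSWER:\s*ESI\s*([1-5])")
FALLBACK = re.compile(r"ESI\s+([1-5])\b")


def parse_esi(text: str) -> Optional[int]:
    """Return the predicted ESI level, or None when nothing parses."""
    match = PRIMARY.search(text) or FALLBACK.search(text)  # first match (assumed)
    return int(match.group(1)) if match else None


def score(golds: Sequence[int], preds: Sequence[Optional[int]]) -> dict:
    """Exact, adjacent (|pred - gold| <= 1) and no-parse rates."""
    n = len(golds)
    exact = sum(p == g for g, p in zip(golds, preds))
    adjacent = sum(p is not None and abs(p - g) <= 1 for g, p in zip(golds, preds))
    no_parse = sum(p is None for p in preds)
    return {"exact": exact / n, "adjacent": adjacent / n, "no_parse": no_parse / n}


# Toy outputs, for illustration only.
outputs = [
    "EXTRACTION: ...\nANSWER: ESI 2",   # primary pattern
    "Likely ESI 3 given resources",      # fallback pattern
    "still thinking",                    # no parse
]
preds = [parse_esi(o) for o in outputs]
metrics = score([2, 4, 1], preds)
print(preds, metrics)
```

One caveat of the fallback pattern as written: it matches any "ESI N" mention, so a response that discusses a level in its reasoning but never reaches an ANSWER line would still be parsed as a prediction.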

Per-class breakdown on v46

ESI gold	Correct	Adjacent
1	strong	very strong
2	strong	strong
3	moderate	strong
4	moderate	strong
5	strong	strong

Known weaknesses

Error analysis on v46's 8 incorrect cases revealed a dominant failure pattern: the model misses rules around already-performed lifesaving interventions. Three of eight errors involve patients who arrived intubated, with a chest tube, or with a central line — clear ESI-1 by the Step A criterion — but were assigned ESI-2. The model treats these as generically sick patients rather than recognizing the intervention itself triggers ESI-1.

Other observed gaps:

Severe pain rule (pain score ≥ 7 → ESI-2 trigger): occasionally missed
Open fractures (high-risk trigger → ESI-2): occasionally undertriaged
One case appears to be ambiguous between ESI-1 and ESI-2 (label noise)

These are clinical knowledge gaps, not reasoning structure problems. Addressing them likely requires more diverse training examples of these scenarios rather than further reward tuning.

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base = "Qwen/Qwen3.5-9B"
adapter = "vadimbelsky/qwen3.5-esi-triage-grpo-v46"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
model.eval()

SYSTEM = (
    "You are an expert emergency triage nurse. "
    "Extract clinical fields, apply the ESI algorithm step by step, then state the ESI level. "
    "Be concise — stay under 150 words total."
)

case = """A 67-year-old male arrived via ambulance with sudden onset chest pain
radiating to the left arm, diaphoresis, and shortness of breath.
BP 88/60, HR 118, RR 24, SpO2 91%. History of MI and hypertension. Pain 9/10."""

prompt = tokenizer.apply_chat_template(
    [{"role": "system", "content": SYSTEM},
     {"role": "user",   "content": case}],
    tokenize=False, add_generation_prompt=True,
)
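inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=1024, temperature=0.1, do_sample=True,
    )
# Decode only the newly generated tokens so the prompt is not echoed back
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))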

Intended use and limitations

This is a research model. It is not approved for clinical use and must not be used to make actual triage decisions for patients.

The model was trained on de-identified narratives derived from MIMIC-IV-ED. It reflects the labels and distribution of that dataset, with all the biases that implies (single-institution data, retrospective acuity assignments, U.S. ED practice).

Specific limitations:

36-case evaluation set is small — accuracy estimates have wide confidence intervals (~±13 pp)
Inter-rater agreement on ESI between experienced clinicians is approximately 65–80%, so the upper bound on this task is itself uncertain
The model occasionally fails to apply specific clinical rules (intubation criterion, severe pain rule, open fracture rule) — see "Known weaknesses" above
No multilingual support; trained on English narratives only
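The width of the confidence interval mentioned in the first limitation can be checked directly. The sketch below uses 28 correct out of 36, which is what 77.8% on 36 cases implies, and computes a normal-approximation (Wald) half-width alongside a Wilson interval. The choice of these two methods is ours; the card does not say how its ±13 pp figure was derived.

```python
import math


def wald_half_width(successes: int, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval for a proportion."""
    p = successes / n
    return z * math.sqrt(p * (1 - p) / n)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval, better behaved than Wald for small n."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


half = wald_half_width(28, 36)
low, high = wilson_interval(28, 36)
print(f"Wald: 77.8% +/- {100 * half:.1f} pp")
print(f"Wilson 95% interval: {100 * low:.1f}% to {100 * high:.1f}%")
```

Either way, the interval spans well over 20 percentage points, so differences of a few points between model versions on this evaluation set should be read with caution.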

Citation and acknowledgments

If you use this model, please cite the base model (Qwen3.5) and the data sources (MIMIC-IV-ED and MIETIC). Training infrastructure and evaluation methodology developed at ScienceSoft as part of a broader medical SLM research initiative.

Training scripts and the full failure-mode analysis are documented in the project repository.
