Qwen3.5-9B GRPO v49 — ESI Triage (LoRA Adapter)

LoRA adapter (r=32, ~225 MB) for Qwen3.5-9B trained with GRPO. v49 refines v46 by adding rule-aware reward bonuses that target specific clinical-rule failures identified in v46's error analysis.

Result on MIETIC-36 (dual-mode eval):

With thinking: 77.8% exact / 100.0% adjacent
Without thinking: 77.8% exact / 100.0% adjacent

v49 is the first model in this series to combine v46's exact-match accuracy with v47's zero-dangerous-error safety profile, and the first to score identically with and without thinking.
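The card does not define its two metrics. A natural reading is that "exact" means the predicted ESI level equals the gold level, and "adjacent" means it is within one level; that reading is an assumption here. The sketch below computes both under that assumption. The labels are synthetic and are not the MIETIC-36 predictions. They only show that 28 of 36 cases correct gives the 77.8% figure.

```python
from typing import Dict, Sequence


def triage_metrics(gold: Sequence[int], pred: Sequence[int]) -> Dict[str, float]:
    """Exact and adjacent (within one ESI level) agreement; metric definitions are assumed."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and the same length")
    n = len(gold)
    exact = sum(g == p for g, p in zip(gold, pred))
    adjacent = sum(abs(g - p) <= 1 for g, p in zip(gold, pred))
    return {"exact": exact / n, "adjacent": adjacent / n}


# Synthetic labels for illustration: 28 exact, 8 off by exactly one level.
gold = [2] * 36
pred = [2] * 28 + [3] * 4 + [1] * 4
m = triage_metrics(gold, pred)
print(f"{m['exact']:.1%} exact / {m['adjacent']:.1%} adjacent")  # 77.8% exact / 100.0% adjacent
```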

For a fully merged version (no PEFT required at inference), see vadimbelsky/qwen3.5-esi-triage-grpo-v49-merged.

What changed from v46

Error triage on v46's 8 wrong cases revealed three rule-application failures:

Missed "lifesaving intervention already performed → ESI 1" (3 cases): narratives mentioning "intubated", "chest tube placed", or "central line placed" were not recognized as ESI-1 Step A triggers.
Missed severe pain rule (1 case): pain ≥ 7 should anchor ESI ≤ 2 unless ESI-1 criteria are present.
Missed open injury rule (1 case): open fractures and penetrating trauma should anchor ESI ≤ 2.
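The three rules can be expressed as an upper bound on the ESI number a case may receive. The sketch below does this with regexes, under clear assumptions. The patterns are hypothetical: they are built from the phrases quoted above and are not the patterns used in v49's training. The "pain N/10" format is also an assumed convention for how pain scores appear in case text.

```python
import re
from typing import Optional

# Hypothetical patterns derived from the phrases quoted in the error analysis.
LIFESAVING_DONE = re.compile(r"\bintubat|chest tube|central line", re.IGNORECASE)
OPEN_INJURY = re.compile(r"open fracture|penetrating", re.IGNORECASE)
# Assumed pain format, e.g. "pain 8/10" or "pain score of 9/10".
PAIN_SCORE = re.compile(r"pain\s*(?:score\s*)?(?:of\s*)?(\d{1,2})\s*/\s*10", re.IGNORECASE)


def rule_ceiling(case_text: str) -> Optional[int]:
    """Return the highest ESI number a rule allows, or None if no rule fires."""
    if LIFESAVING_DONE.search(case_text):
        return 1  # lifesaving intervention already performed -> ESI 1
    pain = PAIN_SCORE.search(case_text)
    if OPEN_INJURY.search(case_text) or (pain and int(pain.group(1)) >= 7):
        return 2  # open injury or severe pain -> ESI <= 2
    return None
```

The ESI-1 check runs first, so a case matching several rules gets the most urgent ceiling. This mirrors the "unless ESI-1 criteria are present" clause of the pain rule.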

v49 adds rule-aware reward bonuses:

Trigger in case text (regex)	Reward modifier
`intubat|chest tube|…`	…
`intubat|chest tube|…`	…
`open fracture|penetrating|…`	…
Pain ≥ 7 + gold = 2 + pred > 2	−0.5
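Only the pain row above has a recoverable modifier value, so this sketch implements just that row: −0.5 when pain ≥ 7, gold = 2, and pred > 2. Two details are assumptions. The pain-score regex is hypothetical. The choice to return 0.0 for unparseable predictions assumes those are handled only by the separate no-parse penalty.

```python
import re
from typing import Optional

# Hypothetical pain-score pattern, e.g. "pain 8/10".
PAIN_RE = re.compile(r"pain\s*(?:score\s*)?(?:of\s*)?(\d{1,2})\s*/\s*10", re.IGNORECASE)


def pain_rule_modifier(case_text: str, gold: int, pred: Optional[int]) -> float:
    """Pain >= 7 with gold ESI 2 and a less urgent prediction (> 2) costs -0.5."""
    if pred is None:
        return 0.0  # assumption: no-parse outputs are penalized elsewhere
    m = PAIN_RE.search(case_text)
    if m and int(m.group(1)) >= 7 and gold == 2 and pred > 2:
        return -0.5
    return 0.0
```

In a full reward, this modifier would be added to the base correctness reward for the same completion.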

Other changes (informed by v48's failure):

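Training completion budget raised from 512 to 1024 tokens (matching the eval budget and sharply reducing clipping)
No-parse penalty hardened from −0.5 to −2.0, so that an answer with no parseable ESI level always scores below any wrong but committed answer
Warm-started from v46 and trained for 300 steps at LR 2e-7 (refinement, not relearning)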
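The requirement that the no-parse penalty "dominate every wrong commitment" can be checked arithmetically. The sketch below makes several assumptions. The card gives the −2.0 penalty and the −0.5 pain-rule modifier, but not v49's base reward. The base scheme used here (+1.0 for exact, −0.25 per level of error) is therefore hypothetical. The answer-parsing regex is hypothetical too.

```python
import re
from typing import Optional

NO_PARSE_PENALTY = -2.0  # value stated in the card
# Hypothetical base reward scheme, for illustration only.
EXACT_REWARD = 1.0
PER_LEVEL_PENALTY = -0.25

# Hypothetical answer format: "ESI 2", "ESI level 2", "ESI level: 2".
ESI_RE = re.compile(r"ESI(?:\s*level)?\s*[:=]?\s*([1-5])\b", re.IGNORECASE)


def parse_esi(completion: str) -> Optional[int]:
    """Return the last ESI level (1-5) stated in the completion, or None."""
    matches = ESI_RE.findall(completion)
    return int(matches[-1]) if matches else None


def base_reward(completion: str, gold: int) -> float:
    pred = parse_esi(completion)
    if pred is None:
        return NO_PARSE_PENALTY
    if pred == gold:
        return EXACT_REWARD
    return PER_LEVEL_PENALTY * abs(pred - gold)


# Conservative floor for any committed answer: maximum distance (4 levels)
# plus the -0.5 pain-rule modifier. No-parse must stay strictly below it.
worst_committed = PER_LEVEL_PENALTY * 4 + (-0.5)
assert NO_PARSE_PENALTY < worst_committed
```

With this scheme the margin is 0.5. The real margin depends on the other rule modifiers, whose values are not recoverable from the table above.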

Training metrics

v49 was the cleanest GRPO run of this series:

clipped_ratio held at 4–19% throughout (vs v48's 90%+)
reward positive from step 10, peaked at +0.49
reward_std consistently ~1.0+ (strong GRPO learning signal)
300 steps in 17h 48m on NVIDIA GB10
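The sketch below shows roughly how these two diagnostics are commonly defined in GRPO trainers. Treat the exact formulas logged for v49 as an assumption, since implementations differ (for example, in population vs. sample std). clipped_ratio is the share of sampled completions that hit the token budget. GRPO standardizes each reward against the other completions for the same prompt. A group with zero reward spread therefore yields all-zero advantages and no learning signal, which is why a reward_std near 1.0 is a healthy sign.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def clipped_ratio(completion_lengths: Sequence[int], max_completion_length: int = 1024) -> float:
    """Fraction of sampled completions truncated at the token budget."""
    return sum(n >= max_completion_length for n in completion_lengths) / len(completion_lengths)


def group_advantages(rewards: Sequence[float], eps: float = 1e-4) -> List[float]:
    """Group-relative advantages: rewards for one prompt standardized within their group."""
    mu, sigma = mean(rewards), pstdev(rewards)  # population std; some trainers use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


print(clipped_ratio([300, 1024, 800, 1024]))  # 0.5
print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # [0.0, 0.0, 0.0, 0.0] -> no signal
```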

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base    = "Qwen/Qwen3.5-9B"
adapter = "vadimbelsky/qwen3.5-esi-triage-grpo-v49"

# Load the base model in bf16, then attach the LoRA adapter on top of it.
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
model.eval()

SYSTEM = (
    "You are an expert emergency triage nurse. "
    "Extract clinical fields, apply the ESI algorithm step by step, then state the ESI level. "
    "Be concise — stay under 150 words total."
)

case = ("A 78-year-old female arrived intubated for airway protection. "
        "Central line placed. BP 120/58, HR 150, RR 20, SpO2 97%.")

# Build the chat-formatted prompt and leave it open for the assistant's reply.
prompt = tokenizer.apply_chat_template(
    [{"role": "system", "content": SYSTEM},
     {"role": "user",   "content": case}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# 1024 new tokens matches the training/eval budget; near-greedy sampling.
out = model.generate(**inputs, max_new_tokens=1024, temperature=0.1, do_sample=True)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
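Since v49 is evaluated both with and without thinking, a scorer needs the final ESI level to come from the answer, not from the reasoning. The helper below is a hypothetical sketch and is not part of the released adapter. It assumes that any reasoning ends with a literal "</think>" tag in the decoded text and that the answer states "ESI N" or "ESI level N". Qwen3-family chat templates usually disable thinking through `enable_thinking=False` passed to `apply_chat_template`. Whether the Qwen3.5 template honors the same flag is an assumption worth checking.

```python
import re
from typing import Optional

# Hypothetical answer format: "ESI 1", "ESI level 1", "ESI level: 1".
FINAL_ESI_RE = re.compile(r"ESI(?:\s*level)?\s*[:=]?\s*([1-5])\b", re.IGNORECASE)


def final_esi(decoded: str) -> Optional[int]:
    """Return the last ESI level stated after any thinking block, or None."""
    # Assumption: reasoning, when present, is closed by a literal "</think>" tag.
    answer = decoded.rsplit("</think>", 1)[-1]
    matches = FINAL_ESI_RE.findall(answer)
    return int(matches[-1]) if matches else None
```

Ignoring text before "</think>" keeps a level floated during reasoning from being scored as the answer.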

Limitations

Research model. Not approved for clinical use. The rule bonuses target specific regex patterns that failed in v46's evaluation, and they are not expected to generalize beyond those patterns. See the v46 model card for the full design journey and the failure-mode lessons that shaped v49.


Model tree for vadimbelsky/qwen3.5-esi-triage-grpo-v49

Base model: Qwen/Qwen3.5-9B-Base
Finetuned: Qwen/Qwen3.5-9B
Adapter: this model