Qwen3.5-9B GRPO v49: ESI Triage (LoRA Adapter)

LoRA adapter (r=32, ~225 MB) for Qwen3.5-9B trained with GRPO. v49 refines v46 by adding rule-aware reward bonuses that target specific clinical-rule failures identified in v46's error analysis.

Results on MIETIC-36 (dual-mode eval):

  • With thinking: 77.8% exact / 100.0% adjacent
  • Without thinking: 77.8% exact / 100.0% adjacent
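
The two figures per mode can be read as follows. The card does not define the metrics, so this minimal sketch assumes "exact" means the predicted ESI level equals the gold level and "adjacent" means it is within one level of gold. The function name and sample labels are illustrative, not taken from the evaluation code.

```python
def esi_accuracy(gold, pred):
    """Return (exact, adjacent) accuracy as fractions in [0, 1].

    Assumed definitions (not stated in the card):
      exact    = prediction equals the gold ESI level
      adjacent = prediction is within one level of gold
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equal length")
    n = len(gold)
    exact = sum(g == p for g, p in zip(gold, pred)) / n
    adjacent = sum(abs(g - p) <= 1 for g, p in zip(gold, pred)) / n
    return exact, adjacent


# Made-up labels for illustration only; not MIETIC-36 data.
gold = [1, 2, 2, 3, 4]
pred = [1, 2, 3, 3, 2]
exact, adjacent = esi_accuracy(gold, pred)
print(f"exact={exact:.1%} adjacent={adjacent:.1%}")  # exact=60.0% adjacent=80.0%
```

On a 36-case set, 77.8% exact corresponds to 28 correct predictions.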

v49 is the first model in this series to combine v46's exact accuracy with v47's zero-dangerous-error safety profile, and the first to produce identical results across thinking modes.

For a full merged version (no PEFT required at inference), see vadimbelsky/qwen3.5-esi-triage-grpo-v49-merged.


What changed from v46

Error triage on v46's 8 wrong cases revealed three rule-application failures:

  1. Missed "lifesaving intervention already performed → ESI 1" (3 cases): narratives with "intubated", "chest tube placed", or "central line placed" were not recognized as ESI-1 Step A triggers.
  2. Missed severe pain rule (1 case): pain ≥ 7 should anchor ESI ≤ 2 unless ESI-1 criteria are present.
  3. Missed open injury rule (1 case): open fractures and penetrating trauma should anchor ESI ≤ 2.

v49 adds rule-aware reward bonuses:

Trigger in case text (regex)              Reward modifier
intubat […] chest tube […]                […]
intubat […] chest tube […]                […]
open fracture […] penetrating […]         […]
Pain ≥ 7 + gold = 2 + pred > 2            −0.5
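
As a rough illustration of how regex-triggered modifiers like these could be layered on top of a base reward, here is a minimal sketch. Only the −0.5 severe-pain penalty comes from this card. The exact patterns, the pain-score parser, the gold/pred conditions for the other rules, and the `PLACEHOLDER_BONUS` magnitude are all assumptions made for the example.

```python
import re

# Patterns loosely based on the phrases named in the error analysis.
# The exact regexes used in training are not given; these are assumptions.
LIFESAVING = re.compile(r"intubat|chest tube|central line", re.IGNORECASE)
OPEN_INJURY = re.compile(r"open fracture|penetrating", re.IGNORECASE)
PAIN_SCORE = re.compile(
    r"pain\s*(?:score\s*)?(?:of\s*)?(\d{1,2})\s*/\s*10", re.IGNORECASE
)

PAIN_PENALTY = -0.5      # value stated in the card
PLACEHOLDER_BONUS = 0.5  # hypothetical magnitude, not from the card


def rule_modifier(case_text, gold, pred):
    """Return an additive reward modifier for one (case, prediction) pair."""
    mod = 0.0
    # Lifesaving intervention already performed: gold ESI 1 expected.
    if LIFESAVING.search(case_text) and gold == 1:
        mod += PLACEHOLDER_BONUS if pred == 1 else -PLACEHOLDER_BONUS
    # Open fracture or penetrating trauma: ESI 2 or more urgent expected.
    if OPEN_INJURY.search(case_text) and gold <= 2:
        mod += PLACEHOLDER_BONUS if pred <= 2 else -PLACEHOLDER_BONUS
    # Severe pain (>= 7) with gold ESI 2, but the prediction under-triaged.
    m = PAIN_SCORE.search(case_text)
    if m and int(m.group(1)) >= 7 and gold == 2 and pred > 2:
        mod += PAIN_PENALTY
    return mod


print(rule_modifier("Arrived intubated. Central line placed.", gold=1, pred=1))
print(rule_modifier("Flank pain 8/10 for two hours.", gold=2, pred=3))
```

Keeping the modifier additive and separate from the base reward makes it easy to log how often each rule fires during training.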

Other changes (informed by v48's failure):

  • Training budget raised 512 → 1024 tokens (matches the eval budget and sharply reduces clipping)
  • No-parse penalty hardened −0.5 → −2.0 (failing to produce a parseable answer must score worse than any wrong answer)
  • Warm-start from v46, 300 steps at LR 2e-7 (refinement, not relearning)
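
The no-parse requirement can be checked mechanically: an unparseable completion should score strictly below the worst reward any committed answer can receive. The sketch below assumes a simple distance-based base reward. Only the −2.0 penalty comes from this card; the other values and the function name are hypothetical.

```python
from typing import Optional

NO_PARSE_PENALTY = -2.0  # value stated in the card


def base_reward(gold: int, pred: Optional[int]) -> float:
    """Hypothetical distance-based reward; only the no-parse value is sourced."""
    if pred is None:              # completion contained no parseable ESI level
        return NO_PARSE_PENALTY
    distance = abs(gold - pred)
    if distance == 0:
        return 1.0                # assumed reward for an exact match
    return -0.25 * distance       # assumed penalty, worst case -1.0 (1 vs 5)


# Worst reward any committed (parseable) answer can receive.
worst_commit = min(
    base_reward(g, p) for g in range(1, 6) for p in range(1, 6)
)
# The no-parse penalty should sit strictly below it. Any additive rule
# modifiers would need to be included in this check as well.
assert NO_PARSE_PENALTY < worst_commit
```

If the inequality ever fails after a reward change, the policy could be rewarded for declining to answer rather than committing to a level.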

Training metrics

v49 was the cleanest GRPO run of this series:

  • clipped_ratio held at 4–19% throughout (vs v48's 90%+)
  • reward positive from step 10, peaked at +0.49
  • reward_std consistently ~1.0+ (strong GRPO learning signal)
  • 300 steps in 17h 48m on NVIDIA GB10
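
For readers unfamiliar with these metrics, the sketch below shows why they matter. It assumes `clipped_ratio` is the share of completions that reached the token budget, and it uses the standard GRPO idea of standardizing rewards within each prompt's group of samples, so a group with no reward spread contributes no gradient signal. The helper names and numbers are illustrative and not tied to the training framework used here.

```python
from statistics import mean, pstdev


def clipped_ratio(completion_lengths, max_tokens):
    """Fraction of completions that reached the token budget (assumed definition)."""
    return sum(n >= max_tokens for n in completion_lengths) / len(completion_lengths)


def group_advantages(rewards, eps=1e-4):
    """GRPO-style advantages: rewards standardized within one prompt's group.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Illustrative numbers only.
print(clipped_ratio([310, 1024, 540, 480], max_tokens=1024))  # 0.25
print(group_advantages([1.0, 1.0, 1.0, 1.0]))    # all zeros: no learning signal
print(group_advantages([1.0, -1.0, 1.0, -1.0]))  # close to +1 / -1
```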

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base    = "Qwen/Qwen3.5-9B"
adapter = "vadimbelsky/qwen3.5-esi-triage-grpo-v49"

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, adapter)
model.eval()

SYSTEM = (
    "You are an expert emergency triage nurse. "
    "Extract clinical fields, apply the ESI algorithm step by step, then state the ESI level. "
    "Be concise \u2014 stay under 150 words total."
)

case = ("A 78-year-old female arrived intubated for airway protection. "
        "Central line placed. BP 120/58, HR 150, RR 20, SpO2 97%.")

prompt = tokenizer.apply_chat_template(
    [{"role": "system", "content": SYSTEM},
     {"role": "user",   "content": case}],
    tokenize=False, add_generation_prompt=True,
)
out = model.generate(
    **tokenizer(prompt, return_tensors="pt").to(model.device),
    max_new_tokens=1024, temperature=0.1, do_sample=True,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
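
The decoded text contains the prompt and, in thinking mode, the model's reasoning, so a small post-processing step helps when only the final level is needed. The helper below is a sketch under two assumptions not confirmed by this card: that reasoning ends with a `</think>` marker that survives decoding, and that the answer mentions the level in a form like "ESI 2", "ESI-2", or "ESI level: 2".

```python
import re
from typing import Optional

# Assumed answer format; adjust the pattern to the model's actual output.
ESI_PATTERN = re.compile(r"ESI(?:\s+level)?\s*[-:=]?\s*([1-5])\b", re.IGNORECASE)


def extract_esi(text: str) -> Optional[int]:
    """Return the last ESI level mentioned after any thinking block, or None."""
    # Drop everything up to the end of a thinking block, if one is present.
    answer = text.rsplit("</think>", 1)[-1]
    matches = ESI_PATTERN.findall(answer)
    return int(matches[-1]) if matches else None


sample = "<think>Could be ESI 2?</think>Intubated on arrival. ESI level: 1"
print(extract_esi(sample))  # 1
```

Taking the last match rather than the first means a level that the output revises later in the answer wins over an earlier tentative mention.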

Limitations

Research model. Not approved for clinical use. The rule bonuses target the specific regex patterns that v46 was measured to miss; they should not be expected to generalize beyond those patterns. See the v46 model card for the full design journey and failure-mode lessons that shaped v49.
