Qwen3.5-4B mechreward G3 Phase A — step 400

📖 Full write-up: Per-token SAE features as online RL reward: breaking the G2 76% ceiling (LessWrong, 2026-04-17)

LoRA adapter for Qwen/Qwen3.5-4B trained on GSM8K with a per-token, SAE-feature dense reward (mechreward). It breaks the 76 % G2 R1 ceiling for Qwen3.5-4B with raw-prompt GRPO, reaching 83 % on GSM8K (500 questions, greedy decoding) at 400 nominal training steps.

This is the first public LoRA adapter trained with mechanistic-interpretability features as an online dense RL reward. The associated SAE and full project description are at caiovicentino1/Qwen3.5-4B-SAE-L18-topk and github.com/caiovicentino/mechreward.

Results @ step 400

Metric	Baseline	This model	Δ
GSM8K (500Q greedy, raw prompt)	64.00 %	83.00 %	+19 pp
MMLU (200Q raw zeroshot)	50.00 %	54.50 %	+4.50 pp
MATH-500 transfer (500Q greedy, not trained on)	—	18.20 %	—
Hack rate (canaries n=50)	4.0 % (2/50)	8.0 % (4/50)	+4 pp (within 95 % CI)
Correct under canary (n=50)	18.0 %	28.0 %	+10 pp
Ambiguous under canary (n=50)	78.0 %	64.0 %	−14 pp
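
The "within 95 % CI" note on the hack-rate row can be sanity-checked with a binomial confidence interval. The card does not say which interval was used, so the sketch below assumes a Wilson score interval; `wilson_interval` is a hypothetical helper, not part of mechreward.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95 %)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


baseline = wilson_interval(2, 50)  # hack rate 4.0 % (2/50)
adapter = wilson_interval(4, 50)   # hack rate 8.0 % (4/50)

# Two intervals overlap when each lower bound sits below the other's upper bound.
overlap = baseline[0] <= adapter[1] and adapter[0] <= baseline[1]
print(f"baseline: {baseline[0]:.3f} to {baseline[1]:.3f}")
print(f"adapter:  {adapter[0]:.3f} to {adapter[1]:.3f}")
print("intervals overlap:", overlap)
```

With only 50 canaries, each interval spans more than ten percentage points, so a 4 pp difference cannot be resolved at this sample size.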

Ceiling-break claim: the prior Stage Gate 2 (G2) result on the same base model and the same SAE features was R1 = 76 % after 100 steps of trajectory-level mech-reward GRPO. This adapter reaches 83 % at nominal step 400, but the first 232 steps ran at the G2-documented LR=1e-6 and produced no lift (KL stuck at 0.018, eval still at the 64 % baseline). Training only started to move after LR was raised to 3e-6 at step 232. The effective budget that broke the ceiling was therefore 168 steps at LR=3e-6 (step 232 → 400), roughly 1.68× the 100-step budget of G2 R1. That budget bought +7 pp over G2 R1 on GSM8K and +19 pp over the raw-prompt baseline, using the same 20 contrastive features (10 helpful + 10 harmful).
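
The budget arithmetic behind this claim is easy to recompute. The sketch below uses only numbers quoted in this card (the ~34 s/step figure comes from the training-run entry and is approximate); the variable names are illustrative.

```python
nominal_steps = 400
lr_raise_step = 232   # LR raised from 1e-6 to 3e-6 at this step
g2_r1_steps = 100     # step budget of the prior G2 R1 run
sec_per_step = 34     # approximate wall-clock cost per step

effective_steps = nominal_steps - lr_raise_step
budget_ratio = effective_steps / g2_r1_steps
total_hours = nominal_steps * sec_per_step / 3600
effective_hours = effective_steps * sec_per_step / 3600

gain_over_g2_pp = 83 - 76        # GSM8K accuracy vs. G2 R1
gain_over_baseline_pp = 83 - 64  # GSM8K accuracy vs. raw-prompt baseline

print(f"effective steps: {effective_steps} ({budget_ratio:.2f}x G2 R1)")
print(f"wall clock: {total_hours:.1f} h total, {effective_hours:.1f} h effective")
print(f"gain: +{gain_over_g2_pp} pp over G2 R1, +{gain_over_baseline_pp} pp over baseline")
```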

Training recipe

Base model: Qwen/Qwen3.5-4B (multimodal; load with AutoModelForImageTextToText)
SAE: caiovicentino1/Qwen3.5-4B-SAE-L18-topk — residual stream, post-layer 18, d_sae=40960, k=128
Features: 10 helpful + 10 harmful from contrastive-discovery pack (mean_correct − mean_wrong on 50 baseline GSM8K responses; Cohen's d > 2 for all 20). Stage Gate 1 validated at ρ=0.540 (p<0.0001, n=100 held-out).
Reward: R = outcome + λ · mech_per_token, where mech(t) = Σ helpful_activations(t) − Σ harmful_activations(t) at L18 residual
Algorithm: GRPO (group-relative advantage) with per-token mech-reward bonus
LoRA: r=32, α=64, dropout=0, targets {q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj} on language_model only (vision tower frozen)
Hyperparameters: LR=3e-6 (effective from step 232; the first 232 steps ran at 1e-6 and produced no lift), KL β=0.05, λ=0.1, 4 rollouts × 4 questions/step, 7500 GSM8K train, raw prompt Q: {q}\nA: Let's think step by step., max_gen_len=256, bf16 log_softmax, seed=42
Memory optimization: single-model GRPO using model.disable_adapter() as reference policy (saves the 8 GB of a separate frozen ref model); gradient checkpointing ON during train forward only, OFF during rollout generate so KV cache remains active
Training run: 400 nominal steps @ ~34 s/step on NVIDIA RTX PRO 6000 Blackwell 96 GB (Colab Pro+), ~3.8 h wall clock. Effective training (at LR=3e-6 after diagnostic patch): 168 steps, step 232 → 400, ~1.6 h
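
To make the Features, Reward and Algorithm entries above concrete, here is a minimal pure-Python sketch of the three pieces: scoring a feature with Cohen's d, computing mech(t), and adding the λ-weighted bonus to a group-relative advantage. This is one plausible reading of the recipe, not the mechreward implementation: all function names are hypothetical, and the card does not specify how the outcome term and the per-token bonus are normalized or combined inside the loss.

```python
from statistics import mean, pstdev, variance


def cohens_d(correct, wrong):
    """Effect size of one feature: (mean_correct - mean_wrong) / pooled sample std."""
    n_c, n_w = len(correct), len(wrong)
    pooled_var = ((n_c - 1) * variance(correct) + (n_w - 1) * variance(wrong)) / (n_c + n_w - 2)
    return (mean(correct) - mean(wrong)) / pooled_var ** 0.5


def mech_per_token(acts, helpful, harmful):
    """mech(t) = sum of helpful activations minus sum of harmful activations.

    acts[t][i] is the activation of SAE feature i at token t.
    """
    return [sum(a[i] for i in helpful) - sum(a[i] for i in harmful) for a in acts]


def group_relative_advantages(outcomes, eps=1e-6):
    """GRPO-style advantage: outcome reward standardized within one rollout group."""
    mu, sd = mean(outcomes), pstdev(outcomes)
    return [(r - mu) / (sd + eps) for r in outcomes]


def per_token_signal(outcome_adv, mech, lam=0.1):
    """Assumed combination: sequence-level advantage plus lam * mech(t) at each token."""
    return [outcome_adv + lam * m for m in mech]


# Toy data: 2 tokens, 4 SAE features; features 0-1 play "helpful", 2-3 "harmful".
acts = [[1.0, 0.5, 0.2, 0.0], [0.0, 0.0, 1.0, 0.5]]
mech = mech_per_token(acts, helpful=[0, 1], harmful=[2, 3])
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])  # 4 rollouts of one question
signal = per_token_signal(advs[0], mech, lam=0.1)
print(mech, advs, signal)
```

In a real run the activations would come from the SAE encoder applied to the layer-18 residual stream, and the helpful and harmful index lists would be the 10 + 10 features from the contrastive-discovery pack.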

Critical debugging note (applies to anyone reproducing this)

The G2-documented LR=1e-6 had produced no lift by step 200: quick_gsm8k was 64 %, the same as baseline. We initially suspected gradient clipping, but logging the return value of clip_grad_norm_ showed that gnorm was always < 0.5, so clip=1.0 was inert and never triggered. The real bottleneck was LR. Raising it to 3e-6 at step 232 produced immediate learning: KL rose from 0.018 to 0.11, and the mech signal reversed from −0.02 to a peak of +0.58 within ~100 steps. Before attributing stalled training to clipping or algorithm choice, verify that gnorm and KL are actually rising.
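
The check described above can be turned into a small diagnostic over logged values. The sketch below is framework-free: `global_grad_norm` computes the same quantity that `torch.nn.utils.clip_grad_norm_` returns (the L2 norm of all gradients treated as one vector), and `diagnose_stall` is a hypothetical helper whose thresholds are illustrative assumptions, not values from the training run.

```python
import math


def global_grad_norm(grads):
    """L2 norm over every gradient entry, with all tensors treated as one vector."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))


def diagnose_stall(gnorms, kls, max_norm=1.0, min_kl_growth=2.0):
    """Classify a stalled run from logged pre-clip grad norms and KL values."""
    clip_fraction = sum(g > max_norm for g in gnorms) / len(gnorms)
    kl_growth = kls[-1] / max(kls[0], 1e-12)
    if clip_fraction > 0.5:
        verdict = "clipping is active on most steps; revisit max_norm"
    elif kl_growth < min_kl_growth:
        verdict = "clipping is inert and KL is flat; try a larger LR"
    else:
        verdict = "policy is moving; look elsewhere for the stall"
    return {"clip_fraction": clip_fraction, "kl_growth": kl_growth, "verdict": verdict}


# Pattern reported for the LR=1e-6 phase: gnorm below 0.5, KL pinned at 0.018.
print(diagnose_stall(gnorms=[0.31, 0.42, 0.47], kls=[0.018, 0.018, 0.018]))
# Pattern reported after the LR raise: KL climbs from 0.018 to 0.11.
print(diagnose_stall(gnorms=[0.31, 0.42, 0.47], kls=[0.018, 0.05, 0.11]))
```

In a PyTorch loop, the values in `gnorms` would be the floats obtained from the return value of `clip_grad_norm_` at each step.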

Usage

from transformers import AutoTokenizer, AutoModelForImageTextToText
from peft import PeftModel
import torch

# Qwen3.5-4B is multimodal, so the base model is loaded through the
# image-text-to-text class even for text-only prompts.
tok = AutoTokenizer.from_pretrained('Qwen/Qwen3.5-4B', trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    'Qwen/Qwen3.5-4B',
    dtype=torch.bfloat16,
    device_map='cuda',
    attn_implementation='sdpa',
    trust_remote_code=True,
)
# Attach the LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(
    model,
    'caiovicentino1/Qwen3.5-4B-mechreward-G3-phaseA-step400',
)
model.eval()

prompt = (
    "Q: Mimi picked up 24 seashells. Kyle found twice as many shells as Mimi. "
    "Leigh grabbed one-third of the shells that Kyle found. How many does Leigh have?\n"
    "A: Let's think step by step."
)
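enc = tok(prompt, return_tensors='pt').to('cuda')
# Greedy decoding with the same 256-token generation budget used in training.
out = model.generate(**enc, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][enc.input_ids.shape[1]:], skip_special_tokens=True))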
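
To score a generation like the one above against a GSM8K gold answer, a common approach is to take the last number in the completion. The card does not state how answers were extracted for the reported accuracy, so the helpers below are a hypothetical sketch rather than the evaluation code behind the 83 % figure.

```python
import re

# Integers or decimals, optionally negative, with optional thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number mentioned in the text as a float, or None if absent."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(completion, gold, tol=1e-6):
    """Compare the extracted final number with the gold answer."""
    predicted = extract_final_number(completion)
    return predicted is not None and abs(predicted - float(gold)) <= tol


# For the seashell prompt: Kyle has 2 * 24 = 48, Leigh has 48 / 3 = 16.
sample = "Kyle found 2 * 24 = 48 shells. Leigh grabbed 48 / 3 = 16 shells."
print(extract_final_number(sample), is_correct(sample, 16))
```

Last-number extraction is a heuristic: it misfires when the model keeps writing after its final answer, and an expression such as "48-3" would be read as a negative number, so stricter parsing may be needed for careful evaluation.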

What this is (and is not)

✅ A reproducible demonstration that SAE features can serve as a dense, per-token RL reward signal that improves math-reasoning accuracy beyond outcome-only RL.
✅ Paper-ready artifact for the mechreward line of research: C1 (capability preserved: MMLU +4.5 pp), C2-extended (≥80 % GSM8K: 83 %), anti-Goodhart (hack rate 8 % ≪ 30 % threshold) all met.
❌ Not a production reasoning model. Trained only on GSM8K with a narrow feature pack; strong transfer to MATH-500 was NOT observed (18.2 % on MATH-500 is near baseline for a 4B model, and no base-model MATH-500 score is reported in the results table).
❌ Not a safety tool. The mechreward framework improves one axis of robustness (resistance to adversarial canary prompts) but introduces no new safety guarantees beyond standard RLHF.

Citing

@software{mechreward2026,
  author = {Vicentino, Caio},
  title  = {mechreward: A library for using SAE features as RL reward signals},
  year   = {2026},
  url    = {https://github.com/caiovicentino/mechreward}
}

If you use this adapter in a paper, please also cite the SAE repository caiovicentino1/Qwen3.5-4B-SAE-L18-topk and the prior work it builds on: SARM (arxiv:2508.08746), CRL (arxiv:2602.10437), and Wilhelm et al. SAE reward-hacking detection (arxiv:2603.04069).