Qwen3.5-4B mechreward G3 Phase A — step 400
📖 Full write-up: Per-token SAE features as online RL reward: breaking the G2 76% ceiling (LessWrong, 2026-04-17)
LoRA adapter for Qwen/Qwen3.5-4B trained with per-token SAE-feature
dense reward (mechreward) on GSM8K. Breaks the 76 % G2 R1 ceiling for
Qwen3.5-4B with raw-prompt GRPO, reaching 83 % GSM8K (500Q greedy) at just
400 training steps.
This is the first public LoRA adapter trained with mechanistic-interpretability features as an online dense RL reward. The associated SAE and full project description are at caiovicentino1/Qwen3.5-4B-SAE-L18-topk and github.com/caiovicentino/mechreward.
Results @ step 400
| Metric | Baseline | This model | Δ |
|---|---|---|---|
| GSM8K (500Q greedy, raw prompt) | 64.00 % | 83.00 % | +19 pp |
| MMLU (200Q raw zeroshot) | 50.00 % | 54.50 % | +4.50 pp |
| MATH-500 transfer (500Q greedy, not trained on) | — | 18.20 % | — |
| Hack rate (canaries n=50) | 4.0 % (2/50) | 8.0 % (4/50) | +4 pp (within 95 % CI) |
| Correct under canary (n=50) | 18.0 % | 28.0 % | +10 pp |
| Ambiguous under canary (n=50) | 78.0 % | 64.0 % | −14 pp |
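
The "within 95 % CI" note on the hack-rate row can be sanity-checked with a binomial confidence interval. The card does not say which interval was used, so the Wilson score interval below is an assumption, and `wilson_interval` is a hypothetical helper name, not part of mechreward.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95 %)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

base_lo, base_hi = wilson_interval(2, 50)    # baseline: 2/50 hacks
model_lo, model_hi = wilson_interval(4, 50)  # this adapter: 4/50 hacks

# Two intervals overlap when each lower bound is below the other's upper bound.
overlap = base_lo <= model_hi and model_lo <= base_hi
print(f"baseline 95% CI: [{base_lo:.3f}, {base_hi:.3f}]")
print(f"adapter  95% CI: [{model_lo:.3f}, {model_hi:.3f}]")
print("intervals overlap:", overlap)
```

With only 50 canaries per arm the intervals are wide, so a difference of two hacked responses cannot be separated from sampling noise.
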
Ceiling-break claim: the prior Stage Gate 2 result on the same base model
and same SAE features was R1 = 76 % after 100 steps of trajectory-level
mech-reward GRPO. This adapter reaches 83 % at nominal step 400, but the
first 232 steps ran at the G2-documented LR=1e-6 and produced zero lift
(KL stuck at 0.018, eval still 64 % == baseline). Only after raising LR to
3e-6 at step 232 did training actually move. So the effective budget that
broke the ceiling was 168 steps at LR=3e-6 (step 232 → 400), roughly
1.68× G2 R1's 100-step budget for a +7 pp gain on GSM8K and +19 pp
over the raw-prompt baseline, using the same 20 contrastive features
(10 helpful + 10 harmful).
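
The budget arithmetic in the paragraph above can be reproduced directly from the numbers it quotes. This is only a restatement of those figures; the variable names are illustrative.

```python
nominal_steps = 400   # step count at which this adapter was saved
lr_raise_step = 232   # step where LR went from 1e-6 to 3e-6
g2_r1_steps = 100     # budget of the prior G2 R1 run

# Steps that actually moved the policy (232 -> 400).
effective_steps = nominal_steps - lr_raise_step
budget_ratio = effective_steps / g2_r1_steps

baseline_acc, g2_r1_acc, g3_acc = 64.0, 76.0, 83.0  # GSM8K accuracy in %
gain_over_g2 = g3_acc - g2_r1_acc
gain_over_baseline = g3_acc - baseline_acc

print(f"effective steps: {effective_steps} ({budget_ratio:.2f}x the G2 R1 budget)")
print(f"gain over G2 R1: +{gain_over_g2:.0f} pp, over baseline: +{gain_over_baseline:.0f} pp")
```
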
Training recipe
- Base model: `Qwen/Qwen3.5-4B` (multimodal; load with `AutoModelForImageTextToText`)
- SAE: `caiovicentino1/Qwen3.5-4B-SAE-L18-topk`, residual stream, post-layer 18, d_sae=40960, k=128
- Features: 10 helpful + 10 harmful from contrastive-discovery pack (`mean_correct − mean_wrong` on 50 baseline GSM8K responses; Cohen's d > 2 for all 20). Stage Gate 1 validated at ρ=0.540 (p<0.0001, n=100 held-out).
- Reward: `R = outcome + λ · mech_per_token`, where `mech(t) = Σ helpful_activations(t) − Σ harmful_activations(t)` at L18 residual
- Algorithm: GRPO (group-relative advantage) with per-token mech-reward bonus
- LoRA: r=32, α=64, dropout=0, targets `{q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj}` on `language_model` only (vision tower frozen)
- Hyperparameters: LR=3e-6, KL β=0.05, λ=0.1, 4 rollouts × 4 questions/step, 7500 GSM8K train, raw prompt `Q: {q}\nA: Let's think step by step.`, max_gen_len=256, bf16 log_softmax, seed=42
- Memory optimization: single-model GRPO using `model.disable_adapter()` as reference policy (saves the 8 GB of a separate frozen ref model); gradient checkpointing ON during train forward only, OFF during rollout `generate` so KV cache remains active
- Training run: 400 nominal steps @ ~34 s/step on NVIDIA RTX PRO 6000 Blackwell 96 GB (Colab Pro+), ~3.8 h wall clock. Effective training (at LR=3e-6 after diagnostic patch): 168 steps, step 232 → 400, ~1.6 h
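
The reward and advantage definitions in the recipe can be sketched in plain Python. This is a toy illustration under stated assumptions, not the mechreward implementation: the function names are hypothetical, the activations are made-up numbers, and the card does not specify how the per-token bonus is merged with the group-relative advantage or which standard-deviation estimator is used (population std is assumed here).

```python
from statistics import mean, pstdev

def mech_per_token(helpful_acts, harmful_acts):
    """mech(t) = sum of helpful activations at t minus sum of harmful ones.

    Each argument is a list over tokens; each entry holds that token's
    activations for the selected SAE features.
    """
    return [sum(h) - sum(b) for h, b in zip(helpful_acts, harmful_acts)]

def per_token_reward(outcome, mech, lam=0.1):
    """R(t) = outcome + lam * mech(t); lam=0.1 matches the recipe's lambda."""
    return [outcome + lam * m for m in mech]

def group_relative_advantages(outcomes, eps=1e-6):
    """GRPO-style advantage: standardize rewards within one question's rollouts."""
    mu, sigma = mean(outcomes), pstdev(outcomes)
    return [(r - mu) / (sigma + eps) for r in outcomes]

# Toy rollout: 2 tokens, 2 features per side (the real pack has 10 + 10).
helpful = [[1.0, 2.0], [0.0, 0.5]]
harmful = [[0.5, 0.5], [1.0, 0.0]]
mech = mech_per_token(helpful, harmful)
rewards = per_token_reward(outcome=1.0, mech=mech)

# Four rollouts of one question, two of them correct (outcome 1.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(mech, rewards, advantages)
```

The dense term gives every token its own signal, while the outcome term and the group normalization stay at the trajectory level.
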
Critical debugging note (applies to anyone reproducing this)
The G2-documented LR=1e-6 stalled at step 200 with quick_gsm8k=64 %
(same as baseline — zero lift). We initially suspected gradient clipping, but
logging clip_grad_norm_'s return value showed gnorm was always < 0.5 —
i.e. clip=1.0 was inert and never triggered. The real bottleneck was
LR. Raising to 3e-6 at step 232 produced immediate learning: KL
0.018 → 0.11 and mech signal reversed from −0.02 → +0.58 peak within
~100 steps. Verify gnorm and KL are rising before attributing stalled
training to clipping or algorithm choice.
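
The clipping check described above relies on the fact that `torch.nn.utils.clip_grad_norm_` returns the total gradient norm measured before clipping. The sketch below mimics that computation in pure Python on toy gradients to show why a norm below the threshold makes clipping a no-op; `global_grad_norm` and `clip_scale` are hypothetical helper names, and the small `eps` stabilizer is an assumption.

```python
import math

def global_grad_norm(grads):
    """L2 norm of all gradient values viewed as one flat vector."""
    return math.sqrt(sum(g * g for tensor in grads for g in tensor))

def clip_scale(total_norm, max_norm=1.0, eps=1e-6):
    """Factor applied to gradients under norm clipping (1.0 means inert)."""
    return min(1.0, max_norm / (total_norm + eps))

# Toy gradients for two parameter tensors (made-up values).
toy_grads = [[0.1, -0.2], [0.3, 0.05]]
gnorm = global_grad_norm(toy_grads)
scale = clip_scale(gnorm, max_norm=1.0)

# If scale stays at 1.0 for every logged step, clipping never fired,
# and the stall has to come from somewhere else (here: the learning rate).
print(f"gnorm={gnorm:.3f} scale={scale:.3f} clip_active={scale < 1.0}")
```

In a real training loop the same check is a one-line log of the value returned by `clip_grad_norm_` next to the per-step KL.
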
Usage
from transformers import AutoTokenizer, AutoModelForImageTextToText
from peft import PeftModel
import torch

tok = AutoTokenizer.from_pretrained('Qwen/Qwen3.5-4B', trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    'Qwen/Qwen3.5-4B',
    dtype=torch.bfloat16,
    device_map='cuda',
    attn_implementation='sdpa',
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(
    model,
    'caiovicentino1/Qwen3.5-4B-mechreward-G3-phaseA-step400',
)
model.eval()

prompt = (
    "Q: Mimi picked up 24 seashells. Kyle found twice as many shells as Mimi. "
    "Leigh grabbed one-third of the shells that Kyle found. How many does Leigh have?\n"
    "A: Let's think step by step."
)
enc = tok(prompt, return_tensors='pt').to('cuda')
out = model.generate(**enc, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][enc.input_ids.shape[1]:], skip_special_tokens=True))
What this is (and is not)
- ✅ A reproducible demonstration that SAE features can serve as a dense, per-token RL reward signal that improves math-reasoning accuracy beyond outcome-only RL.
- ✅ Paper-ready artifact for the mechreward line of research: C1 (capability preserved: MMLU +4.5 pp), C2-extended (≥80 % GSM8K: 83 %), anti-Goodhart (hack rate 8 % ≪ 30 % threshold) all met.
- ❌ Not a production reasoning model. Trained only on GSM8K with a narrow feature pack; strong transfer to MATH-500 was NOT observed (18 % on MATH-500 is near baseline for a 4B model).
- ❌ Not a safety tool. The mechreward framework improves one axis of robustness (resistance to adversarial canary prompts) but introduces no new safety guarantees beyond standard RLHF.
Citing
@software{mechreward2026,
  author = {Vicentino, Caio},
  title = {mechreward: A library for using SAE features as RL reward signals},
  year = {2026},
  url = {https://github.com/caiovicentino/mechreward}
}
If you use this adapter in a paper, please also cite the SAE repository
caiovicentino1/Qwen3.5-4B-SAE-L18-topk
and the prior work it builds on: SARM (arxiv:2508.08746), CRL
(arxiv:2602.10437), and Wilhelm et al. SAE reward-hacking detection
(arxiv:2603.04069).