Qwen3.5-35B-A3B TopK SAEs — Phase 2 (Replication of Venhoff et al. 2025)

TopK sparse autoencoders (SAEs) trained on layer-17 (L17) residual-stream activations extracted from Qwen/Qwen3.5-35B-A3B thinking traces.

Recipe (Venhoff et al. 2025, arXiv:2510.07364)

TopK activation, k=3
Decoder row-normalized after each step
MSE loss (sparsity via TopK, no L1 penalty)
TinySAE lr schedule: lr = 2e-4 / sqrt(n/16384), where n is the dictionary size
Adam, batch 512, max 300 epochs, patience 10, 90/10 train/val split
Activation source: sentence-level mean-pooled activations (41,285 sentences from 2,000 MMLU-Pro prompts)
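
Sketch of the recipe (illustrative only)

The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration, not the training code used for this repo: the function names, the encoder/decoder shapes, the pre-encoder bias subtraction, and the absence of a ReLU after TopK are all assumptions, and the Adam update itself is omitted. Only the forward pass, the decoder renormalization and the learning-rate formula are shown. The toy width d=64 is a placeholder, not the model's residual width.

```python
import math
import numpy as np

def tinysae_lr(n: int, base_lr: float = 2e-4, ref_size: int = 16384) -> float:
    """Learning rate from the recipe: 2e-4 / sqrt(n / 16384)."""
    return base_lr / math.sqrt(n / ref_size)

def topk_mask(pre: np.ndarray, k: int) -> np.ndarray:
    """Keep the k largest pre-activations in each row and zero the rest."""
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    return out

def normalize_rows(W_dec: np.ndarray) -> np.ndarray:
    """Rescale each decoder row (one dictionary direction) to unit L2 norm."""
    norms = np.linalg.norm(W_dec, axis=1, keepdims=True)
    return W_dec / np.maximum(norms, 1e-8)

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=3):
    """Encode, apply TopK, decode, and return the MSE reconstruction loss."""
    pre = (x - b_dec) @ W_enc + b_enc      # (batch, n); bias handling is an assumption
    codes = topk_mask(pre, k)              # at most k non-zeros per row
    x_hat = codes @ W_dec + b_dec          # (batch, d)
    mse = float(np.mean((x - x_hat) ** 2)) # no L1 term: sparsity comes from TopK
    return codes, x_hat, mse

# Toy usage with random data standing in for pooled activations.
rng = np.random.default_rng(0)
d, n, k = 64, 15, 3                        # d is a placeholder width
x = rng.normal(size=(512, d))              # one batch of 512 sentences
W_dec = normalize_rows(rng.normal(size=(n, d)))
W_enc = W_dec.T.copy()
codes, x_hat, mse = sae_forward(x, W_enc, np.zeros(n), W_dec, np.zeros(d), k)
lr = tinysae_lr(n)
```

With this formula the learning rate grows as the dictionary shrinks: n=16384 gives exactly 2e-4, while n=15 gives roughly 6.6e-3.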

Dict sizes swept

n	val MSE	var explained	cos sim
5	0.0007	0.066	0.870
10	0.0007	0.090	0.877
15	0.0007	0.110	0.882
20	0.0007	0.123	0.886
25	0.0007	0.132	0.888
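
The README does not spell out how the three columns are computed. The sketch below shows one common set of definitions, assumed here for illustration: MSE averaged over all elements, variance explained as one minus the ratio of squared error to total variance around the per-dimension mean, and cosine similarity averaged over sentences. The function name and the returned keys are hypothetical.

```python
import numpy as np

def reconstruction_metrics(x: np.ndarray, x_hat: np.ndarray) -> dict:
    """Compute MSE, variance explained and mean cosine similarity (assumed definitions)."""
    err = x - x_hat
    mse = float(np.mean(err ** 2))
    # Total variance is measured around the per-dimension mean of the inputs.
    total = float(np.sum((x - x.mean(axis=0)) ** 2))
    var_explained = 1.0 - float(np.sum(err ** 2)) / total
    # Cosine similarity per row, then averaged over the batch.
    num = np.sum(x * x_hat, axis=1)
    den = np.linalg.norm(x, axis=1) * np.linalg.norm(x_hat, axis=1)
    cos_sim = float(np.mean(num / np.maximum(den, 1e-12)))
    return {"val_mse": mse, "var_explained": var_explained, "cos_sim": cos_sim}

# Sanity checks on random data: a perfect reconstruction and a mean-only baseline.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(100, 8)) + 3.0
perfect = reconstruction_metrics(x_demo, x_demo)
baseline = reconstruction_metrics(x_demo, np.broadcast_to(x_demo.mean(axis=0), x_demo.shape))
```

Under these definitions a reconstruction that only reproduces the mean activation scores zero variance explained but can still have high cosine similarity when the mean is large relative to the per-sentence variation, so the two columns need not move together.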

Selected (elbow): n=15
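
The exact elbow criterion is not stated here. One simple reading, assumed for illustration, is to look at the marginal gain in variance explained between consecutive dictionary sizes and pick the size after which that gain drops the most. The numbers are copied from the table above.

```python
sizes = [5, 10, 15, 20, 25]
var_explained = [0.066, 0.090, 0.110, 0.123, 0.132]

# Gain in variance explained when moving from one size to the next.
gains = [round(b - a, 3) for a, b in zip(var_explained, var_explained[1:])]

# How much the gain shrinks at each interior size (assumed elbow criterion).
drops = [round(g1 - g2, 3) for g1, g2 in zip(gains, gains[1:])]
elbow = sizes[drops.index(max(drops)) + 1]

for (lo, hi), g in zip(zip(sizes, sizes[1:]), gains):
    print(f"n={lo}->{hi}: +{g:.3f} variance explained")
print("elbow at n =", elbow)
```

The gains are 0.024, 0.020, 0.013 and 0.009, and the largest drop in gain occurs after n=15, which agrees with the selected size under this criterion.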

Files

sae_n{5,10,15,20,25}.pt — all trained SAEs
cluster_data_n15.json — per-cluster top-100 + random-100 sentences (for LLM labeling in Phase 3)
summary.json — training metrics per dict size
sweep.png — MSE vs var-explained curves

Next (Phase 3)

Label each cluster with GPT-4o-mini from its top-100 activating sentences to obtain 10-20 named reasoning categories. Then train per-category steering vectors (Phase 4) and run hybrid inference on MATH500/GSM8K (Phase 5).