---
license: apache-2.0
tags:
- activation-steering
- sparse-autoencoder
- contrastive-activation-addition
- qwen
- interpretability
library_name: pytorch
base_model:
- Qwen/Qwen3.5-0.8B
- Qwen/Qwen3.5-2B
- Qwen/Qwen3.5-4B
---

# Qwen3.5 Mature-Anger Steering Artifacts

Steering vectors, sparse autoencoders, and cross-size transfer maps from a home-lab study of activation steering on Qwen3.5 models.

See the [code + full report on GitHub](https://github.com/Gussyy/qwen35-mature-anger-steering) and the paired dataset repo [`Rachata/qwen35-mature-anger-data`](https://huggingface.co/datasets/Rachata/qwen35-mature-anger-data).

## What's in here

| Path | Contents | Size |
|---|---|---|
| `vectors/qwen_large_L{6,10,14,18,22}_caa.pt` | Contrastive Activation Addition (CAA) vectors for Qwen3.5-2B, per layer | ~10 KB each |
| `vectors/qwen_small_L{6,10,14,18,22}_caa.pt` | CAA vectors for Qwen3.5-0.8B, per layer | ~6 KB each |
| `vectors/qwen_xlarge_L{13,18,23}_caa.pt` | CAA vectors for Qwen3.5-4B, per layer | ~12 KB each |
| `vectors/qwen_small_transferred_caa.pt` | Cross-size-transferred vectors (ridge / Procrustes / random baselines) | 27 KB |
| `vectors/transfer_map_large_to_small.pt` | Ridge + Procrustes alignment maps between 2B and 0.8B residual spaces | 24 MB |
| `saes/qwen_large_L14_sae.pt` | Top-K SAE (d_sae=8192, k=32) on Qwen3.5-2B at layer 14 | 129 MB |
| `saes/qwen_small_L14_sae.pt` | Top-K SAE (d_sae=4096, k=32) on Qwen3.5-0.8B at layer 14 | 33 MB |
| `saes/*_features.json` | Top-30 features ranked by steered-vs-base activation delta | <10 KB each |

## How to use

```python
import torch
from huggingface_hub import hf_hub_download

# Download the layer-14 CAA vector for Qwen3.5-2B from this repo.
path = hf_hub_download("Rachata/qwen35-mature-anger-steering", "vectors/qwen_large_L14_caa.pt")
caa = torch.load(path, weights_only=True)
v = caa["vector"]  # shape: (2048,)
```

Apply the vector as a forward hook on `Qwen/Qwen3.5-2B` at `model.model.layers[14]` with coefficient `c=+1.0`:

```python
# `model` is Qwen/Qwen3.5-2B, loaded beforehand.
def hook(m, inp, out):
    # Decoder layers may return a tuple; the residual stream is the first element.
    resid = out[0] if isinstance(out, tuple) else out
    # Add the steering vector (coefficient c = +1.0) at every token position.
    resid = resid + 1.0 * v.to(resid.device, resid.dtype)
    return (resid,) + out[1:] if isinstance(out, tuple) else resid

h = model.model.layers[14].register_forward_hook(hook)
# generate...
h.remove()  # detach the hook to restore unsteered behavior
```

## Key results (DeepSeek-judged, 1-5 rubric)

| Model | Best cell | mature_anger | juvenile_rage | coherence | Margin | PPL |
|-------|-----------|--------------|---------------|-----------|--------|-----|
| 0.8B | L=6 c=+1 | 1.0 | 1.0 | 4.0 | **0.0** | 1.05x |
| 2B | L=14 c=+1 | 3.5 | 1.0 | 5.0 | **+2.5** | 1.13x |
| 4B | L=13 c=+1 | 4.5 | 1.0 | 5.0 | **+3.5** | 1.11x |

The 2B SAE contains a dedicated mature-anger feature (id 4617) whose activation rises ~16x under steering (base mean 0.113 to steered mean 1.789) and which fires on 100% of tokens under steering. The 0.8B SAE has no comparable concentrated feature, and none of the training-free transfer methods tried (ridge, Procrustes, activation patching, SAE-feature clamping, multi-layer stacking) elicits the persona; only a system-prompt anchor does.

Full methodology, every sweep cell, and every judge score are in the [GitHub repo's `report.md`](https://github.com/Gussyy/qwen35-mature-anger-steering/blob/main/report.md).

## Citation

```bibtex
@misc{qwen35-mature-anger-steering,
  title  = {Qwen3.5 Mature-Anger Steering: A Home-Lab Study of Activation Steering and Cross-Size Transfer},
  author = {Rachata},
  year   = {2026},
  url    = {https://huggingface.co/Rachata/qwen35-mature-anger-steering}
}
```
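
The forward-hook pattern from "How to use" can be sanity-checked without downloading any weights. The sketch below is a minimal illustration under stated assumptions, not the repo's code: `ToyBlock`, `make_steering_hook`, and the toy width of 8 are hypothetical stand-ins, and a random `v` takes the place of a real CAA vector. It shows that the hook shifts the layer output at every token position by `coeff * v`, and that removing the handle restores the unsteered output.

```python
import torch
from torch import nn


class ToyBlock(nn.Module):
    """Hypothetical stand-in for a decoder layer: returns a tuple whose
    first element plays the role of the residual stream."""

    def __init__(self, d_model: int) -> None:
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor):
        return (self.proj(x), None)


def make_steering_hook(v: torch.Tensor, coeff: float = 1.0):
    """Build a forward hook that adds coeff * v at every token position."""

    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        resid = resid + coeff * v.to(resid.device, resid.dtype)
        if isinstance(output, tuple):
            return (resid,) + output[1:]
        return resid

    return hook


torch.manual_seed(0)
d_model = 8                      # toy width (the 2B vector in this repo has 2048 dims)
block = ToyBlock(d_model)
v = torch.randn(d_model)         # random placeholder for a CAA vector
x = torch.randn(2, 5, d_model)   # (batch, tokens, d_model)

with torch.no_grad():
    base = block(x)[0]
    handle = block.register_forward_hook(make_steering_hook(v, coeff=1.0))
    steered = block(x)[0]
    handle.remove()
    restored = block(x)[0]

# Difference between steered and unsteered outputs; should equal v at each position.
shift = steered - base
```

Wrapping the hook in a factory keeps the coefficient explicit, so sweeping `coeff` (as in the L/c cells of the results table) only requires registering a new hook rather than editing the function body.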