Qwen3.5 Mature-Anger Steering Artifacts

Steering vectors, sparse autoencoders, and cross-size transfer maps from a home-lab study of activation steering on Qwen3.5 models.

The code and full report are on GitHub; the paired dataset repo is Rachata/qwen35-mature-anger-data.

What's in here

Path	Contents	Size
`vectors/qwen_large_L{6,10,14,18,22}_caa.pt`	Contrastive Activation Addition vectors for Qwen3.5-2B, per layer	~10 KB each
`vectors/qwen_small_L{6,10,14,18,22}_caa.pt`	CAA vectors for Qwen3.5-0.8B	~6 KB each
`vectors/qwen_xlarge_L{13,18,23}_caa.pt`	CAA vectors for Qwen3.5-4B	~12 KB each
`vectors/qwen_small_transferred_caa.pt`	Cross-size-transferred vectors (ridge / Procrustes / random baselines)	27 KB
`vectors/transfer_map_large_to_small.pt`	Ridge + Procrustes alignment maps between 2B and 0.8B residual spaces	24 MB
`saes/qwen_large_L14_sae.pt`	Top-K SAE (d_sae=8192, k=32) on Qwen3.5-2B at layer 14	129 MB
`saes/qwen_small_L14_sae.pt`	Top-K SAE (d_sae=4096, k=32) on Qwen3.5-0.8B at layer 14	33 MB
`saes/*_features.json`	Top-30 features ranked by steered-vs-base activation delta	<10 KB each
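The brace patterns in the table expand to one file per layer. A minimal sketch of that expansion follows; the helper name `caa_filename` and the `CAA_LAYERS` mapping are hypothetical conveniences written for this card, not part of the repo, and the layer lists are copied from the table above.

```python
# Layers for which the table lists a CAA vector, keyed by filename prefix.
# This mapping is an illustrative assumption built from the table; it is
# not shipped with the repo.
CAA_LAYERS = {
    "qwen_small": (6, 10, 14, 18, 22),   # Qwen3.5-0.8B
    "qwen_large": (6, 10, 14, 18, 22),   # Qwen3.5-2B
    "qwen_xlarge": (13, 18, 23),         # Qwen3.5-4B
}


def caa_filename(prefix: str, layer: int) -> str:
    """Return the in-repo path of a per-layer CAA vector (hypothetical helper)."""
    if layer not in CAA_LAYERS[prefix]:
        raise ValueError(f"no CAA vector listed for {prefix} at layer {layer}")
    return f"vectors/{prefix}_L{layer}_caa.pt"


# Every per-layer CAA file named in the table.
all_files = [
    caa_filename(prefix, layer)
    for prefix, layers in CAA_LAYERS.items()
    for layer in layers
]
```

Each returned path can be passed as the second argument to `hf_hub_download`.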

How to use

import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download("Rachata/qwen35-mature-anger-steering",
                      "vectors/qwen_large_L14_caa.pt")
caa = torch.load(path, weights_only=True)
v = caa["vector"]   # shape: (2048,)

Apply as a forward hook on Qwen/Qwen3.5-2B at model.model.layers[14] with coefficient c=+1.0:

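c = 1.0  # steering coefficient

def hook(m, inp, out):
    # A decoder layer may return a tuple; the residual stream is element 0.
    resid = out[0] if isinstance(out, tuple) else out
    # Add the scaled steering vector at every token position.
    resid = resid + c * v.to(resid.device, resid.dtype)
    return (resid,) + out[1:] if isinstance(out, tuple) else resid

# `model` is an already-loaded Qwen/Qwen3.5-2B causal LM.
h = model.model.layers[14].register_forward_hook(hook)
# generate...
h.remove()  # detach the hook so later calls run unsteered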
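To check the hook logic without downloading a model, the same pattern can be run on a toy module. This is only a sketch: `ToyLayer` and `make_steering_hook` are hypothetical stand-ins, and the random vector plays the role of a real CAA vector. It checks that the hooked output differs from the baseline by `c * v` at every position, and that removing the hook restores the original behavior.

```python
import torch
from torch import nn


class ToyLayer(nn.Module):
    """Stand-in for a decoder layer: returns a tuple whose first item is the residual."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        return (self.proj(x), None)


def make_steering_hook(v: torch.Tensor, c: float = 1.0):
    """Build a forward hook that adds c * v to the layer's residual output."""

    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        resid = resid + c * v.to(resid.device, resid.dtype)
        return (resid,) + output[1:] if isinstance(output, tuple) else resid

    return hook


torch.manual_seed(0)
d_model = 8
layer = ToyLayer(d_model)
x = torch.randn(2, 3, d_model)   # (batch, seq, d_model)
v = torch.randn(d_model)         # placeholder for a real steering vector

with torch.no_grad():
    base = layer(x)[0]
    handle = layer.register_forward_hook(make_steering_hook(v, c=1.0))
    try:
        steered = layer(x)[0]
    finally:
        handle.remove()          # always detach, even if the forward pass fails
    restored = layer(x)[0]

shift = steered - base           # equals v broadcast over batch and sequence
```

Wrapping the steered call in `try`/`finally` is a design choice, not something the original snippet does: it keeps a failed generation from leaving the hook attached.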

Key results (DeepSeek-judged, 1-5 rubric)

Model	Best cell	mature_anger	juvenile_rage	coherence	Margin	PPL
0.8B	L=6 c=+1	1.0	1.0	4.0	0.0	1.05x
2B	L=14 c=+1	3.5	1.0	5.0	+2.5	1.13x
4B	L=13 c=+1	4.5	1.0	5.0	+3.5	1.11x

The 2B SAE contains a dedicated mature-anger feature (id 4617) whose mean activation rises ~16x under steering (base mean 0.113 -> steered mean 1.789) and which fires on 100% of tokens. The 0.8B model has no comparably concentrated feature, and none of the training-free transfer methods tried (ridge, Procrustes, activation patching, SAE-feature clamping, multi-layer stacking) elicits the persona; only a system-prompt anchor does.
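The derived numbers above can be re-checked with a few lines of arithmetic. One assumption is flagged here: the card does not define the Margin column, and this sketch treats it as `mature_anger - juvenile_rage`, which happens to match all three rows. The score values are copied from the table and the activation means from the paragraph above.

```python
# (model, mature_anger, juvenile_rage, reported_margin), copied from the table.
rows = [
    ("0.8B", 1.0, 1.0, 0.0),
    ("2B", 3.5, 1.0, 2.5),
    ("4B", 4.5, 1.0, 3.5),
]

# ASSUMPTION: Margin = mature_anger - juvenile_rage (not stated in the card).
margins = {model: mature - juvenile for model, mature, juvenile, _ in rows}

# Activation ratio of SAE feature 4617: steered mean over base mean.
base_mean, steered_mean = 0.113, 1.789
ratio = steered_mean / base_mean   # about 15.8, reported as ~16x

for model, _, _, reported in rows:
    print(f"{model}: margin {margins[model]:+.1f} (reported {reported:+.1f})")
print(f"feature 4617 activation ratio: {ratio:.2f}x")
```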

The full methodology, every sweep cell, and every judge score are in the GitHub repo's report.md.

Citation

@misc{qwen35-mature-anger-steering,
  title = {Qwen3.5 Mature-Anger Steering: A Home-Lab Study of
           Activation Steering and Cross-Size Transfer},
  author = {Rachata},
  year = {2026},
  url = {https://huggingface.co/Rachata/qwen35-mature-anger-steering}
}
