Rachata committed on
Commit
cd2ddbf
·
verified ·
1 Parent(s): 8c017e7

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +92 -0
README.md ADDED
@@ -0,0 +1,92 @@
 
---
license: apache-2.0
tags:
- activation-steering
- sparse-autoencoder
- contrastive-activation-addition
- qwen
- interpretability
library_name: pytorch
base_model:
- Qwen/Qwen3.5-0.8B
- Qwen/Qwen3.5-2B
- Qwen/Qwen3.5-4B
---

# Qwen3.5 Mature-Anger Steering Artifacts

Steering vectors, sparse autoencoders, and cross-size transfer maps from
a home-lab study of activation steering on Qwen3.5 models.

See the [code and full report on GitHub](https://github.com/Gussyy/qwen35-mature-anger-steering)
and the paired dataset repo
[`Rachata/qwen35-mature-anger-data`](https://huggingface.co/datasets/Rachata/qwen35-mature-anger-data).

## What's in here

| Path | Contents | Size |
|---|---|---|
| `vectors/qwen_large_L{6,10,14,18,22}_caa.pt` | Contrastive Activation Addition (CAA) vectors for Qwen3.5-2B, one per layer | ~10 KB each |
| `vectors/qwen_small_L{6,10,14,18,22}_caa.pt` | CAA vectors for Qwen3.5-0.8B, one per layer | ~6 KB each |
| `vectors/qwen_xlarge_L{13,18,23}_caa.pt` | CAA vectors for Qwen3.5-4B, one per layer | ~12 KB each |
| `vectors/qwen_small_transferred_caa.pt` | Cross-size-transferred vectors (ridge / Procrustes / random baselines) | 27 KB |
| `vectors/transfer_map_large_to_small.pt` | Ridge + Procrustes alignment maps between 2B and 0.8B residual spaces | 24 MB |
| `saes/qwen_large_L14_sae.pt` | Top-K SAE (d_sae=8192, k=32) on Qwen3.5-2B at layer 14 | 129 MB |
| `saes/qwen_small_L14_sae.pt` | Top-K SAE (d_sae=4096, k=32) on Qwen3.5-0.8B at layer 14 | 33 MB |
| `saes/*_features.json` | Top-30 features ranked by steered-vs-base activation delta | <10 KB each |
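
The brace notation in the table is shorthand for one file per layer. As a minimal sketch, the helper below (`expand_layers` is a hypothetical name, not part of this repo) expands such a pattern into the concrete filenames to pass to `hf_hub_download`. It assumes the files follow the naming in the table exactly.

```python
import re


def expand_layers(pattern: str) -> list[str]:
    """Expand a single `{a,b,c}` group in a path pattern into concrete paths."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    options = match.group(1).split(",")
    return [pattern[:match.start()] + opt + pattern[match.end():] for opt in options]


# One path per layer listed in the table row for Qwen3.5-2B.
paths = expand_layers("vectors/qwen_large_L{6,10,14,18,22}_caa.pt")
print(paths)
```

Each resulting string can be used as the filename argument when downloading a single vector.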

## How to use

```python
import torch
from huggingface_hub import hf_hub_download

# Download one CAA vector (Qwen3.5-2B, layer 14) from this repo.
path = hf_hub_download("Rachata/qwen35-mature-anger-steering",
                       "vectors/qwen_large_L14_caa.pt")
caa = torch.load(path, weights_only=True)
v = caa["vector"]  # shape: (2048,)
```

Apply it as a forward hook on `Qwen/Qwen3.5-2B` at `model.model.layers[14]`
with coefficient `c=+1.0`:

```python
# `model` is assumed to be Qwen/Qwen3.5-2B, already loaded; `v` is the vector loaded above.
def hook(m, inp, out):
    # The layer output may be a tuple whose first element is the residual stream.
    resid = out[0] if isinstance(out, tuple) else out
    resid = resid + 1.0 * v.to(resid.device, resid.dtype)  # c = +1.0
    return (resid,) + out[1:] if isinstance(out, tuple) else resid

h = model.model.layers[14].register_forward_hook(hook)
# generate...
h.remove()  # detach the hook so later calls run unsteered
```
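
To check the hook logic without downloading a model, here is a minimal sketch on a toy module. `ToyLayer`, `make_steering_hook`, the hidden size of 8, and the random steering vector are all stand-ins invented for illustration; a real Qwen3.5 decoder layer and the released vectors will differ.

```python
import torch
from torch import nn


class ToyLayer(nn.Module):
    """Stand-in for a decoder layer: returns a tuple like (hidden_states, extra)."""

    def forward(self, x):
        return (x, None)


def make_steering_hook(vector: torch.Tensor, coeff: float = 1.0):
    """Build a forward hook that adds coeff * vector to the layer's output."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        resid = resid + coeff * vector.to(resid.device, resid.dtype)
        if isinstance(output, tuple):
            return (resid,) + output[1:]
        return resid
    return hook


torch.manual_seed(0)
layer = ToyLayer()
v_toy = torch.randn(8)      # hypothetical steering vector
x = torch.zeros(1, 3, 8)    # (batch, tokens, hidden)

handle = layer.register_forward_hook(make_steering_hook(v_toy, coeff=1.0))
try:
    steered = layer(x)[0]
finally:
    handle.remove()         # always detach the hook, even if the forward pass fails
unsteered = layer(x)[0]
```

Wrapping the steered call in `try`/`finally` is one way to make sure the hook never leaks into later, unsteered generations.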

## Key results (DeepSeek-judged, 1-5 rubric)

| Model | Best cell | mature_anger | juvenile_rage | coherence | Margin | PPL |
|-------|-----------|--------------|---------------|-----------|--------|-----|
| 0.8B | L=6 c=+1 | 1.0 | 1.0 | 4.0 | **0.0** | 1.05x |
| 2B | L=14 c=+1 | 3.5 | 1.0 | 5.0 | **+2.5** | 1.13x |
| 4B | L=13 c=+1 | 4.5 | 1.0 | 5.0 | **+3.5** | 1.11x |
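
The Margin column is consistent with `mature_anger - juvenile_rage`. That reading is inferred from the numbers in the table, not stated in this README, so treat the sketch below as an assumption and check `report.md` for the exact definition.

```python
# Rows copied from the results table: (model, mature_anger, juvenile_rage, reported margin).
rows = [
    ("0.8B", 1.0, 1.0, 0.0),
    ("2B", 3.5, 1.0, 2.5),
    ("4B", 4.5, 1.0, 3.5),
]


def margin(mature_anger: float, juvenile_rage: float) -> float:
    """Assumed definition: target-trait score minus off-target score."""
    return mature_anger - juvenile_rage


for name, mature, juvenile, reported in rows:
    print(f"{name}: computed {margin(mature, juvenile):+.1f}, reported {reported:+.1f}")
```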

The 2B SAE contains a dedicated mature-anger feature (id 4617) whose mean
activation rises ~16x under steering (base mean 0.113 -> steered mean 1.789)
and which fires on 100% of tokens when steered. The 0.8B model has no
comparable concentrated feature, and none of the training-free transfer
methods tried (ridge, Procrustes, activation patching, SAE-feature clamping,
multi-layer stacking) elicits the persona; only a system-prompt anchor does.
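
The quantities quoted for feature 4617 (mean activation, steered/base ratio, fraction of tokens on which the feature fires) can be computed from per-token SAE activations. Here is a minimal sketch, assuming "fires" means a strictly positive activation; `feature_stats` is a hypothetical helper, and the per-token values are made-up toy numbers, not data from this study.

```python
def feature_stats(acts: list[float]) -> tuple[float, float]:
    """Return (mean activation, fraction of tokens with activation > 0)."""
    mean = sum(acts) / len(acts)
    firing = sum(a > 0 for a in acts) / len(acts)
    return mean, firing


# Toy per-token activations for one feature (illustrative only).
base_acts = [0.0, 0.2, 0.0, 0.2]
steered_acts = [1.5, 2.0, 1.0, 1.5]

base_mean, base_firing = feature_stats(base_acts)
steered_mean, steered_firing = feature_stats(steered_acts)
ratio = steered_mean / base_mean

# The means reported in this README give the quoted ~16x figure.
reported_ratio = 1.789 / 0.113
```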

The full methodology, every sweep cell, and every judge score are in the [GitHub
repo's `report.md`](https://github.com/Gussyy/qwen35-mature-anger-steering/blob/main/report.md).

## Citation

```bibtex
@misc{qwen35-mature-anger-steering,
  title  = {Qwen3.5 Mature-Anger Steering: A Home-Lab Study of
            Activation Steering and Cross-Size Transfer},
  author = {Rachata},
  year   = {2026},
  url    = {https://huggingface.co/Rachata/qwen35-mature-anger-steering}
}
```