Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,92 @@
---
license: apache-2.0
tags:
- activation-steering
- sparse-autoencoder
- contrastive-activation-addition
- qwen
- interpretability
library_name: pytorch
base_model:
- Qwen/Qwen3.5-0.8B
- Qwen/Qwen3.5-2B
- Qwen/Qwen3.5-4B
---

# Qwen3.5 Mature-Anger Steering Artifacts

Steering vectors, sparse autoencoders, and cross-size transfer maps from
a home-lab study of activation steering on Qwen3.5 models.

See the [code + full report on GitHub](https://github.com/Gussyy/qwen35-mature-anger-steering)
and the paired dataset repo
[`Rachata/qwen35-mature-anger-data`](https://huggingface.co/datasets/Rachata/qwen35-mature-anger-data).

## What's in here

| Path | Contents | Size |
|---|---|---|
| `vectors/qwen_large_L{6,10,14,18,22}_caa.pt` | Contrastive Activation Addition vectors for Qwen3.5-2B, per layer | ~10 KB each |
| `vectors/qwen_small_L{6,10,14,18,22}_caa.pt` | CAA vectors for Qwen3.5-0.8B | ~6 KB each |
| `vectors/qwen_xlarge_L{13,18,23}_caa.pt` | CAA vectors for Qwen3.5-4B | ~12 KB each |
| `vectors/qwen_small_transferred_caa.pt` | Cross-size-transferred vectors (ridge / Procrustes / random baselines) | 27 KB |
| `vectors/transfer_map_large_to_small.pt` | Ridge + Procrustes alignment maps between 2B and 0.8B residual spaces | 24 MB |
| `saes/qwen_large_L14_sae.pt` | Top-K SAE (d_sae=8192, k=32) on Qwen3.5-2B at layer 14 | 129 MB |
| `saes/qwen_small_L14_sae.pt` | Top-K SAE (d_sae=4096, k=32) on Qwen3.5-0.8B at layer 14 | 33 MB |
| `saes/*_features.json` | Top-30 features ranked by steered-vs-base activation delta | <10 KB each |

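The `L{6,10,14,18,22}` notation in the table is shorthand for one file per layer. A minimal sketch for turning such a pattern into concrete repo paths follows; the helper name `expand_layers` is hypothetical and is not part of this repo or of `huggingface_hub`.

```python
import re

def expand_layers(pattern: str) -> list[str]:
    """Expand a single `{a,b,c}` group in a path pattern into concrete paths."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]  # no brace group, nothing to expand
    head, tail = pattern[:match.start()], pattern[match.end():]
    return [f"{head}{item.strip()}{tail}" for item in match.group(1).split(",")]

paths = expand_layers("vectors/qwen_xlarge_L{13,18,23}_caa.pt")
print(paths)
```

Each resulting path can then be used as the filename argument of `hf_hub_download`, one download per layer.
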
## How to use

```python
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download("Rachata/qwen35-mature-anger-steering",
                       "vectors/qwen_large_L14_caa.pt")
caa = torch.load(path, weights_only=True)
v = caa["vector"]  # shape: (2048,)
```

Apply as a forward hook on `Qwen/Qwen3.5-2B` at `model.model.layers[14]`
with coefficient `c=+1.0`:

```python
# `model` is assumed to be an already-loaded Qwen/Qwen3.5-2B instance.
def hook(m, inp, out):
    # The layer may return a tuple; the residual stream is element 0.
    resid = out[0] if isinstance(out, tuple) else out
    # Add the steering vector (coefficient c = +1.0) at every position.
    resid = resid + 1.0 * v.to(resid.device, resid.dtype)
    return (resid,) + out[1:] if isinstance(out, tuple) else resid

h = model.model.layers[14].register_forward_hook(hook)
# generate...
h.remove()
```

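To check the hook logic without downloading a model, the same pattern can be exercised on a toy module. This is only a sketch under assumptions: `make_steering_hook` is a hypothetical helper, `nn.Identity` stands in for a decoder layer, and the random `v` stands in for a real CAA vector.

```python
import torch
from torch import nn

def make_steering_hook(v: torch.Tensor, c: float = 1.0):
    """Return a forward hook that adds c * v to a layer's output."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        resid = resid + c * v.to(resid.device, resid.dtype)
        # A value returned from a forward hook replaces the layer output.
        return (resid,) + output[1:] if isinstance(output, tuple) else resid
    return hook

torch.manual_seed(0)
layer = nn.Identity()        # toy stand-in for model.model.layers[14]
v = torch.randn(8)           # toy stand-in for caa["vector"]
x = torch.randn(2, 3, 8)     # (batch, seq, hidden)

handle = layer.register_forward_hook(make_steering_hook(v, c=1.0))
try:
    steered = layer(x)       # hook active: output is x + v
finally:
    handle.remove()          # detach the hook even if the call fails

unsteered = layer(x)         # hook removed: output is x again
```

Wrapping the steered call in `try`/`finally` keeps a failed generation from leaving the hook attached to the model.
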
## Key results (DeepSeek-judged, 1--5 rubric)

| Model | Best cell | mature_anger | juvenile_rage | coherence | Margin | PPL |
|-------|-----------|--------------|---------------|-----------|--------|-----|
| 0.8B | L=6 c=+1 | 1.0 | 1.0 | 4.0 | **0.0** | 1.05x |
| 2B | L=14 c=+1 | 3.5 | 1.0 | 5.0 | **+2.5** | 1.13x |
| 4B | L=13 c=+1 | 4.5 | 1.0 | 5.0 | **+3.5** | 1.11x |

The 2B SAE contains a dedicated mature-anger feature (id 4617) whose
mean activation rises ~16x under steering (base mean 0.113 -> steered
mean 1.789) and which fires on 100% of tokens while steered. The 0.8B
model has no comparably concentrated feature, and no training-free
transfer method (ridge, Procrustes, activation patching, SAE-feature
clamping, multi-layer stacking) successfully elicits the persona; only a
system-prompt anchor does.

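The arithmetic behind these numbers can be checked directly. The sketch below assumes that Margin is `mature_anger - juvenile_rage` (an inference that matches all three rows, not a definition stated here) and that the ~16x figure is the ratio of steered to base mean activation.

```python
# Scores copied from the results table: (mature_anger, juvenile_rage).
rows = {"0.8B": (1.0, 1.0), "2B": (3.5, 1.0), "4B": (4.5, 1.0)}

# Assumed definition: margin = mature_anger - juvenile_rage.
margins = {name: mature - juvenile for name, (mature, juvenile) in rows.items()}

# Feature 4617: steered mean activation divided by base mean activation.
spike_ratio = 1.789 / 0.113

print(margins)
print(round(spike_ratio, 1))
```
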
Full methodology, every sweep cell, and every judge score are in the [GitHub
repo's `report.md`](https://github.com/Gussyy/qwen35-mature-anger-steering/blob/main/report.md).

## Citation

```bibtex
@misc{qwen35-mature-anger-steering,
  title  = {Qwen3.5 Mature-Anger Steering: A Home-Lab Study of
            Activation Steering and Cross-Size Transfer},
  author = {Rachata},
  year   = {2026},
  url    = {https://huggingface.co/Rachata/qwen35-mature-anger-steering}
}
```