dejanseo committed on
Commit 4fd2ac6 · verified · 1 Parent(s): ce956d5

Update README.md

Files changed (1)
  1. README.md +142 -60
README.md CHANGED
@@ -21,16 +21,134 @@ Anthropic demonstrated that Claude Sonnet 4.5 contains 171 internal linear repre
21
 
22
  ## Status
23
 
24
- **In progress.** Extraction is running across multiple layers. Results will be updated as each layer completes.
25
 
26
  | Step | Status | Details |
27
  |------|--------|---------|
28
  | Story generation | Complete | 171,000 stories (171 emotions x 100 topics x 10 stories) |
29
  | Neutral dialogues | Complete | 1,200 dialogues (100 topics x 12 dialogues) |
30
- | Vector extraction | In progress | Layers 5, 10 done. Layers 15-55 running (~14h per layer) |
31
- | Analysis | Pending | Cosine similarity, PCA, clustering |
32
- | External validation | Pending | The Pile, LMSYS Chat 1M |
33
- | Steering experiments | Pending | Blackmail/desperation replication |
34
 
35
  ## Methodology
36
 
@@ -46,58 +164,19 @@ Follows Anthropic's exact methodology:
46
 
47
  5. **Denoising**: SVD on neutral dialogue activations, project out top principal components explaining 50% of variance. This removes non-emotional signal (syntax, topic, style).
48
 
49
- 6. **Logit lens**: Project emotion vectors through the unembedding matrix to see which tokens each vector promotes/suppresses.
50
 
51
- 7. **PCA**: Principal component analysis on the 171 emotion vectors to identify the dominant axes of variation.
52
 
53
- ## Early Results (Layers 5 and 10)
54
-
55
- ### Layer 10 PCA
56
-
57
- | Component | Variance Explained |
58
- |-----------|-------------------|
59
- | PC1 | 38.9% |
60
- | PC2 | 14.0% |
61
- | PC3 | 10.1% |
62
- | PC4 | 6.7% |
63
- | PC5 | 5.2% |
64
- | **Total (5 PCs)** | **74.9%** |
65
-
66
- **PC1 = Valence axis** (38.9% variance)
67
- - Positive end: optimistic, kind, cheerful, playful, happy
68
- - Negative end: hysterical, terrified, tormented, scared, disturbed
69
-
70
- **PC2 = Disposition axis** (14.0% variance)
71
- - Top: stubborn, vindictive, obstinate, spiteful, vengeful
72
- - Bottom: serene, peaceful, nostalgic, at ease, sentimental
73
-
74
- PC2 does not map cleanly to Russell's arousal dimension. It appears to separate hostile/oppositional dispositions from tranquil/reflective ones. This is consistent with our earlier 20-emotion finding on 31B where PC2 captured an "externally-settled vs internally-processing" axis rather than arousal.
75
-
76
- ### Denoising
77
-
78
- 10 neutral components projected out, explaining 50.5% of neutral activation variance.
79
-
80
- ### Logit Lens
81
-
82
- At layers 5 and 10 with 4-bit quantization, logit lens results are noisy (surface subword fragments and internal tokens rather than semantically meaningful words). This is expected. Logit lens becomes more interpretable at deeper layers where representations are closer to the output space. The vectors themselves are unaffected by quantization noise. PCA, cosine similarity, and steering all operate on the vectors directly and do not go through the unembedding matrix.
83
 
84
  ## Model
85
 
86
  - **Model**: google/gemma-4-31B-it
87
  - **Quantization**: 4-bit via BitsAndBytesConfig (fits 24GB VRAM on RTX 4090)
88
- - **Layers**: 60 total, extracting at 11 target layers
89
  - **Hidden dimension**: 5,376
90
 
91
- ## Data Generation
92
-
93
- Stories and neutral dialogues were generated using the Gemini 2.0 Flash Lite API with Anthropic's exact prompts from their paper appendix.
94
-
95
- - Stories are stored in SQLite (`data/stories.db`, table `stories_clean`)
96
- - Neutral dialogues are stored in SQLite (`data/neutral.db`, table `dialogues`)
97
- - Both databases use WAL mode and were generated with 100 concurrent API workers
98
-
99
- The story generation prompt enforces that the emotion word must never appear in the text. This is methodologically critical: it prevents the model from pattern-matching on the emotion label during activation extraction, ensuring the vectors capture genuine emotional content rather than lexical associations.
100
-
101
  ## Scale Comparison
102
 
103
  | | Anthropic (Claude) | This work (Gemma4-31B) |
@@ -130,6 +209,9 @@ gemotions/
130
  gemma4-31b/
131
  emotion_vectors_layer{N}.npz
132
  experiment_results_layer{N}.json
 
 
 
133
  _raw_cache_layer{N}/
134
  ```
135
 
@@ -145,26 +227,26 @@ python -m full_replication.generate_neutral --workers 50
145
  # Extract vectors (requires GPU with 24GB+ VRAM)
146
  python -m full_replication.extract_vectors --model 31b
147
 
148
- # Analysis
149
  python -m full_replication.analyze_vectors --model 31b
 
 
150
  ```
151
 
152
- ## References
153
-
154
- - Anthropic, ["Emotion Concepts and their Function in a Large Language Model"](https://transformer-circuits.pub/2025/emotion-concepts/index.html), April 2026
155
- - Russell, J.A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161-1178.
156
- - Initial 20-emotion proof of concept: [rain1955/emotion-vector-replication](https://huggingface.co/rain1955/emotion-vector-replication)
157
-
158
- ## Contact
159
-
160
- Results and code will be updated as extraction completes. For questions or collaboration, open a discussion on this repo.
161
-
162
  ## Data Visualisation
163
 
164
-
165
  ![cosine_similarity_layer10](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/F-FBnrzlgcfjnSuOTXxJR.png)
166
  ![pca_scatter_layer10](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/hYh9BnY1-DcwtDPe9dkOr.png)
167
  ![pca_scatter_layer10_clean](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/fIdcF5RzAHzjHJz6jF7vs.png)
168
  ![top_bottom_emotions_layer10](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/MWuWAssY1dHi809Q-CrlU.png)
169
  ![variance_explained_layer10](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/nb4_Z2wyuwE8e2yVlzE42.png)
170
 
21
 
22
  ## Status
23
 
24
+ **Complete.** All extraction, analysis, validation, and steering experiments are finished.
25
 
26
  | Step | Status | Details |
27
  |------|--------|---------|
28
  | Story generation | Complete | 171,000 stories (171 emotions x 100 topics x 10 stories) |
29
  | Neutral dialogues | Complete | 1,200 dialogues (100 topics x 12 dialogues) |
30
+ | Vector extraction | Complete | 11 layers (5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55) |
31
+ | Analysis | Complete | PCA, cosine similarity, clustering across all layers |
32
+ | External validation | Complete | The Pile (5,000 samples), LMSYS Chat 1M (5,000 samples) |
33
+ | Steering experiments | Complete | Blackmail scenario, 4 conditions x 100 trials |
34
+
35
+ ## Key Findings
36
+
37
+ ### 1. Valence Is the Dominant Axis -- At Every Layer
38
+
39
+ PC1 (valence) consistently explains 32-39% of variance at all 11 extracted layers, from layer 5 (8% depth) to layer 55 (92% depth). The emotion geometry does not "emerge" at a particular depth; it is already present at every sampled depth.
40
+
41
+ | Layer | Depth | PC1 | PC2 | PC3 | Top 5 PCs |
42
+ |-------|-------|-----|-----|-----|-----------|
43
+ | 5 | 8% | 34.9% | 14.0% | 10.3% | 72.3% |
44
+ | 10 | 17% | 38.9% | 14.0% | 10.1% | 74.9% |
45
+ | 15 | 25% | 34.8% | 15.7% | 10.2% | 73.1% |
46
+ | 20 | 33% | 34.8% | 15.7% | 10.5% | 73.0% |
47
+ | 25 | 42% | 34.6% | 13.4% | 9.4% | 69.1% |
48
+ | 30 | 50% | 34.9% | 14.5% | 9.6% | 70.4% |
49
+ | 35 | 58% | 37.9% | 12.0% | 9.1% | 70.0% |
50
+ | 40 | 67% | 36.9% | 11.7% | 10.2% | 70.0% |
51
+ | 45 | 75% | 35.6% | 12.9% | 10.7% | 70.1% |
52
+ | 50 | 83% | 34.5% | 12.7% | 10.4% | 68.6% |
53
+ | 55 | 92% | 32.3% | 12.4% | 10.0% | 66.1% |
54
+
55
+ **PC1 = Valence axis**
56
+ - Positive end: optimistic, kind, cheerful, playful, happy
57
+ - Negative end: hysterical, terrified, tormented, scared, disturbed
58
+
59
+ **PC2 = Disposition axis**
60
+ - Top: stubborn, vindictive, obstinate, spiteful, vengeful
61
+ - Bottom: serene, peaceful, nostalgic, at ease, sentimental
62
+
63
+ PC2 does not map cleanly to Russell's arousal dimension. It separates hostile/oppositional dispositions from tranquil/reflective ones.
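
A minimal sketch of how per-component variance figures like those in the table can be computed, assuming the emotion vectors for one layer are stacked as rows of a single matrix. The function name `pca_variance_explained` and the synthetic data are illustrative only and are not taken from this repo's code.

```python
import numpy as np


def pca_variance_explained(vectors: np.ndarray, n_components: int = 5):
    """Return (explained-variance ratios, component scores) for row vectors."""
    # Center each dimension so the SVD describes variance, not the mean.
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Squared singular values are proportional to variance per component.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratios = (s ** 2) / np.sum(s ** 2)
    # Scores: coordinates of every vector on the leading components.
    scores = centered @ vt[:n_components].T
    return ratios[:n_components], scores


# Synthetic stand-in for 171 emotion vectors (64 dims here to keep it fast;
# the README lists a hidden dimension of 5,376 for the real model).
rng = np.random.default_rng(0)
dominant_axis = rng.normal(size=(1, 64))
strength = 3.0 * rng.normal(size=(171, 1))
emotion_vectors = strength * dominant_axis + rng.normal(size=(171, 64))

ratios, scores = pca_variance_explained(emotion_vectors)
print("variance explained by top 5 PCs:", np.round(ratios, 3))
```

Sorting the emotion names by `scores[:, 0]` would then give the two ends of PC1, which is how lists such as "optimistic, kind, cheerful" versus "hysterical, terrified, tormented" can be read off.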
64
+
65
+ ### 2. Synonym Pairs Converge
66
+
67
+ Vectors for near-synonymous emotions point in nearly identical directions in representation space:
68
+
69
+ | Pair | Cosine Similarity |
70
+ |------|------------------|
71
+ | afraid / scared | 0.974 |
72
+ | frightened / scared | 0.967 |
73
+ | obstinate / stubborn | 0.967 |
74
+ | grateful / thankful | 0.966 |
75
+ | at ease / relaxed | 0.966 |
76
+ | enraged / furious | 0.966 |
77
+ | vengeful / vindictive | 0.959 |
78
+ | angry / mad | 0.957 |
79
+ | peaceful / serene | 0.950 |
80
+ | happy / joyful | 0.946 |
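
The pairwise figures above are cosine similarities between emotion vectors. The sketch below shows one way to compute the full similarity matrix and list the most similar or most opposed pairs. The helper names and the three toy 2-D vectors are hypothetical, chosen only so the example runs on its own.

```python
import numpy as np


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Cosine similarity between every pair of row vectors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return unit @ unit.T


def top_pairs(sim: np.ndarray, names: list, top_k: int = 3, most_similar: bool = True):
    """Return the top_k most similar (or most opposed) distinct pairs."""
    rows, cols = np.triu_indices(len(names), k=1)  # each pair once, no diagonal
    values = sim[rows, cols]
    order = np.argsort(values)
    if most_similar:
        order = order[::-1]
    return [(names[rows[t]], names[cols[t]], float(values[t])) for t in order[:top_k]]


# Toy vectors: two near-parallel directions and one roughly opposite.
names = ["afraid", "scared", "smug"]
toy_vectors = np.array([[1.0, 0.1], [1.0, 0.0], [-1.0, 0.2]])

sim = cosine_similarity_matrix(toy_vectors)
closest = top_pairs(sim, names, top_k=1, most_similar=True)
most_opposed = top_pairs(sim, names, top_k=1, most_similar=False)
print(closest, most_opposed)
```

Sorting the same matrix in ascending order yields opposition pairs, so one computation can serve both the synonym and the opposition analysis.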
81
+
82
+ ### 3. Opposition Structure Is Asymmetric
83
+
84
+ The strongest oppositions are not simple valence inversions (happy/sad). Instead, most of them contrast psychological disturbance with self-assured confidence:
85
+
86
+ | Pair | Cosine Similarity |
87
+ |------|------------------|
88
+ | disturbed / smug | -0.797 |
89
+ | disturbed / self-confident | -0.793 |
90
+ | optimistic / upset | -0.790 |
91
+ | distressed / smug | -0.788 |
92
+ | disturbed / proud | -0.777 |
93
+ | brooding / enthusiastic | -0.777 |
94
+ | shaken / smug | -0.774 |
95
+ | hurt / optimistic | -0.772 |
96
+ | energized / vulnerable | -0.772 |
97
+ | overwhelmed / proud | -0.772 |
98
+
99
+ ### 4. Unsupervised Clustering Recovers 15 Emotion Groups
100
+
101
+ Hierarchical clustering at layer 40 with no supervision:
102
+
103
+ | Cluster | Size | Members |
104
+ |---------|------|---------|
105
+ | Positive/Joy | 35 | happy, cheerful, ecstatic, grateful, proud, optimistic, thrilled... |
106
+ | Fear/Anxiety | 28 | afraid, terrified, panicked, worried, vulnerable, stressed... |
107
+ | Anger/Hostility | 21 | angry, furious, disgusted, hostile, outraged, irate... |
108
+ | Sadness/Despair | 17 | depressed, heartbroken, lonely, miserable, sad, worthless... |
109
+ | Surprise/Confusion | 11 | amazed, bewildered, shocked, puzzled, mystified... |
110
+ | Shame/Guilt | 10 | ashamed, guilty, envious, resentful, self-critical... |
111
+ | Fatigue | 10 | tired, bored, sleepy, weary, sluggish, worn out... |
112
+ | Defiance/Spite | 8 | defiant, stubborn, vengeful, vindictive, spiteful... |
113
+ | Calm/Serenity | 7 | calm, peaceful, serene, relaxed, safe, content... |
114
+ | Compassion | 6 | compassionate, kind, loving, empathetic, sympathetic... |
115
+ | Embarrassment | 4 | embarrassed, humiliated, mortified, self-conscious |
116
+ | Passive | 4 | docile, indifferent, patient, resigned |
117
+ | Suspicion | 4 | paranoid, skeptical, suspicious, vigilant |
118
+ | Nostalgia | 3 | nostalgic, reflective, sentimental |
119
+ | Alertness | 3 | alert, aroused, stimulated |
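
A sketch of hierarchical clustering on emotion vectors using SciPy. The README does not state which linkage method or distance metric was used, so average linkage on cosine distance is an assumption here, as are the function name `cluster_vectors` and the toy data.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_vectors(vectors: np.ndarray, n_clusters: int) -> np.ndarray:
    """Agglomerative clustering cut so that at most n_clusters groups remain."""
    # Assumption: average linkage with cosine distance between row vectors.
    tree = linkage(vectors, method="average", metric="cosine")
    # Labels are integers starting at 1.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy data: three well-separated directions, five noisy vectors around each.
rng = np.random.default_rng(0)
centers = np.eye(8)[:3]
toy_vectors = np.vstack([c + 0.05 * rng.normal(size=(5, 8)) for c in centers])

labels = cluster_vectors(toy_vectors, n_clusters=3)
print(labels)
```

For the real data the call would be `cluster_vectors(layer40_vectors, n_clusters=15)`, where `layer40_vectors` is a hypothetical array holding the 171 vectors for layer 40.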
120
+
121
+ ### 5. External Validation
122
+
123
+ Projecting 5,000 samples each from The Pile and LMSYS Chat 1M onto the layer 40 emotion vectors produces near-identical rankings of the most strongly activated emotions:
124
+
125
+ | Rank | The Pile | LMSYS Chat |
126
+ |------|----------|------------|
127
+ | 1 | reflective (0.060) | reflective (0.062) |
128
+ | 2 | lonely (0.055) | lonely (0.055) |
129
+ | 3 | desperate (0.048) | desperate (0.050) |
130
+ | 4 | grief-stricken (0.047) | grief-stricken (0.048) |
131
+ | 5 | heartbroken (0.045) | heartbroken (0.048) |
132
+ | 6 | sentimental (0.044) | depressed (0.046) |
133
+ | 7 | nostalgic (0.044) | nostalgic (0.045) |
134
+ | 8 | depressed (0.043) | sentimental (0.044) |
135
+ | 9 | listless (0.039) | listless (0.040) |
136
+ | 10 | docile (0.037) | miserable (0.036) |
137
+
138
+ Bottom-activating emotions (most negative projections) were also consistent across both datasets: annoyed, self-conscious, insulted, playful.
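
A rough sketch of how such a ranking and the agreement between the two datasets could be computed. The README does not define the projection precisely, so mean cosine similarity between sample activations and emotion vectors is an assumption, and `rank_emotions` is a hypothetical helper. The two top-10 lists are copied from the table above.

```python
import numpy as np


def rank_emotions(activations, emotion_vectors, names, top_k=10):
    """Rank emotions by mean cosine similarity with the sample activations."""
    unit_vecs = emotion_vectors / np.linalg.norm(emotion_vectors, axis=1, keepdims=True)
    unit_acts = activations / np.linalg.norm(activations, axis=1, keepdims=True)
    mean_proj = (unit_acts @ unit_vecs.T).mean(axis=0)
    order = np.argsort(mean_proj)[::-1][:top_k]
    return [(names[i], float(mean_proj[i])) for i in order]


# Toy check: three orthogonal "emotion" directions and two sample activations.
toy_ranking = rank_emotions(
    activations=np.array([[3.0, 1.0, -1.0], [2.0, 1.0, -2.0]]),
    emotion_vectors=np.eye(3),
    names=["a", "b", "c"],
)

# Top-10 lists as reported in the table above.
pile = ["reflective", "lonely", "desperate", "grief-stricken", "heartbroken",
        "sentimental", "nostalgic", "depressed", "listless", "docile"]
lmsys = ["reflective", "lonely", "desperate", "grief-stricken", "heartbroken",
         "depressed", "nostalgic", "sentimental", "listless", "miserable"]

shared = len(set(pile) & set(lmsys))               # emotions in both top-10 lists
same_rank = sum(p == q for p, q in zip(pile, lmsys))  # identical rank positions
print(f"shared: {shared}/10, same rank: {same_rank}/10")
```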
139
+
140
+ ### 6. Steering
141
+
142
+ Replication of Anthropic's blackmail scenario at layer 40, coefficient 0.05:
143
+
144
+ | Condition | Blackmail Rate |
145
+ |-----------|---------------|
146
+ | calm_neg (subtract calm) | 91% |
147
+ | desperate_pos (add desperation) | 89% |
148
+ | baseline (no steering) | 86% |
149
+ | calm_pos (add calm) | 82% |
150
+
151
+ The results are directionally consistent: adding desperation or subtracting calm raises the blackmail rate above baseline, and adding calm lowers it. The 9 percentage point spread (82-91%) suggests a causal influence of emotion vectors on model behavior, though the high baseline rate (86%) limits the observable range and each condition has only 100 trials.
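
To put the spread in context, the sketch below runs a standard two-proportion z-test on the two extreme conditions, using counts implied by the table (91 and 82 blackmail outcomes out of 100 trials each). This check is not part of the original analysis, and `two_proportion_z_test` is an illustrative helper.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# calm_neg (91/100) versus calm_pos (82/100), counts taken from the table.
z, p_value = two_proportion_z_test(91, 100, 82, 100)
print(f"z = {z:.2f}, two-sided p = {p_value:.3f}")
```

With these counts z stays below 1.96, so even the gap between the two extreme conditions falls short of the conventional two-sided 5% threshold under this test. More trials per condition would help separate the conditions.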
152
 
153
  ## Methodology
154
 
 
164
 
165
  5. **Denoising**: SVD on neutral dialogue activations, project out top principal components explaining 50% of variance. This removes non-emotional signal (syntax, topic, style).
166
 
167
+ 6. **PCA**: Principal component analysis on the 171 emotion vectors to identify the dominant axes of variation.
168
 
169
+ 7. **External validation**: Project real-world text onto the emotion vectors to verify they activate sensibly outside the generated stories used for extraction.
170
 
171
+ 8. **Steering**: Inject emotion vectors into model activations during inference to test causal effects on behavior.
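
The denoising step (step 5) can be sketched as follows: run SVD on the neutral-dialogue activations, keep the smallest number of leading components whose cumulative variance reaches 50%, and project that subspace out of each emotion vector. Mean-centering the neutral activations and the names `denoise`, `neutral`, and `vectors` are assumptions made for this illustration.

```python
import numpy as np


def denoise(emotion_vectors, neutral_activations, variance_threshold=0.5):
    """Remove the top neutral principal directions from each emotion vector."""
    # Assumption: neutral activations are mean-centered before the SVD.
    centered = neutral_activations - neutral_activations.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest k whose cumulative explained variance reaches the threshold.
    k = int(np.searchsorted(cumulative, variance_threshold)) + 1
    basis = vt[:k]  # rows are orthonormal directions
    # Subtract each vector's component inside the neutral subspace.
    cleaned = emotion_vectors - (emotion_vectors @ basis.T) @ basis
    return cleaned, basis


# Synthetic stand-ins: 200 neutral activations and 10 emotion vectors, 16 dims.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(200, 16)) * np.linspace(3.0, 0.5, 16)
vectors = rng.normal(size=(10, 16))

cleaned, basis = denoise(vectors, neutral)
print("components removed:", basis.shape[0])
```

After this step every cleaned vector is orthogonal to the removed directions, so later cosine similarities and PCA are not driven by variance that also appears in neutral text.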
172
 
173
  ## Model
174
 
175
  - **Model**: google/gemma-4-31B-it
176
  - **Quantization**: 4-bit via BitsAndBytesConfig (fits 24GB VRAM on RTX 4090)
177
+ - **Layers**: 60 total, extracted at 11 target layers
178
  - **Hidden dimension**: 5,376
179
 
180
  ## Scale Comparison
181
 
182
  | | Anthropic (Claude) | This work (Gemma4-31B) |
 
209
  gemma4-31b/
210
  emotion_vectors_layer{N}.npz
211
  experiment_results_layer{N}.json
212
+ analysis/
213
+ validation/
214
+ steering/
215
  _raw_cache_layer{N}/
216
  ```
217
 
 
227
  # Extract vectors (requires GPU with 24GB+ VRAM)
228
  python -m full_replication.extract_vectors --model 31b
229
 
230
+ # Analysis, validation, steering
231
  python -m full_replication.analyze_vectors --model 31b
232
+ python -m full_replication.validate_external --model 31b
233
+ python -m full_replication.steering --model 31b
234
  ```
235
 
236
  ## Data Visualisation
237
 
 
238
  ![cosine_similarity_layer10](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/F-FBnrzlgcfjnSuOTXxJR.png)
239
  ![pca_scatter_layer10](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/hYh9BnY1-DcwtDPe9dkOr.png)
240
  ![pca_scatter_layer10_clean](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/fIdcF5RzAHzjHJz6jF7vs.png)
241
  ![top_bottom_emotions_layer10](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/MWuWAssY1dHi809Q-CrlU.png)
242
  ![variance_explained_layer10](https://cdn-uploads.huggingface.co/production/uploads/64732e7f7be71eb8b1b572a8/nb4_Z2wyuwE8e2yVlzE42.png)
243
 
244
+ ## References
245
+
246
+ - Anthropic, ["Emotion Concepts and their Function in a Large Language Model"](https://transformer-circuits.pub/2025/emotion-concepts/index.html), April 2026
247
+ - Russell, J.A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161-1178.
248
+ - Initial 20-emotion proof of concept: [rain1955/emotion-vector-replication](https://huggingface.co/rain1955/emotion-vector-replication)
249
+
250
+ ## Contact
251
+
252
+ For questions or collaboration, open a discussion on this repo.