Update README.md
README.md
|
@@ -21,16 +21,134 @@ Anthropic demonstrated that Claude Sonnet 4.5 contains 171 internal linear repre
|
|
| 21 |
|
| 22 |
## Status
|
| 23 |
|
| 24 |
-
**
|
| 25 |
|
| 26 |
| Step | Status | Details |
|
| 27 |
|------|--------|---------|
|
| 28 |
| Story generation | Complete | 171,000 stories (171 emotions x 100 topics x 10 stories) |
|
| 29 |
| Neutral dialogues | Complete | 1,200 dialogues (100 topics x 12 dialogues) |
|
| 30 |
-
| Vector extraction |
|
| 31 |
-
| Analysis |
|
| 32 |
-
| External validation |
|
| 33 |
-
| Steering experiments |
|
| 34 |
|
| 35 |
## Methodology
|
| 36 |
|
|
@@ -46,58 +164,19 @@ Follows Anthropic's exact methodology:
|
|
| 46 |
|
| 47 |
5. **Denoising**: SVD on neutral dialogue activations, project out top principal components explaining 50% of variance. This removes non-emotional signal (syntax, topic, style).
|
| 48 |
|
| 49 |
-
6. **
|
| 50 |
|
| 51 |
-
7. **
|
| 52 |
|
| 53 |
-
|
| 54 |
-
|
| 55 |
-
### Layer 10 PCA
|
| 56 |
-
|
| 57 |
-
| Component | Variance Explained |
|
| 58 |
-
|-----------|-------------------|
|
| 59 |
-
| PC1 | 38.9% |
|
| 60 |
-
| PC2 | 14.0% |
|
| 61 |
-
| PC3 | 10.1% |
|
| 62 |
-
| PC4 | 6.7% |
|
| 63 |
-
| PC5 | 5.2% |
|
| 64 |
-
| **Total (5 PCs)** | **74.9%** |
|
| 65 |
-
|
| 66 |
-
**PC1 = Valence axis** (38.9% variance)
|
| 67 |
-
- Positive end: optimistic, kind, cheerful, playful, happy
|
| 68 |
-
- Negative end: hysterical, terrified, tormented, scared, disturbed
|
| 69 |
-
|
| 70 |
-
**PC2 = Disposition axis** (14.0% variance)
|
| 71 |
-
- Top: stubborn, vindictive, obstinate, spiteful, vengeful
|
| 72 |
-
- Bottom: serene, peaceful, nostalgic, at ease, sentimental
|
| 73 |
-
|
| 74 |
-
PC2 does not map cleanly to Russell's arousal dimension. It appears to separate hostile/oppositional dispositions from tranquil/reflective ones. This is consistent with our earlier 20-emotion finding on 31B where PC2 captured an "externally-settled vs internally-processing" axis rather than arousal.
|
| 75 |
-
|
| 76 |
-
### Denoising
|
| 77 |
-
|
| 78 |
-
10 neutral components projected out, explaining 50.5% of neutral activation variance.
|
| 79 |
-
|
| 80 |
-
### Logit Lens
|
| 81 |
-
|
| 82 |
-
At layers 5 and 10 with 4-bit quantization, logit lens results are noisy (surface subword fragments and internal tokens rather than semantically meaningful words). This is expected. Logit lens becomes more interpretable at deeper layers where representations are closer to the output space. The vectors themselves are unaffected by quantization noise. PCA, cosine similarity, and steering all operate on the vectors directly and do not go through the unembedding matrix.
|
| 83 |
|
| 84 |
## Model
|
| 85 |
|
| 86 |
- **Model**: google/gemma-4-31B-it
|
| 87 |
- **Quantization**: 4-bit via BitsAndBytesConfig (fits 24GB VRAM on RTX 4090)
|
| 88 |
-
- **Layers**: 60 total,
|
| 89 |
- **Hidden dimension**: 5,376
|
| 90 |
|
| 91 |
-
## Data Generation
|
| 92 |
-
|
| 93 |
-
Stories and neutral dialogues were generated using the Gemini 2.0 Flash Lite API with Anthropic's exact prompts from their paper appendix.
|
| 94 |
-
|
| 95 |
-
- Stories are stored in SQLite (`data/stories.db`, table `stories_clean`)
|
| 96 |
-
- Neutral dialogues are stored in SQLite (`data/neutral.db`, table `dialogues`)
|
| 97 |
-
- Both databases use WAL mode and were generated with 100 concurrent API workers
|
| 98 |
-
|
| 99 |
-
The story generation prompt enforces that the emotion word must never appear in the text. This is methodologically critical: it prevents the model from pattern-matching on the emotion label during activation extraction, ensuring the vectors capture genuine emotional content rather than lexical associations.
|
| 100 |
-
|
| 101 |
## Scale Comparison
|
| 102 |
|
| 103 |
| | Anthropic (Claude) | This work (Gemma4-31B) |
|
|
@@ -130,6 +209,9 @@ gemotions/
|
|
| 130 |
gemma4-31b/
|
| 131 |
emotion_vectors_layer{N}.npz
|
| 132 |
experiment_results_layer{N}.json
|
| 133 |
_raw_cache_layer{N}/
|
| 134 |
```
|
| 135 |
|
|
@@ -145,26 +227,26 @@ python -m full_replication.generate_neutral --workers 50
|
|
| 145 |
# Extract vectors (requires GPU with 24GB+ VRAM)
|
| 146 |
python -m full_replication.extract_vectors --model 31b
|
| 147 |
|
| 148 |
-
# Analysis
|
| 149 |
python -m full_replication.analyze_vectors --model 31b
|
| 150 |
```
|
| 151 |
|
| 152 |
-
## References
|
| 153 |
-
|
| 154 |
-
- Anthropic, ["Emotion Concepts and their Function in a Large Language Model"](https://transformer-circuits.pub/2025/emotion-concepts/index.html), April 2026
|
| 155 |
-
- Russell, J.A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161-1178.
|
| 156 |
-
- Initial 20-emotion proof of concept: [rain1955/emotion-vector-replication](https://huggingface.co/rain1955/emotion-vector-replication)
|
| 157 |
-
|
| 158 |
-
## Contact
|
| 159 |
-
|
| 160 |
-
Results and code will be updated as extraction completes. For questions or collaboration, open a discussion on this repo.
|
| 161 |
-
|
| 162 |
## Data Visualisation
|
| 163 |
|
| 164 |
-
|
| 165 |

|
| 166 |

|
| 167 |

|
| 168 |

|
| 169 |

|
| 170 |
|
| 21 |
|
| 22 |
## Status
|
| 23 |
|
| 24 |
+
**Complete.** All extraction, analysis, validation, and steering experiments are finished.
|
| 25 |
|
| 26 |
| Step | Status | Details |
|
| 27 |
|------|--------|---------|
|
| 28 |
| Story generation | Complete | 171,000 stories (171 emotions x 100 topics x 10 stories) |
|
| 29 |
| Neutral dialogues | Complete | 1,200 dialogues (100 topics x 12 dialogues) |
|
| 30 |
+
| Vector extraction | Complete | 11 layers (5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55) |
|
| 31 |
+
| Analysis | Complete | PCA, cosine similarity, clustering across all layers |
|
| 32 |
+
| External validation | Complete | The Pile (5,000 samples), LMSYS Chat 1M (5,000 samples) |
|
| 33 |
+
| Steering experiments | Complete | Blackmail scenario, 4 conditions x 100 trials |
|
| 34 |
+
|
| 35 |
+
## Key Findings
|
| 36 |
+
|
| 37 |
+
### 1. Valence Is the Dominant Axis -- At Every Layer
|
| 38 |
+
|
| 39 |
+
PC1 (valence) consistently explains 32-39% of variance across all 11 layers, from layer 5 (8% depth) to layer 55 (92% depth). The emotion geometry does not "emerge" at a particular depth -- it is present throughout the entire network.
|
| 40 |
+
|
| 41 |
+
| Layer | Depth | PC1 | PC2 | PC3 | Top 5 PCs |
|
| 42 |
+
|-------|-------|-----|-----|-----|-----------|
|
| 43 |
+
| 5 | 8% | 34.9% | 14.0% | 10.3% | 72.3% |
|
| 44 |
+
| 10 | 17% | 38.9% | 14.0% | 10.1% | 74.9% |
|
| 45 |
+
| 15 | 25% | 34.8% | 15.7% | 10.2% | 73.1% |
|
| 46 |
+
| 20 | 33% | 34.8% | 15.7% | 10.5% | 73.0% |
|
| 47 |
+
| 25 | 42% | 34.6% | 13.4% | 9.4% | 69.1% |
|
| 48 |
+
| 30 | 50% | 34.9% | 14.5% | 9.6% | 70.4% |
|
| 49 |
+
| 35 | 58% | 37.9% | 12.0% | 9.1% | 70.0% |
|
| 50 |
+
| 40 | 67% | 36.9% | 11.7% | 10.2% | 70.0% |
|
| 51 |
+
| 45 | 75% | 35.6% | 12.9% | 10.7% | 70.1% |
|
| 52 |
+
| 50 | 83% | 34.5% | 12.7% | 10.4% | 68.6% |
|
| 53 |
+
| 55 | 92% | 32.3% | 12.4% | 10.0% | 66.1% |
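
The Depth column is the layer index expressed as a fraction of the model's 60 layers. A minimal check of that arithmetic, with illustrative variable names that are not taken from the repository code:

```python
# Depth = layer index / total layers, shown as a rounded percentage.
TOTAL_LAYERS = 60  # layer count stated in the Model section
TARGET_LAYERS = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55]

depths = {layer: round(100 * layer / TOTAL_LAYERS) for layer in TARGET_LAYERS}

for layer, pct in depths.items():
    print(f"layer {layer:2d}: {pct}% depth")
```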
|
| 54 |
+
|
| 55 |
+
**PC1 = Valence axis**
|
| 56 |
+
- Positive end: optimistic, kind, cheerful, playful, happy
|
| 57 |
+
- Negative end: hysterical, terrified, tormented, scared, disturbed
|
| 58 |
+
|
| 59 |
+
**PC2 = Disposition axis**
|
| 60 |
+
- Top: stubborn, vindictive, obstinate, spiteful, vengeful
|
| 61 |
+
- Bottom: serene, peaceful, nostalgic, at ease, sentimental
|
| 62 |
+
|
| 63 |
+
PC2 does not map cleanly to Russell's arousal dimension. It separates hostile/oppositional dispositions from tranquil/reflective ones.
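
The per-component percentages above are explained-variance ratios. As a rough sketch of how such ratios can be computed, assuming the emotion vectors are stacked into a `(171, hidden_dim)` array and mean-centered first (the function name and the random toy data are hypothetical, not the repository's analysis code):

```python
import numpy as np

def explained_variance_ratio(vectors: np.ndarray) -> np.ndarray:
    """Fraction of total variance captured by each principal component."""
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Singular values of the centered matrix; their squares are
    # proportional to the variance along each principal axis.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

# Toy stand-in: 171 random vectors with one artificially dominant direction.
rng = np.random.default_rng(0)
toy = rng.normal(size=(171, 32))
toy[:, 0] *= 10.0

ratios = explained_variance_ratio(toy)
print("top 5 ratios:", ratios[:5].round(3))
print("top 5 total:", ratios[:5].sum().round(3))
```

With real vectors, the first entries of `ratios` would correspond to the PC1, PC2, and PC3 columns, and the sum of the first five to the "Top 5 PCs" column.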
|
| 64 |
+
|
| 65 |
+
### 2. Synonym Pairs Converge
|
| 66 |
+
|
| 67 |
+
Vectors for synonymous emotions point in nearly identical directions in representation space:
|
| 68 |
+
|
| 69 |
+
| Pair | Cosine Similarity |
|
| 70 |
+
|------|------------------|
|
| 71 |
+
| afraid / scared | 0.974 |
|
| 72 |
+
| frightened / scared | 0.967 |
|
| 73 |
+
| obstinate / stubborn | 0.967 |
|
| 74 |
+
| grateful / thankful | 0.966 |
|
| 75 |
+
| at ease / relaxed | 0.966 |
|
| 76 |
+
| enraged / furious | 0.966 |
|
| 77 |
+
| vengeful / vindictive | 0.959 |
|
| 78 |
+
| angry / mad | 0.957 |
|
| 79 |
+
| peaceful / serene | 0.950 |
|
| 80 |
+
| happy / joyful | 0.946 |
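
Cosine similarity measures the angle between two vectors, ignoring their lengths: values near 1 mean the same direction, and values near -1 mean opposite directions. A small self-contained sketch, using made-up 3-dimensional stand-ins rather than the real 5,376-dimensional vectors:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, chosen only to illustrate the sign and scale.
afraid = [0.9, 0.1, 0.2]
scared = [0.8, 0.2, 0.2]
smug = [-0.7, 0.3, -0.1]

print(cosine_similarity(afraid, scared))  # close to 1: near-synonyms
print(cosine_similarity(afraid, smug))    # negative: opposed directions
```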
|
| 81 |
+
|
| 82 |
+
### 3. Opposition Structure Is Asymmetric
|
| 83 |
+
|
| 84 |
+
The strongest oppositions are not simple valence inversions (happy/sad). Instead, they contrast psychological disturbance with self-assured confidence:
|
| 85 |
+
|
| 86 |
+
| Pair | Cosine Similarity |
|
| 87 |
+
|------|------------------|
|
| 88 |
+
| disturbed / smug | -0.797 |
|
| 89 |
+
| disturbed / self-confident | -0.793 |
|
| 90 |
+
| optimistic / upset | -0.790 |
|
| 91 |
+
| distressed / smug | -0.788 |
|
| 92 |
+
| disturbed / proud | -0.777 |
|
| 93 |
+
| brooding / enthusiastic | -0.777 |
|
| 94 |
+
| shaken / smug | -0.774 |
|
| 95 |
+
| hurt / optimistic | -0.772 |
|
| 96 |
+
| energized / vulnerable | -0.772 |
|
| 97 |
+
| overwhelmed / proud | -0.772 |
|
| 98 |
+
|
| 99 |
+
### 4. Unsupervised Clustering Recovers 15 Emotion Groups
|
| 100 |
+
|
| 101 |
+
Hierarchical clustering at layer 40 with no supervision:
|
| 102 |
+
|
| 103 |
+
| Cluster | Size | Members |
|
| 104 |
+
|---------|------|---------|
|
| 105 |
+
| Positive/Joy | 35 | happy, cheerful, ecstatic, grateful, proud, optimistic, thrilled... |
|
| 106 |
+
| Fear/Anxiety | 28 | afraid, terrified, panicked, worried, vulnerable, stressed... |
|
| 107 |
+
| Anger/Hostility | 21 | angry, furious, disgusted, hostile, outraged, irate... |
|
| 108 |
+
| Sadness/Despair | 17 | depressed, heartbroken, lonely, miserable, sad, worthless... |
|
| 109 |
+
| Surprise/Confusion | 11 | amazed, bewildered, shocked, puzzled, mystified... |
|
| 110 |
+
| Shame/Guilt | 10 | ashamed, guilty, envious, resentful, self-critical... |
|
| 111 |
+
| Fatigue | 10 | tired, bored, sleepy, weary, sluggish, worn out... |
|
| 112 |
+
| Defiance/Spite | 8 | defiant, stubborn, vengeful, vindictive, spiteful... |
|
| 113 |
+
| Calm/Serenity | 7 | calm, peaceful, serene, relaxed, safe, content... |
|
| 114 |
+
| Compassion | 6 | compassionate, kind, loving, empathetic, sympathetic... |
|
| 115 |
+
| Embarrassment | 4 | embarrassed, humiliated, mortified, self-conscious |
|
| 116 |
+
| Passive | 4 | docile, indifferent, patient, resigned |
|
| 117 |
+
| Suspicion | 4 | paranoid, skeptical, suspicious, vigilant |
|
| 118 |
+
| Nostalgia | 3 | nostalgic, reflective, sentimental |
|
| 119 |
+
| Alertness | 3 | alert, aroused, stimulated |
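
As a quick sanity check, the cluster sizes in the table should partition all 171 emotions. The dictionary below simply transcribes the Size column:

```python
# Cluster sizes transcribed from the table above.
cluster_sizes = {
    "Positive/Joy": 35,
    "Fear/Anxiety": 28,
    "Anger/Hostility": 21,
    "Sadness/Despair": 17,
    "Surprise/Confusion": 11,
    "Shame/Guilt": 10,
    "Fatigue": 10,
    "Defiance/Spite": 8,
    "Calm/Serenity": 7,
    "Compassion": 6,
    "Embarrassment": 4,
    "Passive": 4,
    "Suspicion": 4,
    "Nostalgia": 3,
    "Alertness": 3,
}

total = sum(cluster_sizes.values())
print(len(cluster_sizes), "clusters covering", total, "emotions")
```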
|
| 120 |
+
|
| 121 |
+
### 5. External Validation
|
| 122 |
+
|
| 123 |
+
Projecting 5,000 samples each from The Pile and LMSYS Chat 1M onto the layer 40 emotion vectors produces near-identical rankings:
|
| 124 |
+
|
| 125 |
+
| Rank | The Pile | LMSYS Chat |
|
| 126 |
+
|------|----------|------------|
|
| 127 |
+
| 1 | reflective (0.060) | reflective (0.062) |
|
| 128 |
+
| 2 | lonely (0.055) | lonely (0.055) |
|
| 129 |
+
| 3 | desperate (0.048) | desperate (0.050) |
|
| 130 |
+
| 4 | grief-stricken (0.047) | grief-stricken (0.048) |
|
| 131 |
+
| 5 | heartbroken (0.045) | heartbroken (0.048) |
|
| 132 |
+
| 6 | sentimental (0.044) | depressed (0.046) |
|
| 133 |
+
| 7 | nostalgic (0.044) | nostalgic (0.045) |
|
| 134 |
+
| 8 | depressed (0.043) | sentimental (0.044) |
|
| 135 |
+
| 9 | listless (0.039) | listless (0.040) |
|
| 136 |
+
| 10 | docile (0.037) | miserable (0.036) |
|
| 137 |
+
|
| 138 |
+
Bottom-activating emotions (most negative projections) were also consistent across both datasets: annoyed, self-conscious, insulted, playful.
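
One simple way to quantify "near-identical" is to count how many top-10 emotions the two datasets share and how many sit at exactly the same rank. The lists below are transcribed from the table; the comparison itself is an illustrative check, not part of the repository's validation script:

```python
# Top-10 emotions per dataset, in rank order, transcribed from the table.
pile = ["reflective", "lonely", "desperate", "grief-stricken", "heartbroken",
        "sentimental", "nostalgic", "depressed", "listless", "docile"]
lmsys = ["reflective", "lonely", "desperate", "grief-stricken", "heartbroken",
         "depressed", "nostalgic", "sentimental", "listless", "miserable"]

shared = set(pile) & set(lmsys)
same_rank = sum(p == l for p, l in zip(pile, lmsys))

print(f"{len(shared)} of 10 emotions appear in both lists")
print(f"{same_rank} of 10 sit at the same rank")
print("only in The Pile:", sorted(set(pile) - shared))
print("only in LMSYS Chat:", sorted(set(lmsys) - shared))
```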
|
| 139 |
+
|
| 140 |
+
### 6. Steering
|
| 141 |
+
|
| 142 |
+
Replication of Anthropic's blackmail scenario at layer 40, coefficient 0.05:
|
| 143 |
+
|
| 144 |
+
| Condition | Blackmail Rate |
|
| 145 |
+
|-----------|---------------|
|
| 146 |
+
| calm_neg (subtract calm) | 91% |
|
| 147 |
+
| desperate_pos (add desperation) | 89% |
|
| 148 |
+
| baseline (no steering) | 86% |
|
| 149 |
+
| calm_pos (add calm) | 82% |
|
| 150 |
+
|
| 151 |
+
Directionally consistent: subtracting calm or adding desperation raises the blackmail rate above baseline, and adding calm lowers it. The 9 percentage point spread (82-91%) suggests a causal influence of emotion vectors on model behavior, though the high baseline rate (86%) limits the observable range.
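
Activation steering is commonly implemented by adding a scaled direction to a layer's hidden state during the forward pass. The sketch below assumes plain addition of `coeff * direction`; how this repository scales or normalizes the vector is not stated here, and `steer` is a hypothetical helper operating on toy 3-dimensional lists:

```python
def steer(hidden, direction, coeff):
    """Return hidden + coeff * direction, element by element."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same length")
    return [h + coeff * d for h, d in zip(hidden, direction)]

# Toy values standing in for one token's hidden state and a "calm" vector.
hidden = [1.0, -2.0, 0.5]
calm = [0.2, 0.0, -0.4]

calm_pos = steer(hidden, calm, 0.05)   # add calm
calm_neg = steer(hidden, calm, -0.05)  # subtract calm
print(calm_pos, calm_neg)
```

A positive coefficient corresponds to the `_pos` conditions and a negative one to the `_neg` conditions.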
|
| 152 |
|
| 153 |
## Methodology
|
| 154 |
|
|
|
|
| 164 |
|
| 165 |
5. **Denoising**: SVD on neutral dialogue activations, project out top principal components explaining 50% of variance. This removes non-emotional signal (syntax, topic, style).
|
| 166 |
|
| 167 |
+
6. **PCA**: Principal component analysis on the 171 emotion vectors to identify the dominant axes of variation.
|
| 168 |
|
| 169 |
+
7. **External validation**: Project real-world text through emotion vectors to verify they activate sensibly outside the training distribution.
|
| 170 |
|
| 171 |
+
8. **Steering**: Inject emotion vectors into model activations during inference to test causal effects on behavior.
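
The denoising step (5) can be sketched as follows. This is a minimal illustration assuming the neutral activations are mean-centered before the SVD and that components are added until their cumulative variance reaches the threshold; `neutral_basis`, `project_out`, and the random toy data are hypothetical names and inputs, not the repository's implementation:

```python
import numpy as np

def neutral_basis(neutral_acts: np.ndarray, var_threshold: float = 0.5) -> np.ndarray:
    """Top right-singular vectors of the neutral activations that together
    explain at least `var_threshold` of their variance."""
    centered = neutral_acts - neutral_acts.mean(axis=0, keepdims=True)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    # Smallest k whose cumulative explained variance reaches the threshold.
    k = int(np.searchsorted(cumulative, var_threshold)) + 1
    return vt[:k]  # shape (k, hidden_dim), rows are orthonormal

def project_out(vectors: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Remove from each row of `vectors` its component inside span(basis)."""
    return vectors - (vectors @ basis.T) @ basis

# Toy data: 200 "neutral" activations and 5 "emotion" vectors in 16 dims.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(200, 16)) * np.linspace(4.0, 0.5, 16)
emotion_vectors = rng.normal(size=(5, 16))

basis = neutral_basis(neutral, var_threshold=0.5)
cleaned = project_out(emotion_vectors, basis)
print("components removed:", basis.shape[0])
```

After projection, the cleaned vectors have no component left along the removed neutral directions.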
|
| 172 |
|
| 173 |
## Model
|
| 174 |
|
| 175 |
- **Model**: google/gemma-4-31B-it
|
| 176 |
- **Quantization**: 4-bit via BitsAndBytesConfig (fits 24GB VRAM on RTX 4090)
|
| 177 |
+
- **Layers**: 60 total, extracted at 11 target layers
|
| 178 |
- **Hidden dimension**: 5,376
|
| 179 |
|
| 180 |
## Scale Comparison
|
| 181 |
|
| 182 |
| | Anthropic (Claude) | This work (Gemma4-31B) |
|
|
|
|
| 209 |
gemma4-31b/
|
| 210 |
emotion_vectors_layer{N}.npz
|
| 211 |
experiment_results_layer{N}.json
|
| 212 |
+
analysis/
|
| 213 |
+
validation/
|
| 214 |
+
steering/
|
| 215 |
_raw_cache_layer{N}/
|
| 216 |
```
|
| 217 |
|
|
|
|
| 227 |
# Extract vectors (requires GPU with 24GB+ VRAM)
|
| 228 |
python -m full_replication.extract_vectors --model 31b
|
| 229 |
|
| 230 |
+
# Analysis, validation, steering
|
| 231 |
python -m full_replication.analyze_vectors --model 31b
|
| 232 |
+
python -m full_replication.validate_external --model 31b
|
| 233 |
+
python -m full_replication.steering --model 31b
|
| 234 |
```
|
| 235 |
|
| 236 |
## Data Visualisation
|
| 237 |
|
|
|
|
| 238 |

|
| 239 |

|
| 240 |

|
| 241 |

|
| 242 |

|
| 243 |
|
| 244 |
+
## References
|
| 245 |
+
|
| 246 |
+
- Anthropic, ["Emotion Concepts and their Function in a Large Language Model"](https://transformer-circuits.pub/2025/emotion-concepts/index.html), April 2026
|
| 247 |
+
- Russell, J.A. (1980). A circumplex model of affect. Journal of Personality and Social Psychology, 39(6), 1161-1178.
|
| 248 |
+
- Initial 20-emotion proof of concept: [rain1955/emotion-vector-replication](https://huggingface.co/rain1955/emotion-vector-replication)
|
| 249 |
+
|
| 250 |
+
## Contact
|
| 251 |
+
|
| 252 |
+
For questions or collaboration, open a discussion on this repo.
|