NLA Activation Reconstructor: Qwen 2.5 7B, Layer 20
A LoRA adapter that reconstructs a 3584-dimensional activation vector from a natural-language description. It scores how well a description captures the geometry of the activation it claims to describe.
A forward hook at layer 20 extracts the hidden state at the injection token position after the model reads the description. Cosine similarity between this extracted vector and the target activation is the score.
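The hook mechanism can be illustrated with a minimal PyTorch sketch. It is not the adapter's actual code: `ToyModel`, `ToyBlock`, `capture_hidden`, the 16-dim hidden size, and the injection position are all hypothetical stand-ins so the example runs without downloading Qwen 2.5 7B. It also assumes "layer 20" maps to index 20 of the layer list, which this card does not specify.

```python
import torch
import torch.nn as nn

HIDDEN = 16        # toy stand-in for the 3584-dim hidden size
LAYER_INDEX = 20   # assumption: "layer 20" means index 20 in the layer list


class ToyBlock(nn.Module):
    """Hypothetical residual block standing in for a decoder layer."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        return x + torch.tanh(self.proj(x))


class ToyModel(nn.Module):
    """Hypothetical stack of blocks; not the real Qwen architecture."""

    def __init__(self, vocab=100, dim=HIDDEN, n_layers=24):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.layers = nn.ModuleList([ToyBlock(dim) for _ in range(n_layers)])

    def forward(self, ids):
        h = self.embed(ids)
        for layer in self.layers:
            h = layer(h)
        return h


def capture_hidden(model, ids, layer_index, position):
    """Run one forward pass and return the hidden state at one token position."""
    captured = {}

    def hook(module, inputs, output):
        # Some real decoder layers return a tuple; the hidden state comes first.
        hidden = output[0] if isinstance(output, tuple) else output
        captured["h"] = hidden[:, position, :].detach()

    handle = model.layers[layer_index].register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(ids)
    finally:
        handle.remove()  # always detach the hook, even if the pass fails
    return captured["h"]


torch.manual_seed(0)
model = ToyModel().eval()
ids = torch.randint(0, 100, (1, 8))   # stand-in for a tokenized description
injection_position = 5                # hypothetical injection token index
vec = capture_hidden(model, ids, LAYER_INDEX, injection_position)

target = torch.randn(1, HIDDEN)       # stand-in for the target activation
score = torch.nn.functional.cosine_similarity(vec, target, dim=-1).item()
```

Removing the hook in a `finally` block keeps repeated scoring calls from stacking hooks on the same layer.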
Numbers
- Validation cosine similarity: 0.943
- Trained on 5,281 descriptions, LoRA r=32, 10 epochs
- Cross-validates with Anthropic AR at 84% agreement on AV-generated descriptions
How scoring works
- Feed description through model with injection token
- Forward hook captures hidden state at layer 20, injection position
- Cosine similarity between the captured hidden state and the target activation = score
- Higher score means the description better captures what the model was computing
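
The last two steps reduce to a cosine similarity, which can be sketched in plain Python. The vectors below are made-up three-dimensional toy values, not real activations, and `cosine_similarity` is a hypothetical helper written for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy target activation and toy captured hidden states (illustrative only).
target = [1.0, 2.0, 2.0]
captured = {
    "close description": [2.0, 4.0, 4.1],       # nearly parallel to the target
    "unrelated description": [2.0, -1.0, 0.0],  # orthogonal to the target
}

scores = {name: cosine_similarity(vec, target) for name, vec in captured.items()}
best = max(scores, key=scores.get)
```

Cosine similarity lies in [-1, 1] and ignores vector magnitude, so the score reflects only the direction of the captured hidden state relative to the target.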