anicka/nla-qwen2.5-7b-L20-ar-v2

Tags: Text Generation, interpretability, activation-reconstruction


NLA Activation Reconstructor — Qwen 2.5 7B, Layer 20

A LoRA adapter that reconstructs a 3584-dimensional activation vector from a natural-language description. The reconstruction is used to score how well a description captures the geometry of the activation it claims to describe.

A forward hook at layer 20 extracts the hidden state at the injection token position after the model reads the description. Cosine similarity between this extracted vector and the target activation is the score.
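The hook-based extraction described above can be sketched as follows. This is a minimal illustration on a toy stack of linear layers, not the adapter's actual code: `ToyStack`, `capture_hidden`, the 16-dim hidden size, and the choice of the last position as the "injection position" are all assumptions made for the example. With a real Hugging Face decoder layer the hook output is often a tuple, so the hidden state may need to be taken from its first element.

```python
import torch
import torch.nn as nn


class ToyStack(nn.Module):
    """Tiny stand-in for a decoder stack (hypothetical; NOT the Qwen architecture)."""

    def __init__(self, n_layers: int = 4, hidden: int = 16):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(n_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = torch.tanh(layer(x))
        return x


def capture_hidden(model: ToyStack, x: torch.Tensor, layer_idx: int, position: int) -> torch.Tensor:
    """Run one forward pass and return the output of layer `layer_idx` at token `position`."""
    captured = {}

    def hook(module, inputs, output):
        # output has shape (batch, seq_len, hidden); keep a single token position
        captured["h"] = output[:, position, :].detach()

    handle = model.layers[layer_idx].register_forward_hook(hook)
    try:
        with torch.no_grad():
            model(x)
    finally:
        handle.remove()  # always remove the hook so later passes are unaffected
    return captured["h"]


torch.manual_seed(0)
model = ToyStack()
tokens = torch.randn(1, 5, 16)  # stand-in for the embedded description, (batch, seq_len, hidden)
hidden = capture_hidden(model, tokens, layer_idx=2, position=-1)
print(hidden.shape)
```

In the real setting the layer index would be 20 and the vector would have 3584 dimensions, as stated in this card.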

Numbers

Validation cosine similarity: 0.943
Trained on 5,281 descriptions, LoRA r=32, 10 epochs
Cross-validates with Anthropic AR at 84% agreement on AV-generated descriptions

How scoring works

Feed the description through the model together with the injection token
A forward hook captures the hidden state at layer 20, at the injection position
Score = cosine similarity between the captured hidden state and the target activation
Higher score means the description better captures what the model was computing
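To make the scoring step concrete, here is a small sketch of the cosine-similarity score in plain Python. The function name `cosine_score` and the three-dimensional example vectors are hypothetical and chosen only for readability; the real vectors have 3584 dimensions.

```python
import math
from typing import Sequence


def cosine_score(captured: Sequence[float], target: Sequence[float]) -> float:
    """Cosine similarity between a captured hidden state and a target activation."""
    if len(captured) != len(target):
        raise ValueError("vectors must have the same length")
    dot = sum(c * t for c, t in zip(captured, target))
    norm_c = math.sqrt(sum(c * c for c in captured))
    norm_t = math.sqrt(sum(t * t for t in target))
    if norm_c == 0.0 or norm_t == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_c * norm_t)


target = [1.0, 2.0, 2.0]
candidates = {
    "same direction, different scale": [2.0, 4.0, 4.0],
    "orthogonal": [2.0, -1.0, 0.0],
}

# Rank candidate reconstructions from best to worst score.
ranked = sorted(candidates, key=lambda name: cosine_score(candidates[name], target), reverse=True)
print(ranked)
```

Because cosine similarity ignores vector length, the score reflects only the direction of the captured hidden state relative to the target, not its magnitude.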

Related

AV: anicka/nla-qwen2.5-7b-L20-av-v2
Dataset: anicka/nla-qwen2.5-7b-L20-dataset


Model tree for anicka/nla-qwen2.5-7b-L20-ar-v2

Base model: Qwen/Qwen2.5-7B
Finetuned: Qwen/Qwen2.5-7B-Instruct
Adapter: this model