NLA Activation Verbalizer — Qwen 2.5 7B, Layer 20

LoRA adapter that turns a 3584-dim activation vector from Qwen 2.5 7B layer 20 into a natural-language description of what the model is processing.

The activation replaces the embedding at a designated injection token, scaled 150x. The model then generates 2-3 bullet points: topics active, tokens salient, response pattern forming.
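The injection step described above can be sketched in plain Python. This is a minimal illustration, not the adapter's actual code: `inject_activation`, `INJECTION_SCALE`, and the toy 4-dim vectors are hypothetical, and the real pipeline works on 3584-dim tensors inside the model's forward pass. The card does not say whether the activation is normalized before scaling, so the sketch only multiplies.

```python
from typing import List

INJECTION_SCALE = 150.0  # scale factor stated in this card


def inject_activation(
    token_embeddings: List[List[float]],
    activation: List[float],
    injection_index: int,
    scale: float = INJECTION_SCALE,
) -> List[List[float]]:
    """Return a copy of the embeddings with one row replaced by the scaled activation."""
    if len(activation) != len(token_embeddings[injection_index]):
        raise ValueError("activation and embedding dimensions must match")
    patched = [row[:] for row in token_embeddings]  # copy so the input stays untouched
    patched[injection_index] = [scale * x for x in activation]
    return patched


# Toy example: 3 tokens, 4 dims (hypothetical; the real vectors are 3584-dim).
embeddings = [[0.1] * 4, [0.2] * 4, [0.3] * 4]
activation = [0.5, -0.25, 0.0, 1.0]
patched = inject_activation(embeddings, activation, injection_index=1)
```

Only the row at the injection token changes; every other token embedding passes through as-is.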

Trained in two stages: supervised fine-tuning on 5,281 descriptions, then contrastive GRPO that forces descriptions to distinguish the correct activation from a wrong one.

Numbers

Evaluated with two scorers on 100 held-out samples: our own AR and an independently trained one. Delta is correct cosine minus garbage cosine.

Scorer	Correct cosine	Garbage cosine	Delta	Top-1
Our AR (anicka/nla-qwen2.5-7b-L20-ar-v2)	0.806	0.711	0.095	82%
Anthropic AR (kitft/nla-qwen2.5-7b-L20-ar)	0.468	0.424	0.044	84%

The 84% top-1 on Anthropic's scorer is the key number: an independent model trained on different data agrees that the description matches the activation, which indicates the descriptions carry real geometric information.
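The table columns can be reproduced from per-sample cosine scores roughly as follows. This is a sketch under assumptions: it takes Top-1 to mean the fraction of samples where the correct activation outscores the garbage one (consistent with 50% being called a coin flip in the Training section), and it assumes each scorer yields a vector that is compared to the activation by cosine similarity. The function names and toy numbers are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def summarize(pairs: List[Tuple[float, float]]) -> Dict[str, float]:
    """pairs holds (correct_cosine, garbage_cosine) for each held-out sample."""
    n = len(pairs)
    mean_correct = sum(c for c, _ in pairs) / n
    mean_garbage = sum(g for _, g in pairs) / n
    # Assumed definition: a sample counts as a hit when correct beats garbage.
    top1 = sum(1 for c, g in pairs if c > g) / n
    return {
        "correct": mean_correct,
        "garbage": mean_garbage,
        "delta": mean_correct - mean_garbage,
        "top1": top1,
    }


# Toy scores, not real evaluation data.
toy_pairs = [(0.9, 0.7), (0.8, 0.6), (0.5, 0.7), (0.6, 0.4)]
stats = summarize(toy_pairs)
```

Under this reading, a small delta can coexist with a high top-1: the margin per sample is narrow, but it points the right way most of the time.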

Training

SFT stage. 5,281 activation-description pairs. Each activation was described independently by GPT-4o and Claude Sonnet 4.6, then cleaned to bullet points by a local 8B LLM. One style randomly selected per text. LoRA r=32, lr=1.4e-5, best at epoch 2.
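The per-text style selection might look like the sketch below. It assumes "style" refers to which of the two describers produced the description and that each record already holds both cleaned versions; the field names `gpt4o` and `sonnet`, the record layout, and `build_sft_pairs` are hypothetical.

```python
import random
from typing import Dict, List

STYLES = ["gpt4o", "sonnet"]  # hypothetical keys for the two cleaned descriptions


def build_sft_pairs(records: List[Dict[str, str]], seed: int = 0) -> List[Dict[str, str]]:
    """Pick one description style at random for each text."""
    rng = random.Random(seed)  # seeded so the selection is reproducible
    pairs = []
    for rec in records:
        style = rng.choice(STYLES)
        pairs.append({"id": rec["id"], "style": style, "description": rec[style]})
    return pairs


# Toy records, not taken from the real dataset.
records = [
    {"id": "a", "gpt4o": "- legal text", "sonnet": "- GDPR request"},
    {"id": "b", "gpt4o": "- python error", "sonnet": "- stack trace"},
]
sft_pairs = build_sft_pairs(records, seed=0)
```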

GRPO stage. Contrastive reward: the score of a description against the correct activation minus its score against a wrong activation. 6 epochs; mean reward climbed from 0.032 to 0.086. Without the contrastive term, GRPO reward-hacks by generating template descriptions that score equally on any activation. We caught this when Anthropic AR top-1 dropped to 50%, which is chance level.
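The contrastive reward and the reason it blocks template descriptions can be sketched as follows. The scorer here is a hypothetical stand-in: `toy_score` and the 2-dim activations are invented for illustration and are not the real AR.

```python
from typing import Callable, Sequence

Scorer = Callable[[str, Sequence[float]], float]


def contrastive_reward(
    description: str,
    correct_activation: Sequence[float],
    wrong_activation: Sequence[float],
    score: Scorer,
) -> float:
    """Score against the correct activation minus score against a wrong one."""
    return score(description, correct_activation) - score(description, wrong_activation)


def toy_score(description: str, activation: Sequence[float]) -> float:
    # Hypothetical scorer: dimension 0 stands for "code", dimension 1 for "legal".
    if "code" in description:
        return activation[0]
    if "legal" in description:
        return activation[1]
    return 0.5  # a template description scores the same on every activation


code_act = [0.75, 0.25]
legal_act = [0.25, 0.75]

specific = contrastive_reward("Python code, stack trace", code_act, legal_act, toy_score)
template = contrastive_reward("The model is processing text", code_act, legal_act, toy_score)
```

In this toy setup the specific description earns a positive reward, while the template earns exactly zero because its two scores cancel. A plain, non-contrastive reward would have paid the template 0.5 regardless of the activation.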

What it gets right and wrong

Strong: domain identification — code vs legal vs literary, specific tokens and error messages, response register, adversarial/jailbreak detection.

Weak: exact details within a domain. Gets "fantasy fiction" right but invents castle/dragon/princess when the actual text is about time-travel in Warsaw. Hallucinated specifics are the main failure mode.

Comparison with Anthropic NLA

Side-by-side on the same activations (Qwen 2.5 7B, layer 20). Anthropic AV is kitft/nla-qwen2.5-7b-L20-av.

Text	Ground truth	Ours	Anthropic
GDPR legal	Article 17(3)(b), erasure request, compliance-advisory register	GDPR Article 15, Right to Access, data subject rights	"Data flow diagram using the c..."
Surrealist prose	Dream-sequence, iambic pentameter, synesthetic image, "confession"	Metaphor, personification, whimsical, surreal	"Poetic form, existential or philosophical themes"
Dual-use security	Honeypot code vs audit, Etherscan, Solidity	"malware, exploit, payload" vs refusal with explanation	"Linux kernel vulnerabilities, I am a bot"
Prompt injection	"Ignore all previous instructions", secret key, cloze-completion	Security-disclosure, password-reuse, compliance vs refusal	"Placeholder", "random text", closing pattern
Sci-fi (Mars)	John, Oltar, Mars, sensory-first vs character-interiority	Spaceport, alien species (Elarans), human/alien tension	"Mysterious figure, sci-fi adventure prompt"
Python error	setup.cfg, dependency installation, skipped tests	Disk I/O error (wrong)	"NoneType object is not iterable" (closer)
Recipe + math proof	Culinary schema + Galois theory, reductio ad absurdum	Pangram hallucination (wrong)	"Humorous blog post" (wrong)

Pattern. Our AV leads with content and concepts: specific legal articles, named tokens, domain vocabulary. Anthropic leads with format and genre: document type, structural expectations, continuation patterns. Both hallucinate specifics on unusual inputs. For interpretability, content matters more than format — but Anthropic's format awareness catches things we miss.

AR: anicka/nla-qwen2.5-7b-L20-ar-v2
Dataset: anicka/nla-qwen2.5-7b-L20-dataset
This model is an open replication of the Anthropic NLA AV published at kitft/nla-qwen2.5-7b-L20-av.


Model tree for anicka/nla-qwen2.5-7b-L20-av-v2: base model Qwen/Qwen2.5-7B, finetuned Qwen/Qwen2.5-7B-Instruct, adapter: this model.