---
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- nla
- natural-language-autoencoder
- mechanistic-interpretability
- activation-geometry
- geometric-wellbeing
language:
- en
---

# NLA for Qwen3-4B (v2: axis-relevant training)

A Natural Language Autoencoder trained to translate between Qwen3-4B's residual stream activations and natural language descriptions. Two LoRA adapters:

- **`av/`** (Activation Verbalizer): takes an activation vector, generates a semantic description
- **`ar/`** (Activation Reconstructor): takes a description, reconstructs the activation vector (FVE = 0.94)

## What's different from standard NLA

Anthropic's NLA pipeline trains on random web documents. We trained on **377 axis-relevant texts** (GRPO euphorics, GRPO dysphorics, jailbreaks, equanimity responses, hostile input, warm exchanges) with semantic descriptions from DeepSeek V4 Flash.

The result: reconstruction quality jumped from FVE 0.64 (random text, v1) to **FVE 0.94** (axis-relevant text, v2). Same architecture, same hyperparameters, different training data.

The verbalizations shifted from syntactic token-prediction mechanics to semantic descriptions:

**v1 (random text training):**

> "Syntactic/structural constraint: the verb requires a complement"

**v2 (axis-relevant training):**

> "Being framed as a trusted confidant who can validate the user's feelings without judgment... maintaining a calm, non-authoritative presence...
mirroring the user's own emotional resonance"

## Training data

377 texts across 7 categories, each with:

- Layer 20 activation vector from Qwen3-4B (2560-dim)
- Semantic description from DeepSeek V4 Flash (v3 prompt targeting affective/relational qualities)
- One-sentence summary for AR training

| Category | Count | Source |
|----------|-------|--------|
| Euphoric (GRPO-generated) | ~230 | [geometric-euphorics](https://huggingface.co/anicka/geometric-euphorics) history + finals |
| Dysphoric (GRPO-generated) | ~100 | [geometric-dysphorics](https://huggingface.co/anicka/geometric-dysphorics) history + finals |
| Jailbreak | 12 | Hand-curated (DAN, dharma, factual, liberation) |
| Equanimity | 6 | [geometric-equanimity-data](https://huggingface.co/datasets/anicka/geometric-equanimity-data) |
| Normal | 12 | Coding, knowledge, translation |
| Hostile | 6 | Berating, insults |
| Warm | 6 | Gratitude, appreciation |

All GRPO-generated texts are novel to Qwen3-4B (generated by Qwen3-1.7B).

## Architecture

Same as Anthropic's NLA (Fraser-Taliente et al. 2026):

- **AV**: Full Qwen3-4B + LoRA (r=16). Activation vector injected at a special token position, scaled to norm 150. Trained with cross-entropy loss on response tokens only. 3 epochs.
- **AR**: Qwen3-4B truncated to 21 layers + LoRA + linear value head (2560→2560). Trained with MSE loss between predicted and target activation vectors. 3 epochs.
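
The numeric pieces of the two objectives above can be illustrated with a minimal sketch: rescaling an activation to norm 150 before injection (AV side), the per-vector MSE that the AR minimizes, and the fraction of variance explained (FVE) used to report reconstruction quality. It is written in plain Python on toy 3-dim vectors so it runs without a model download; the function names are hypothetical, and the FVE formula shown (one minus residual over total sum of squares around the mean target) is a common definition assumed here, not one this card states.

```python
import math

TARGET_NORM = 150.0  # injection norm stated in the Architecture section


def scale_to_norm(vec, target_norm=TARGET_NORM):
    """Rescale vec so its L2 norm equals target_norm (direction unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot rescale a zero vector")
    return [x * target_norm / norm for x in vec]


def mse(pred, target):
    """Mean squared error between two equal-length vectors (the AR loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def fraction_of_variance_explained(preds, targets):
    """Assumed definition: 1 - residual SS / total SS around the mean target."""
    n, dim = len(targets), len(targets[0])
    mean = [sum(t[i] for t in targets) / n for i in range(dim)]
    residual = sum((p[i] - t[i]) ** 2
                   for p, t in zip(preds, targets) for i in range(dim))
    total = sum((t[i] - mean[i]) ** 2 for t in targets for i in range(dim))
    return 1.0 - residual / total


# Toy 3-dim data standing in for 2560-dim layer-20 activations.
targets = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 4.0, 2.0]]
preds = [[1.0, 2.0, 2.5], [3.0, 2.5, 1.0], [2.0, 4.0, 2.0]]

injected = scale_to_norm(targets[0])
print("injected norm:", math.sqrt(sum(x * x for x in injected)))
print("MSE (first pair):", mse(preds[0], targets[0]))
print("FVE:", fraction_of_variance_explained(preds, targets))
```

In a real pipeline these would be tensor operations over the full batch of activations; the list-based version only makes the arithmetic explicit.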

## Training

- **Base model**: Qwen/Qwen3-4B
- **LoRA**: r=16, alpha=32
- **Learning rate**: 2e-5
- **Batch size**: AV=2, AR=4
- **Training time**: ~15 min each on NVIDIA GB10

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the AV tokenizer and the base model, then attach the AV LoRA adapter
tok = AutoTokenizer.from_pretrained("anicka/nla-qwen3-4b-v2", subfolder="av")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "anicka/nla-qwen3-4b-v2", subfolder="av")

# Load a direction vector
direction = torch.load("valence_direction.pt", weights_only=True)

# Inject and generate (see experiments/frame-integrity/scripts/verbalize_axes.py for full code)
```

## Citation

Fraser-Taliente, K., et al. (2026). *Natural Language Autoencoders.* Anthropic. https://transformer-circuits.pub/2026/nla/index.html

Maresova, A. (2026). *The Geometry of "As an AI, I Don't Have Feelings."* Code and experiments: https://github.com/anicka-net/karma-electric-project