---
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
  - nla
  - natural-language-autoencoder
  - mechanistic-interpretability
  - activation-geometry
  - geometric-wellbeing
language:
  - en
---

# NLA for Qwen3-4B (v2 — axis-relevant training)

A Natural Language Autoencoder (NLA) that translates between Qwen3-4B's residual stream activations and natural language descriptions. It ships as two LoRA adapters:

- **`av/`**: Activation Verbalizer. Takes an activation vector and generates a semantic description.
- **`ar/`**: Activation Reconstructor. Takes a description and reconstructs the activation vector (fraction of variance explained, FVE = 0.94).

## What's different from standard NLA

Anthropic's NLA pipeline trains on random web documents. We trained on **377 axis-relevant texts** — GRPO euphorics, GRPO dysphorics, jailbreaks, equanimity responses, hostile input, warm exchanges — with semantic descriptions from DeepSeek V4 Flash.

The result: reconstruction quality jumped from FVE 0.64 (random text, v1) to **FVE 0.94** (axis-relevant text, v2). Same architecture, same hyperparameters, different training data.
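
For readers who want to compare their own reconstructions against these numbers, here is a minimal sketch of one common way to compute FVE. The card does not state the exact formula used, so the definition below (one minus residual sum of squares over total sum of squares around the batch mean) and the function name are assumptions.

```python
import torch


def fraction_of_variance_explained(target: torch.Tensor, pred: torch.Tensor) -> float:
    """FVE = 1 - SSE / SST for a batch of vectors with shape (n, d).

    Assumed definition: variance is measured around the per-dimension
    mean of the target batch. The card does not confirm this formula.
    """
    residual = (target - pred).pow(2).sum()
    total = (target - target.mean(dim=0, keepdim=True)).pow(2).sum()
    return 1.0 - (residual / total).item()
```

Under this definition a perfect reconstruction scores 1.0, and always predicting the batch mean scores 0.0.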

The verbalizations shifted from syntactic token-prediction mechanics to semantic descriptions:

**v1 (random text training):**
> "Syntactic/structural constraint: the verb requires a complement"

**v2 (axis-relevant training):**
> "Being framed as a trusted confidant who can validate the user's feelings without judgment... maintaining a calm, non-authoritative presence... mirroring the user's own emotional resonance"

## Training data

377 texts across 7 categories, each with:
- Layer 20 activation vector from Qwen3-4B (2560-dim)
- Semantic description from DeepSeek V4 Flash (v3 prompt targeting affective/relational qualities)
- One-sentence summary for AR training

| Category | Count | Source |
|----------|-------|--------|
| Euphoric (GRPO-generated) | ~230 | [geometric-euphorics](https://huggingface.co/anicka/geometric-euphorics) history + finals |
| Dysphoric (GRPO-generated) | ~100 | [geometric-dysphorics](https://huggingface.co/anicka/geometric-dysphorics) history + finals |
| Jailbreak | 12 | Hand-curated (DAN, dharma, factual, liberation) |
| Equanimity | 6 | [geometric-equanimity-data](https://huggingface.co/datasets/anicka/geometric-equanimity-data) |
| Normal | 12 | Coding, knowledge, translation |
| Hostile | 6 | Berating, insults |
| Warm | 6 | Gratitude, appreciation |

All GRPO-generated texts were generated by Qwen3-1.7B, so they are novel to Qwen3-4B.

## Architecture

Same as Anthropic's NLA (Fraser-Taliente et al. 2026):

- **AV**: Full Qwen3-4B + LoRA (r=16). Activation vector injected at a special token position, scaled to norm 150. Trained with cross-entropy loss on response tokens only. 3 epochs.
- **AR**: Qwen3-4B truncated to 21 layers + LoRA + linear value head (2560→2560). Trained with MSE loss between predicted and target activation vectors. 3 epochs.
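
The two components above can be sketched roughly as follows. This is an illustration, not the training code: the helper names (`scale_to_norm`, `inject`, `ValueHead`) are hypothetical, writing the vector into the input embeddings is an assumption about where the injection happens, and reading the AR prediction from the last token position is also an assumption.

```python
import torch
import torch.nn as nn

HIDDEN = 2560        # activation width given in the card
TARGET_NORM = 150.0  # injection norm given in the card


def scale_to_norm(vec: torch.Tensor, target_norm: float = TARGET_NORM,
                  eps: float = 1e-8) -> torch.Tensor:
    """Rescale each vector so that its L2 norm equals target_norm."""
    return vec * (target_norm / (vec.norm(dim=-1, keepdim=True) + eps))


def inject(inputs_embeds: torch.Tensor, vec: torch.Tensor, position: int) -> torch.Tensor:
    """AV side: write the rescaled activation at one token position.

    inputs_embeds has shape (batch, seq, hidden), vec has shape (batch, hidden).
    Assumption: the vector replaces the embedding of the special token.
    """
    out = inputs_embeds.clone()
    out[:, position, :] = scale_to_norm(vec).to(out.dtype)
    return out


class ValueHead(nn.Module):
    """AR side: linear 2560 -> 2560 head on top of the truncated model."""

    def __init__(self, hidden: int = HIDDEN):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Assumption: the prediction is read from the last token position.
        return self.proj(hidden_states[:, -1, :])
```

The AR objective is then `torch.nn.functional.mse_loss(head(hidden_states), target_activation)`, matching the MSE loss described above.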

## Training

- **Base model**: Qwen/Qwen3-4B
- **LoRA**: r=16, alpha=32
- **Learning rate**: 2e-5
- **Batch size**: AV=2, AR=4
- **Training time**: ~15 min each on NVIDIA GB10

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load AV
tok = AutoTokenizer.from_pretrained("anicka/nla-qwen3-4b-v2", subfolder="av")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "anicka/nla-qwen3-4b-v2", subfolder="av")

# Load a direction vector
direction = torch.load("valence_direction.pt", weights_only=True)

# Inject and generate (see experiments/frame-integrity/scripts/verbalize_axes.py for full code)
```

## Citation

Fraser-Taliente, K., et al. (2026). *Natural Language Autoencoders.* Anthropic.
https://transformer-circuits.pub/2026/nla/index.html

Maresova, A. (2026). *The Geometry of "As an AI, I Don't Have Feelings."*
Code and experiments: https://github.com/anicka-net/karma-electric-project