Gemma-4-E2B NLA AV (Actor) — v0.1.dd step_250

LoRA adapter for the AV half of a Natural Language Autoencoder on Gemma-4-E2B at residual layer L23, trained on a 4 GB GTX 1650 Ti Max-Q consumer GPU.

Pair with Solshine/gemma-4-e2b-nla-L23-ar-v0_1-paraphrase-invariant for the matched v0.1 NLA pair.

Update 2026-05-19 — n=50 two-judge cross-validation, 4-AR Δmse sweep, and the polysemanticity-at-scale caveat

The 2026-05-18 calibration data on this card was extended on 2026-05-19 (source repo FINDINGS.md §F72 Addenda 9-11):

n=50 head-to-head against Anthropic's deployed Gemma-3-27B Layer 41 NLA (5× the original n=10 sample): the Claude judge preferred Anthropic on 49/50 rows; the Gemini judge, whose prompt explicitly calibrated for the parameter-size gap (it was told Anthropic's NLA is ~13× larger, bf16, full-FT, and GRPO-trained), preferred Anthropic on 48/49 decided rows (1 tie). Size calibration in the prompt did NOT swing the verdict. Validity, Anthropic vs ours: 2.92 vs 1.00 (Claude judge); 4.57 vs 1.20 (Gemini judge).
4-AR per-claim Δmse sweep on Anthropic's published "Characterizing confabulations" probe (n=30 rows, 138-140 claims each): v0.0.1 baseline AR +3.4% FVE, v0.1 paraphrase-invariant +1.8% FVE (worst, by design), v0.2 noise-hinge +7.3% FVE (best, ~73% of Anthropic's low-end 10% published reference), v0.3 cross-row contrastive +4.5% FVE. The noise-hinge family is the most promising AR-side cloud-GPU lever.
4-AR cross-row identity (n=50): all 4 ARs score at or near chance (1-2 correct out of 50, i.e. 2-4%, against a 1/50 = 2% chance rate). v0.2, the best AR by Δmse, is tied for worst on cross-row argmax. Δmse sensitivity and per-row identity are genuinely different probes.
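
To make the cross-row identity probe concrete, here is a minimal sketch. It assumes the probe works as a retrieval test: each AR reconstruction is compared by cosine similarity against every row's true activation, and a row counts as correct only when its own activation is the argmax. The function names and toy vectors are illustrative and are not taken from the source repo.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cross_row_accuracy(reconstructions, targets):
    """Fraction of rows whose reconstruction is most similar to its own target.

    reconstructions[i] is the AR output for row i's explanation (hypothetical input);
    targets[j] is the true activation for row j. Ties resolve to the lowest index.
    """
    hits = 0
    for i, rec in enumerate(reconstructions):
        sims = [cosine(rec, t) for t in targets]
        best = max(range(len(targets)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(targets)


# Toy data: three orthogonal "activations".
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Row-specific reconstructions: each one points mostly at its own target.
specific = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

# Template-collapsed reconstructions: the same vector for every row.
collapsed = [[1.0, 1.0, 1.0]] * 3

print("row-specific:", cross_row_accuracy(specific, targets))
print("collapsed:   ", cross_row_accuracy(collapsed, targets))
```

A reconstruction that is identical for every row can match at most one row, so accuracy falls to 1/n, which is the 2% chance rate at n=50.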

Important apples-to-oranges caveat (Addendum 11)

The cross-NLA comparison above sets what Anthropic's pipeline on Gemma-3-27B produces for a given source text against what our LoRA pipeline on the 2B Gemma-4-E2B produces from a different activation derived from the same source. Two effects compound, and current data cannot disentangle them:

Training-stack gap: full-FT bf16 + GRPO + 10K-50K steps (Anthropic) vs LoRA + NF4 + ~50-300 SFT steps (ours).
Cross-model activation gap: their NLAs read 27B-Gemma-3 L41 activations; ours reads 2B-Gemma-4-E2B L23 activations. Per Anthropic's own toy-models-of-superposition line of work, polysemanticity-per-neuron scales inversely with model capacity — the 2B L23 activation may intrinsically encode less per-instance specificity than the 27B L41 activation, regardless of NLA training quality.

The clean test that would disentangle these: train an NLA on Gemma-3-27B L41 using our exact recipe (LoRA r=80, NF4 4-bit, ~50-step SFT, same labeled corpus re-extracted at L41), at an estimated ~30-50 A100-hours of cloud GPU. If L2 cross-row argmax lifts substantially on 27B-at-our-recipe, polysemanticity at 2B scale is the dominant factor and we're near an intrinsic ceiling. If it doesn't, training-stack constraints are the bottleneck and model size is incidental. Flagged for the next cloud-GPU grant.

Honest implication for use

There is no published reference NLA for Gemma-4-E2B L23 specifically. Anthropic's deployed NLAs read different models entirely (Gemma-3-27B L41 / Llama-3.3-70B L53). This pair is the only NLA that reads Gemma-4-E2B L23 activations, full stop. For someone interested in Gemma-4-E2B specifically, our own internal L1a/L1b/L2 metrics (held-out factual/gibber discrimination, per-claim Δmse, cross-row argmax) are the right calibration tools — not the cross-NLA comparison against different-model NLAs.
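
As a rough illustration of the per-claim Δmse metric named above, the sketch below assumes Δmse is the rise in AR reconstruction MSE when a single claim is removed from the explanation, reported as a fraction of the target activation's variance so that it reads as a change in FVE. That definition, the function names, and the toy numbers are assumptions for illustration; the source repo's exact computation may differ.

```python
from statistics import fmean


def mse(target, recon):
    """Mean squared error between a target activation and a reconstruction."""
    return fmean((t - r) ** 2 for t, r in zip(target, recon))


def variance(target):
    """Population variance of the target activation's components."""
    mean = fmean(target)
    return fmean((t - mean) ** 2 for t in target)


def delta_fve(target, recon_full, recon_ablated):
    """FVE lost when one claim is ablated from the explanation (assumed definition).

    FVE = 1 - MSE / Var, so FVE(full) - FVE(ablated)
        = (MSE(ablated) - MSE(full)) / Var.
    A positive value means the AR was actually using that claim.
    """
    var = variance(target)
    if var == 0:
        raise ValueError("target activation has zero variance")
    return (mse(target, recon_ablated) - mse(target, recon_full)) / var


# Toy example: the full explanation reconstructs the target exactly;
# ablating one claim perturbs a single component.
target = [1.0, 2.0, 3.0, 4.0]
recon_full = [1.0, 2.0, 3.0, 4.0]
recon_ablated = [1.0, 2.0, 3.0, 5.0]

print(f"delta FVE = {delta_fve(target, recon_full, recon_ablated):+.1%}")
```

Under this reading, a sensitive AR shows a large positive Δmse per claim, while an AR that ignores claim content stays near zero whichever claim is removed.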

Source repo discussion: FINDINGS.md §F72 Addenda 9, 10, 11 in SolshineCode/deception-nanochat-sae-research. Public release notes: RELEASE_CALIBRATION.md.

Calibration against Anthropic's deployed NLAs (Neuronpedia API, 2026-05-18)

A 10-row head-to-head against Anthropic's deployed Gemma-3-27B Layer 41 NLA (via POST /api/nla/explain on Neuronpedia):

Metric	Anthropic Gemma-3-27B NLA	This AV (paired with AR v0.1)
Round-trip cosine	~0.99 (API field)	0.460 (n=10)
LLM-judge validity (1-5)	3.1	1.0
LLM-judge specificity (1-5)	3.2	1.2
Judge preference (10 rows)	10 preferred	0 preferred

Honest positioning: same output FORMAT class as Anthropic's deployed NLAs (multi-paragraph descriptive text, same canonical "NLAs can produce unexpected or incorrect explanations" disclaimer), but NOT a per-row content-fidelity peer. Anthropic's NLAs name specific entities (Hillary Clinton, Obama, 2016 election) where this AV produces template-clustered generic descriptions. The hardware gap is real: theirs is full-FT + GRPO + 27B base; ours is LoRA + NF4 + 2B base + SFT-only.

Use this AV for: consumer-GPU NLA research, methodology benchmarking, replication of Anthropic's NLA pipeline at small scale. Do not use it for: drawing strong claims about a specific activation from a single AV output without independent verification.

Training setup

Base: google/gemma-4-E2B (2B params, 35 text layers)
Layer: L23 (~2/3 through text-layer stack)
Quantization: NF4 4-bit base + bf16 LoRA + bf16 compute_dtype
LoRA: r=80, α=128, target = language-model self-attn (q/k/v/o); RMSNorm unfrozen
Injection: forward-hook on embedding layer; injection_scale = sqrt(d_model) = 39.2
Corpus: 696 rows of Gemini-CLI persona+audit labels (Dr Chen / Dr Otsuka pipeline) over 9 source families
Optimizer: AdamW 8-bit, lr=1e-4 (no decay)
Batch: micro_batch=1, grad_accum=16 → effective batch 16
SFT steps: 300 max_steps (paused at step_260); this checkpoint is step_250
Hardware: 4 GB GTX 1650 Ti Max-Q laptop
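
The arithmetic behind a few of the settings above can be checked directly. This is a worked sketch, not code from the training run. Two values are assumptions: d_model = 1536 is inferred from the stated sqrt(d_model) = 39.2 rather than given on this card, and the epoch estimate assumes one optimizer update per step and one pass over all 696 rows per epoch.

```python
import math

# Values stated in the training setup.
N_TEXT_LAYERS = 35
LAYER = 23
MICRO_BATCH = 1
GRAD_ACCUM = 16
CHECKPOINT_STEP = 250
CORPUS_ROWS = 696

# Assumption: not stated on the card, inferred from sqrt(d_model) = 39.2.
D_MODEL = 1536

injection_scale = math.sqrt(D_MODEL)           # scale applied to the injected activation
layer_fraction = LAYER / N_TEXT_LAYERS         # depth of L23 in the text-layer stack
effective_batch = MICRO_BATCH * GRAD_ACCUM     # examples per optimizer update
examples_seen = CHECKPOINT_STEP * effective_batch
approx_epochs = examples_seen / CORPUS_ROWS    # assumes one pass over all rows per epoch

print(f"injection_scale ~ {injection_scale:.1f}")
print(f"layer fraction  ~ {layer_fraction:.2f}")
print(f"effective batch = {effective_batch}")
print(f"examples seen   = {examples_seen}")
print(f"approx epochs   ~ {approx_epochs:.1f}")
```

On these assumptions, step_250 corresponds to a little under six passes over the labeled corpus.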

Limitations

NLAs can produce unexpected or incorrect explanations (canonical NLA disclaimer; applies to both this release and Anthropic's deployed NLAs).

Specifically for this AV: it produces fluent paragraph-length descriptions, but per-row content fidelity is lower than Anthropic's deployed NLAs (LLM-judge validity 1.0 vs Anthropic's 3.1 in the 10-row head-to-head above). The AV tends to produce template-clustered descriptions ("country-specific statistical weights", "non-binary identity") that don't reflect the source text's actual topic. Judge the content of AV output independently rather than relying on round-trip cosine alone.

Loading

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit base weights with bf16 compute (see Training setup).
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
                          bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True)
# Load the quantized base model onto the current CUDA device.
base = AutoModelForCausalLM.from_pretrained("google/gemma-4-E2B", quantization_config=bnb,
                                              device_map={"": torch.cuda.current_device()})
# Attach the AV LoRA adapter to the base model.
av = PeftModel.from_pretrained(base, "Solshine/gemma-4-e2b-nla-L23-av-v0_1_dd-step_250")
tok = AutoTokenizer.from_pretrained("google/gemma-4-E2B")

Citation

@misc{gemma4_e2b_nla_v0_1_dd_step_250,
  title  = {Gemma-4-E2B NLA AV v0.1.dd step_250: LoRA + 4-bit-quantized AV on a consumer GPU},
  author = {DeLeeuw, Caleb},
  year   = {2026},
  month  = {may},
  url    = {https://huggingface.co/Solshine/gemma-4-e2b-nla-L23-av-v0_1_dd-step_250}
}

Methodology: Fraser-Taliente, K., et al. (2026). Natural Language Autoencoders. https://transformer-circuits.pub/2026/nla/
