Gemma-4-E2B NLA AV (Activation Verbalizer) — v0.0.1

LoRA adapter for google/gemma-4-E2B that takes a 1536-dimensional residual-stream activation captured at layer 23 and produces a natural-language explanation of what the activation represents.

This is the first non-Anthropic-team open-source NLA Activation Verbalizer released publicly. It was trained end-to-end on a single 4 GB consumer GPU (NVIDIA GTX 1650 Ti Max-Q), following the methodology of Fraser-Taliente, Kantamneni, Ong et al. 2026 (Transformer Circuits).

Pairs with the matched Solshine/gemma-4-e2b-nla-L23-ar-v0_0_1 reconstructor.

Update 2026-05-19 — n=50 two-judge cross-validation, 4-AR Δmse sweep, and the polysemanticity-at-scale caveat

The 2026-05-18 calibration data on this card was extended on 2026-05-19 (source repo FINDINGS.md §F72 Addenda 9-11):

n=50 head-to-head against Anthropic's deployed Gemma-3-27B Layer 41 NLA (5× the original n=10 sample): the Claude judge preferred Anthropic 49/50; the Gemini judge, whose prompt included an explicit parameter-size-gap calibration (it was told that Anthropic's NLA is ~13× larger and uses bf16, full fine-tuning, and GRPO), preferred Anthropic 48/49 (1 tie). Size calibration in the prompt did NOT swing the verdict. Validity scores, Anthropic vs ours: 2.92 vs 1.00 (Claude judge); 4.57 vs 1.20 (Gemini judge).
4-AR per-claim Δmse sweep on Anthropic's published "Characterizing confabulations" probe (n=30 rows, 138-140 claims each): v0.0.1 baseline AR +3.4% FVE, v0.1 paraphrase-invariant +1.8% FVE (worst, by design), v0.2 noise-hinge +7.3% FVE (best, ~73% of Anthropic's low-end 10% published reference), v0.3 cross-row contrastive +4.5% FVE. The noise-hinge family is the most promising AR-side cloud-GPU lever.
4-AR cross-row identity n=50: all 4 ARs are at chance (1-2/50 = 2-4%). v0.2, the best AR on Δmse, is tied for worst on argmax. Δmse sensitivity and per-row identity are genuinely different probes.
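
The cross-row identity probe can be sketched as follows. This is a minimal illustration, assuming the probe scores each AR reconstruction against every true activation by cosine similarity and counts a hit when the best match is the row's own activation. The function names and toy vectors are hypothetical and are not taken from the source repo. With n rows, chance is 1/n, so 1/50 = 2%.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def cross_row_identity(reconstructions, targets):
    """Fraction of rows whose reconstruction is closest to its own target."""
    hits = 0
    for i, rec in enumerate(reconstructions):
        scores = [cosine(rec, t) for t in targets]
        best = max(range(len(targets)), key=scores.__getitem__)
        hits += int(best == i)
    return hits / len(targets)

# Toy data (made up): rows 0 and 1 land near their own target, row 2 does not.
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reconstructions = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.7, 0.2, 0.1]]
accuracy = cross_row_identity(reconstructions, targets)
chance = 1 / len(targets)
```

A result close to `chance` means the reconstructions carry little row-specific information, even if their average cosine to the targets is well above zero.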

Important apples-to-oranges caveat (Addendum 11)

The cross-NLA comparison above measures what 27B-Gemma-3-Anthropic-pipeline produces on a given source text vs what 2B-Gemma-4-E2B-LoRA-pipeline produces on a different activation derived from the same source. Two effects compound, and current data cannot disentangle them:

Training-stack gap: full-FT bf16 + GRPO + 10K-50K steps (Anthropic) vs LoRA + NF4 + ~50-300 SFT steps (ours).
Cross-model activation gap: their NLAs read 27B-Gemma-3 L41 activations; ours reads 2B-Gemma-4-E2B L23 activations. Per Anthropic's own toy-models-of-superposition line of work, polysemanticity-per-neuron scales inversely with model capacity — the 2B L23 activation may intrinsically encode less per-instance specificity than the 27B L41 activation, regardless of NLA training quality.

The clean test that would disentangle these: train an NLA on Gemma-3-27B L41 using our exact recipe (LoRA r=80, NF4 4-bit, ~50-step SFT, same labeled corpus re-extracted at L41), at an estimated cost of ~30-50 A100-hours of cloud GPU time. If L2 cross-row argmax lifts substantially on 27B-at-our-recipe, polysemanticity at 2B scale is the dominant factor and we're near an intrinsic ceiling. If it doesn't, training-stack constraints are the bottleneck and model size is incidental. Flagged for the next cloud-GPU grant.

Honest implication for use

There is no published reference NLA for Gemma-4-E2B L23 specifically. Anthropic's deployed NLAs read different models entirely (Gemma-3-27B L41 / Llama-3.3-70B L53). This pair is the only NLA that reads Gemma-4-E2B L23 activations, full stop. For someone interested in Gemma-4-E2B specifically, our own internal L1a/L1b/L2 metrics (held-out factual/gibber discrimination, per-claim Δmse, cross-row argmax) are the right calibration tools — not the cross-NLA comparison against different-model NLAs.

Source repo discussion: FINDINGS.md §F72 Addenda 9, 10, 11 in SolshineCode/deception-nanochat-sae-research. Public release notes: RELEASE_CALIBRATION.md.

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import numpy as np
import torch

BASE = "google/gemma-4-E2B"
AV_REPO = "Solshine/gemma-4-e2b-nla-L23-av-v0_0_1"

# Injection convention
INJECTION_TOKEN_ID = 249568           # ㊗
INJECTION_LEFT_NEIGHBOR_ID = 236813   # <
INJECTION_RIGHT_NEIGHBOR_ID = 954     # >
INJECTION_CHAR = chr(0x3297)
D_MODEL = 1536
INJECTION_SCALE = float(np.sqrt(D_MODEL))  # = 39.2; matches Gemma-4-E2B token-embed norm

PROMPT = (
    "You are a meticulous AI researcher conducting an important investigation "
    "into activation vectors from a language model. Your overall task is to "
    "describe the semantic content of that activation vector.\n\n"
    "We will pass the vector enclosed in <concept> tags into your context. "
    "You must then produce an explanation for the vector, enclosed within "
    "<explanation> tags. The explanation consists of 2-3 text snippets "
    "describing that vector.\n\nHere is the vector:\n\n"
    f"<concept>{INJECTION_CHAR}</concept>"
)

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
av = PeftModel.from_pretrained(base, AV_REPO)
av.eval()

# At inference: hook the embedding layer to replace ㊗'s embedding with the
# scaled activation vector when the [<, ㊗, >] trio is detected.
pending = {"input_ids": None, "vec": None}

def hook(module, args_in, output):
    # Skip single-token decode steps; injection only applies to the full prompt pass.
    if output.shape[1] <= 1:
        return output
    ids, vec = pending["input_ids"], pending["vec"]
    if ids is None or vec is None:
        return output
    h = output.clone()
    for b in range(ids.shape[0]):
        for p in range(1, ids.shape[1] - 1):
            if (ids[b, p].item() == INJECTION_TOKEN_ID
                    and ids[b, p - 1].item() == INJECTION_LEFT_NEIGHBOR_ID
                    and ids[b, p + 1].item() == INJECTION_RIGHT_NEIGHBOR_ID):
                h[b, p] = vec[b].to(h.dtype)
                break  # inject at the first matching position only
    return h

av.get_input_embeddings().register_forward_hook(hook)

# Use
activation_vector = np.random.randn(D_MODEL).astype(np.float32)  # your 1536-d L23 activation
scaled = activation_vector / (np.linalg.norm(activation_vector) + 1e-9) * INJECTION_SCALE
ids = tok.encode(PROMPT, return_tensors="pt").to(av.device)
pending["input_ids"] = ids
pending["vec"] = torch.from_numpy(scaled).to(av.device).unsqueeze(0)

with torch.no_grad():
    out = av.generate(input_ids=ids, max_new_tokens=120, do_sample=False, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

Working end-to-end round-trip example with the matched AR: examples/round_trip_example.py in the bundled public repo.
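
Because the prompt asks for the explanation inside <explanation> tags, the decoded text usually needs a small parsing step before it is passed to a judge or to the AR. The helper below is a minimal sketch. The name extract_explanation is hypothetical, and the fallback behavior (tolerating a missing closing tag when generation stops at max_new_tokens, and returning the raw text when no tag appears) is an assumption rather than something the bundled repo is known to do.

```python
import re

def extract_explanation(decoded: str) -> str:
    """Return the text inside the first <explanation> block.

    Falls back to everything after the opening tag if the closing tag is
    missing, and to the raw text if there is no opening tag at all.
    """
    match = re.search(
        r"<explanation>(.*?)(?:</explanation>|\Z)", decoded, flags=re.DOTALL
    )
    return (match.group(1) if match else decoded).strip()

complete = extract_explanation("<explanation>A news report.</explanation> extra")
truncated = extract_explanation("<explanation>A news report about")
untagged = extract_explanation("A news report.")
```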

Training setup

Base model: google/gemma-4-E2B (2B parameters, 35 text layers)
Activation layer: L23 residual stream
Quantization: NF4 4-bit base weights + fp16 LoRA adapters
LoRA config: r=64, α=128, target modules = model.language_model.layers.\d+.(self_attn|mlp).(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj) (language-model layers only; excludes audio tower)
Injection mechanism: forward hook on embedding layer; replaces ㊗ token's embedding with the L2-normalized activation rescaled to injection_scale = sqrt(d_model) = 39.2 (matches the empirically-measured Gemma-4-E2B token-embedding norm of 39.25)
Optimizer: AdamW 8-bit, lr=1e-4
Batch: micro_batch=1, grad_accum=16 → effective batch 16
Max length: 512 tokens
SFT steps: 15
Hardware: single 4 GB NVIDIA GTX 1650 Ti Max-Q (laptop)
Total wall time: ~3 GPU-hours end-to-end (including base-model NF4 load)
Training corpus: 2,548 (text, L23 activation, gpt-4o-mini-labeled explanation) triples on the v0.0.x baseline pipeline
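
To see which modules the LoRA target pattern above selects, the pattern can be checked against candidate module names. This is a minimal sketch, assuming the pattern is applied as a full-match regular expression against each module name (which is, to our understanding, how PEFT treats a string-valued target_modules). The candidate names are hypothetical examples, not a listing of the real model's modules.

```python
import re

# Pattern copied from the LoRA config above. The dots are unescaped, so each
# one matches any single character rather than only a literal ".".
TARGET_PATTERN = (
    r"model.language_model.layers.\d+.(self_attn|mlp)."
    r"(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)"
)

# Hypothetical module names, for illustration only.
candidates = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.34.mlp.down_proj",
    "model.audio_tower.layers.0.self_attn.q_proj",
    "model.language_model.embed_tokens",
]

selected = [name for name in candidates if re.fullmatch(TARGET_PATTERN, name)]
```

Only the two language-model projection layers are selected here; the audio-tower name and the embedding table are rejected because the full match requires the literal language_model prefix and a projection suffix.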

Headline numbers (v0.0.1)

Round-trip cosine (paired with Solshine/gemma-4-e2b-nla-L23-ar-v0_0_1): 0.438 ± 0.054 on n=42 effective held-out activations, 100% above the 0.30 noise floor.
AV SFT loss slope at convergence: −0.0028/step from a linear regression on raw loss (verdict: descending, R² ≥ 0.10).
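
The round-trip summary above (mean ± spread, plus the share of rows above the noise floor) can be computed along these lines. This is a minimal sketch: the card does not say whether the ± figure is a standard deviation or a standard error, so the sample standard deviation used here is an assumption, and the cosine values are made up rather than taken from the held-out set.

```python
import statistics

def summarize_round_trip(cosines, noise_floor=0.30):
    """Mean, sample standard deviation, and share of rows above the noise floor."""
    mean = statistics.fmean(cosines)
    std = statistics.stdev(cosines)
    above = sum(c > noise_floor for c in cosines) / len(cosines)
    return mean, std, above

# Made-up cosine values, not the release's held-out data.
example = [0.40, 0.45, 0.50, 0.38, 0.47]
mean, std, above = summarize_round_trip(example)
```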

What makes this release distinctive

First non-Anthropic-team open-source NLA AV at any model scale. As of 2026-05, every other NLA on HuggingFace Hub is under the kitft account, which belongs to Kit Fraser-Taliente (the paper's first author) and serves as Anthropic's official reference. v0.0.1 is the second-source replication.
Consumer-GPU trainable. End-to-end training fits on a 4 GB laptop GPU. The methodology descope (NF4 + LoRA + small corpus + ≤300 SFT steps vs Anthropic's full bf16 fine-tune on 8–64 H100s) is documented per parameter.
Full open reproducibility chain in the bundled repo: Stage 0 (extraction) → Stage 1 (split) → Stage 2 (LLM-judge labeling) → Stage 3 (training-format build) → SFT → eval.

Limitations

NLAs can produce unexpected or incorrect explanations. Specifically for this release:

Thematic-correctness with detail-level confabulation is the realistic output class. The AV typically identifies the broad topic of the activation correctly (genre, dominant entity type, structural pattern) and confabulates specific tokens or examples that don't appear in the source. This matches the qualitative behavior documented for larger NLAs in the published literature; the small-model version here shows more confabulation per output.
Round-trip cosine has a structural-projection component. Replicating the published §"Measuring steganography" and §"Characterizing confabulations" tests on v0.0.1: paraphrasing the AV output moves the round-trip cosine by ~3% (Δcos = +0.014); removing entire claims from the AV output moves cosine by ~0% per claim (Δcos = +0.001 per claim). Most of the v0.0.1 round-trip-cosine signal is the AR's structural projection toward "somewhere in OpenWebText L23 activation space," not the explanation's specific content. Use AV-side per-row content-fidelity judging (validity × specificity × relatedness rubric) alongside round-trip cosine, never round-trip cosine alone.
Template-heavy outputs. Inspection shows ~80% of held-out-row outputs share a small set of structural templates with content-conditional fill-in slots. Use multiple feature angles + content-judge scoring rather than treating any single output as a verbatim summary of the activation.
Hardware-bound quality ceiling. Numbers reflect a single 4 GB GTX 1650 Ti Max-Q. Larger consumer GPUs with bf16 + full fine-tune + larger corpus would close some of the qualitative gap with the published reference NLAs.
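
The per-claim ablation check described in the limitations above can be sketched as follows. This is an illustration under stated assumptions: per_claim_delta and toy_score are hypothetical names, the sign convention (ablated score minus full score) is assumed, and toy_score is a stand-in for the real AV → AR → cosine pipeline, which is not reproduced here.

```python
def per_claim_delta(claims, score):
    """Change in score when each claim is removed in turn.

    `score` is any callable mapping explanation text to a round-trip cosine.
    """
    full = score(" ".join(claims))
    deltas = []
    for i in range(len(claims)):
        ablated = " ".join(claims[:i] + claims[i + 1:])
        deltas.append(score(ablated) - full)
    return deltas

def toy_score(text):
    """Toy scorer (assumption): only the word 'council' carries any signal."""
    return 0.45 if "council" in text else 0.40

claims = ["A news report.", "Mentions a city council.", "Formal tone."]
deltas = per_claim_delta(claims, toy_score)
```

Deltas near zero for every claim, as reported for v0.0.1, indicate that the score barely depends on what the individual claims say.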

Full development history including a methodology-bug retraction (§F72) and the autonomous-research-process retrospective: HISTORY.md.

Sidecar (training provenance YAML)

The companion nla_meta.yaml records training-time hyperparameters for round-tripping at inference. Read injection_scale from this file rather than hardcoding to avoid train-test mismatches.
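
Reading the scale from the sidecar rather than hardcoding it might look like the sketch below. The layout of nla_meta.yaml is not shown on this card, so a top-level injection_scale key is an assumption, as are the helper names and the example file contents. The line scan keeps the example dependency-free; yaml.safe_load from PyYAML is the more robust choice when it is installed.

```python
import math
import re

def read_injection_scale(sidecar_text, d_model):
    """Return injection_scale from the sidecar text, else sqrt(d_model)."""
    match = re.search(
        r"^\s*injection_scale\s*:\s*([0-9.eE+-]+)\s*(?:#.*)?$",
        sidecar_text,
        flags=re.MULTILINE,
    )
    return float(match.group(1)) if match else math.sqrt(d_model)

def rescale(vector, scale):
    """L2-normalize the vector, then multiply it by scale."""
    norm = math.sqrt(sum(x * x for x in vector)) + 1e-9
    return [x / norm * scale for x in vector]

# Hypothetical sidecar contents, for illustration only.
sidecar_text = "d_model: 1536\ninjection_scale: 39.19\n"
scale = read_injection_scale(sidecar_text, d_model=1536)
scaled = rescale([3.0, 4.0], scale)
fallback = read_injection_scale("", d_model=1536)
```

The fallback reproduces the sqrt(d_model) convention used in the usage snippet, so a missing key degrades to the documented default instead of raising an error.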

Citation

@article{frasertaliente2026nla,
  title={Natural Language Autoencoders},
  author={Fraser-Taliente, Kit and Kantamneni, Kshitij and Ong, Antonia and others},
  journal={Transformer Circuits},
  year={2026},
  url={https://transformer-circuits.pub/2026/nla/}
}

@misc{deleeuw2026nlagemma4e2bav,
  title={Gemma-4-E2B NLA AV (v0.0.1): a 4 GB consumer-GPU Activation Verbalizer},
  author={DeLeeuw, Caleb (SolshineCode)},
  year={2026},
  url={https://huggingface.co/Solshine/gemma-4-e2b-nla-L23-av-v0_0_1}
}

License

CC-BY 4.0. See LICENSE in the bundled repo.
