Gemma-4-E2B NLA AV (Activation Verbalizer) — v0.0.1
LoRA adapter for google/gemma-4-E2B that takes a 1536-dimensional residual-stream activation captured at layer 23 and produces a natural-language explanation of what the activation represents.
This is the first non-Anthropic-team open-source NLA Activation Verbalizer released publicly. Trained end-to-end on a single 4 GB consumer GPU (NVIDIA GTX 1650 Ti Max-Q) following the methodology of Fraser-Taliente, Kantamneni, Ong et al. 2026 (Transformer Circuits).
Pairs with the matched Solshine/gemma-4-e2b-nla-L23-ar-v0_0_1 reconstructor.
Update 2026-05-19 — n=50 two-judge cross-validation, 4-AR Δmse sweep, and the polysemanticity-at-scale caveat
The 2026-05-18 calibration data on this card was extended on 2026-05-19 (source repo FINDINGS.md §F72 Addenda 9-11):
- n=50 head-to-head against Anthropic's deployed Gemma-3-27B Layer 41 NLA (5× the original n=10 sample): the Claude judge preferred Anthropic in 49/50 rows; the Gemini judge, with explicit param-size-gap calibration in the prompt (told that Anthropic's model is ~13× larger and trained with bf16 + full-FT + GRPO), preferred Anthropic in 48/49 decided rows (1 tie). Size calibration in the prompt did NOT swing the verdict. Validity scores, Anthropic vs ours: 2.92 vs 1.00 (Claude judge); 4.57 vs 1.20 (Gemini judge).
- 4-AR per-claim Δmse sweep on Anthropic's published "Characterizing confabulations" probe (n=30 rows, 138-140 claims each): v0.0.1 baseline AR +3.4% FVE, v0.1 paraphrase-invariant +1.8% FVE (worst, by design), v0.2 noise-hinge +7.3% FVE (best, ~73% of Anthropic's low-end 10% published reference), v0.3 cross-row contrastive +4.5% FVE. The noise-hinge family is the most promising AR-side cloud-GPU lever.
- 4-AR cross-row identity n=50: all 4 ARs at chance (1-2/50 = 2-4%). v0.2-best-Δmse is tied for worst-argmax. Δmse-sensitivity and per-row identity are genuinely different probes.
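The card does not spell out how the cross-row identity probe is scored. One plausible reading, offered here only as an assumption, is nearest-neighbour retrieval: each AR reconstruction is compared by cosine similarity against every true activation in the batch, and a row counts as a hit when its own activation is the argmax. The function name and toy vectors below are hypothetical; the sketch is dependency-free, though NumPy or torch would be the practical choice.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def cross_row_argmax_accuracy(reconstructions, originals):
    """Fraction of rows whose reconstruction is closest to its own original.

    Assumed definition: row i is a hit when originals[i] has the highest
    cosine similarity to reconstructions[i] among all originals.
    """
    hits = 0
    for i, rec in enumerate(reconstructions):
        sims = [cosine(rec, orig) for orig in originals]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(best == i)
    return hits / len(reconstructions)

# Toy data (not real activations): row 1's reconstruction collapses onto row 0.
toy_originals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_recs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9]]
toy_accuracy = cross_row_argmax_accuracy(toy_recs, toy_originals)
chance_at_n50 = 1 / 50  # 0.02, the chance level for n=50 rows
print(toy_accuracy, chance_at_n50)
```

Under this reading, 1-2 hits out of 50 sits at the 1/50 chance level, which is consistent with the "at chance" description above.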
Important apples-to-oranges caveat (Addendum 11)
The cross-NLA comparison above measures what 27B-Gemma-3-Anthropic-pipeline produces on a given source text vs what 2B-Gemma-4-E2B-LoRA-pipeline produces on a different activation derived from the same source. Two effects compound, and current data cannot disentangle them:
- Training-stack gap: full-FT bf16 + GRPO + 10K-50K steps (Anthropic) vs LoRA + NF4 + ~50-300 SFT steps (ours).
- Cross-model activation gap: their NLAs read 27B-Gemma-3 L41 activations; ours reads 2B-Gemma-4-E2B L23 activations. Per Anthropic's own toy-models-of-superposition line of work, polysemanticity-per-neuron scales inversely with model capacity — the 2B L23 activation may intrinsically encode less per-instance specificity than the 27B L41 activation, regardless of NLA training quality.
The clean test that would disentangle these: train an NLA on Gemma-3-27B L41 using our exact recipe (LoRA r=80, NF4 4-bit, ~50-step SFT, same labeled corpus re-extracted at L41), ~30-50 A100-hr cloud GPU. If L2 cross-row argmax lifts substantially on 27B-at-our-recipe, polysemanticity-at-2B-scale is the dominant factor and we're near an intrinsic ceiling. If it doesn't, training-stack constraints are the bottleneck and model size is incidental. Flagged for the next cloud-GPU grant.
Honest implication for use
There is no published reference NLA for Gemma-4-E2B L23 specifically. Anthropic's deployed NLAs read different models entirely (Gemma-3-27B L41 / Llama-3.3-70B L53). This pair is the only NLA that reads Gemma-4-E2B L23 activations, full stop. For someone interested in Gemma-4-E2B specifically, our own internal L1a/L1b/L2 metrics (held-out factual/gibber discrimination, per-claim Δmse, cross-row argmax) are the right calibration tools — not the cross-NLA comparison against different-model NLAs.
Source repo discussion: FINDINGS.md §F72 Addenda 9, 10, 11 in SolshineCode/deception-nanochat-sae-research. Public release notes: RELEASE_CALIBRATION.md.
How to use
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
import numpy as np
import torch
BASE = "google/gemma-4-E2B"
AV_REPO = "Solshine/gemma-4-e2b-nla-L23-av-v0_0_1"
# Injection convention
INJECTION_TOKEN_ID = 249568 # ㊗
INJECTION_LEFT_NEIGHBOR_ID = 236813 # <
INJECTION_RIGHT_NEIGHBOR_ID = 954 # >
INJECTION_CHAR = chr(0x3297)
D_MODEL = 1536
INJECTION_SCALE = float(np.sqrt(D_MODEL)) # = 39.2; matches Gemma-4-E2B token-embed norm
PROMPT = (
"You are a meticulous AI researcher conducting an important investigation "
"into activation vectors from a language model. Your overall task is to "
"describe the semantic content of that activation vector.\n\n"
"We will pass the vector enclosed in <concept> tags into your context. "
"You must then produce an explanation for the vector, enclosed within "
"<explanation> tags. The explanation consists of 2-3 text snippets "
"describing that vector.\n\nHere is the vector:\n\n"
f"<concept>{INJECTION_CHAR}</concept>"
)
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_quant_type="nf4")
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
av = PeftModel.from_pretrained(base, AV_REPO); av.eval()
# At inference: hook the embedding layer to replace ㊗'s embedding with the
# scaled activation vector when the [<, ㊗, >] trio is detected.
pending = {"input_ids": None, "vec": None}
def hook(module, args_in, output):
    if output.shape[1] <= 1:
        return output
    ids = pending["input_ids"]
    vec = pending["vec"]
    if ids is None or vec is None:
        return output
    h = output.clone()
    for b in range(ids.shape[0]):
        for p in range(1, ids.shape[1] - 1):
            if (ids[b, p].item() == INJECTION_TOKEN_ID
                    and ids[b, p - 1].item() == INJECTION_LEFT_NEIGHBOR_ID
                    and ids[b, p + 1].item() == INJECTION_RIGHT_NEIGHBOR_ID):
                h[b, p] = vec[b].to(h.dtype)
                break
    return h
av.get_input_embeddings().register_forward_hook(hook)
# Use
activation_vector = np.random.randn(D_MODEL).astype(np.float32) # your 1536-d L23 activation
scaled = activation_vector / (np.linalg.norm(activation_vector) + 1e-9) * INJECTION_SCALE
ids = tok.encode(PROMPT, return_tensors="pt").to(av.device)
pending["input_ids"] = ids
pending["vec"] = torch.from_numpy(scaled).to(av.device).unsqueeze(0)
with torch.no_grad():
    out = av.generate(input_ids=ids, max_new_tokens=120, do_sample=False, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
Working end-to-end round-trip example with the matched AR: examples/round_trip_example.py in the bundled public repo.
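The hook in the snippet above only replaces an embedding when it finds the exact [<, ㊗, >] token trio; if the prompt tokenizes differently, it leaves the embeddings untouched and generation proceeds without the activation. A minimal pre-flight check, sketched here with a hypothetical helper name and toy id sequences, can confirm the trio is present before calling generate.

```python
# Token ids copied from the usage snippet above.
INJECTION_TOKEN_ID = 249568          # ㊗
INJECTION_LEFT_NEIGHBOR_ID = 236813  # <
INJECTION_RIGHT_NEIGHBOR_ID = 954    # >

def find_injection_position(token_ids):
    """Return the index of the first ㊗ flanked by < and >, or None if absent."""
    for p in range(1, len(token_ids) - 1):
        if (token_ids[p] == INJECTION_TOKEN_ID
                and token_ids[p - 1] == INJECTION_LEFT_NEIGHBOR_ID
                and token_ids[p + 1] == INJECTION_RIGHT_NEIGHBOR_ID):
            return p
    return None

# Toy id sequences for illustration (not real tokenizer output).
print(find_injection_position([2, 236813, 249568, 954, 7]))  # 2
print(find_injection_position([2, 249568, 954, 7]))          # None
```

With the real tokenizer, something like `find_injection_position(ids[0].tolist())` returning None would indicate that the prompt was not tokenized into the expected trio and that the activation would never be injected.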
Training setup
- Base model: google/gemma-4-E2B (2B parameters, 35 text layers)
- Activation layer: L23 residual stream
- Quantization: NF4 4-bit base weights + fp16 LoRA adapters
- LoRA config: r=64, α=128, target modules = model.language_model.layers.\d+.(self_attn|mlp).(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj) (language-model layers only; excludes audio tower)
- Injection mechanism: forward hook on embedding layer; replaces ㊗ token's embedding with the L2-normalized activation rescaled to injection_scale = sqrt(d_model) = 39.2 (matches the empirically-measured Gemma-4-E2B token-embedding norm of 39.25)
- Optimizer: AdamW 8-bit, lr=1e-4
- Batch: micro_batch=1, grad_accum=16 → effective batch 16
- Max length: 512 tokens
- SFT steps: 15
- Hardware: single 4 GB NVIDIA GTX 1650 Ti Max-Q (laptop)
- Total wall time: ~3 GPU-hours end-to-end (including base-model NF4 load)
- Training corpus: 2,548 (text, L23 activation, gpt-4o-mini-labeled explanation) triples on the v0.0.x baseline pipeline
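The injection scaling listed above (L2-normalize, then rescale to sqrt(d_model)) can be checked with a few lines of arithmetic. This is a dependency-free sketch with a hypothetical helper name and a synthetic input vector; the real pipeline operates on NumPy arrays or torch tensors.

```python
import math

D_MODEL = 1536
INJECTION_SCALE = math.sqrt(D_MODEL)

def scale_for_injection(vec, scale=INJECTION_SCALE, eps=1e-9):
    """L2-normalize vec, then rescale it so its norm is approximately `scale`."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) * scale for x in vec]

# Synthetic stand-in for a 1536-d L23 activation (not real data).
raw = [0.01 * ((i % 7) - 3) for i in range(D_MODEL)]
scaled = scale_for_injection(raw)
scaled_norm = math.sqrt(sum(x * x for x in scaled))

print(round(INJECTION_SCALE, 2))  # 39.19
print(round(scaled_norm, 2))      # 39.19
```

Whatever the magnitude of the captured activation, the injected vector ends up with the same norm, which is the property the injection mechanism relies on to stay close to the token-embedding norm.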
Headline numbers (v0.0.1)
- Round-trip cosine (paired with Solshine/gemma-4-e2b-nla-L23-ar-v0_0_1): 0.438 ± 0.054 on n=42 effective held-out activations, 100% above the 0.30 noise floor.
- AV SFT loss slope at convergence: −0.0028/step from a linear regression on raw loss (descending verdict, R² ≥ 0.10).
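A round-trip cosine summary of the kind reported above can be computed as follows. This is a minimal sketch: it assumes the ± figure is a sample standard deviation (the card does not say whether it is a standard deviation or a standard error), the function names are hypothetical, and the vectors are toy data.

```python
import math
import statistics

NOISE_FLOOR = 0.30  # floor value quoted in the headline numbers

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-12)

def summarize_round_trip(originals, reconstructions, floor=NOISE_FLOOR):
    """Mean, sample std, and fraction above the floor of per-row cosines."""
    cos = [cosine(o, r) for o, r in zip(originals, reconstructions)]
    return {
        "n": len(cos),
        "mean": statistics.mean(cos),
        "std": statistics.stdev(cos) if len(cos) > 1 else 0.0,
        "frac_above_floor": sum(c > floor for c in cos) / len(cos),
    }

# Toy vectors: the first pair is partially aligned, the second is orthogonal.
toy_summary = summarize_round_trip(
    originals=[[1.0, 0.0], [0.0, 1.0]],
    reconstructions=[[1.0, 1.0], [1.0, 0.0]],
)
print(toy_summary)
```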
What makes this release distinctive
- First non-Anthropic-team open-source NLA AV at any model scale. As of 2026-05, every other NLA on HuggingFace Hub is under the kitft account (Kit Fraser-Taliente, the paper's first author and Anthropic's official reference). v0.0.1 is the second-source replication.
- Consumer-GPU trainable. End-to-end training fits on a 4 GB laptop GPU. The methodology descope (NF4 + LoRA + small corpus + ≤300 SFT steps vs Anthropic's full bf16 fine-tune on 8–64 H100s) is documented per parameter.
- Full open reproducibility chain in the bundled repo: Stage 0 (extraction) → Stage 1 (split) → Stage 2 (LLM-judge labeling) → Stage 3 (training-format build) → SFT → eval.
Limitations
NLAs can produce unexpected or incorrect explanations. Specifically for this release:
- Thematic-correctness with detail-level confabulation is the realistic output class. The AV typically identifies the broad topic of the activation correctly (genre, dominant entity type, structural pattern) and confabulates specific tokens or examples that don't appear in the source. This matches the qualitative behavior documented for larger NLAs in the published literature; the small-model version here shows more confabulation per output.
- Round-trip cosine has a structural-projection component. Replicating the published §"Measuring steganography" and §"Characterizing confabulations" tests on v0.0.1: paraphrasing the AV output moves the round-trip cosine by ~3% (Δcos = +0.014); removing entire claims from the AV output moves cosine by ~0% per claim (Δcos = +0.001 per claim). Most of the v0.0.1 round-trip-cosine signal is the AR's structural projection toward "somewhere in OpenWebText L23 activation space," not the explanation's specific content. Use AV-side per-row content-fidelity judging (validity × specificity × relatedness rubric) alongside round-trip cosine, never round-trip cosine alone.
- Template-heavy outputs. Inspection shows ~80% of held-out-row outputs share a small set of structural templates with content-conditional fill-in slots. Use multiple feature angles + content-judge scoring rather than treating any single output as a verbatim summary of the activation.
- Hardware-bound quality ceiling. Numbers reflect a single 4 GB GTX 1650 Ti Max-Q. Larger consumer GPUs with bf16 + full fine-tune + larger corpus would close some of the qualitative gap with the published reference NLAs.
Full development history including a methodology-bug retraction (§F72) and the autonomous-research-process retrospective: HISTORY.md.
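The claim-removal test described under Limitations can be sketched as a leave-one-out loop: score the full explanation, then re-score it with each claim removed and record the change. Everything here is illustrative; `score_fn` stands in for a real AV-to-AR round-trip cosine, the sign convention (full minus ablated) is an assumption, and the toy scorer is invented purely to make the example runnable.

```python
def per_claim_delta(claims, score_fn):
    """Leave-one-out deltas: full score minus score with claim i removed.

    A delta near zero means the score barely depends on that claim.
    """
    full = score_fn(claims)
    deltas = []
    for i in range(len(claims)):
        ablated = claims[:i] + claims[i + 1:]
        deltas.append(full - score_fn(ablated))
    return full, deltas

def toy_score(claims):
    """Hypothetical scorer: a constant baseline plus a bonus for one keyword."""
    bonus = 0.02 if any("Paris" in c for c in claims) else 0.0
    return 0.40 + bonus

toy_claims = ["about European travel", "mentions Paris", "list-like structure"]
full_score, claim_deltas = per_claim_delta(toy_claims, toy_score)
print(round(full_score, 3), [round(d, 3) for d in claim_deltas])
```

In the toy case only the second claim moves the score. The v0.0.1 finding reported above corresponds to the situation where nearly every delta is close to zero, which is why content-fidelity judging is recommended alongside round-trip cosine.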
Sidecar (training provenance YAML)
The companion nla_meta.yaml records the training-time hyperparameters needed to reproduce the injection setup at inference. Read injection_scale from this file rather than hardcoding it, so that the inference-time scale cannot drift from the value used in training.
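One way to follow this advice is to resolve the scale from the parsed sidecar and fall back to sqrt(d_model) only when the key is missing. The sketch below assumes `injection_scale` is a top-level key in nla_meta.yaml, which the card does not confirm, and takes an already-parsed dict (for example from PyYAML's `yaml.safe_load`) so that it stays dependency-free. The function name is hypothetical.

```python
import math

def resolve_injection_scale(meta, d_model=1536):
    """Return (scale, source) using the sidecar value when it is present.

    `meta` is the dict parsed from nla_meta.yaml; the top-level
    `injection_scale` key is an assumed layout, not a documented one.
    """
    value = meta.get("injection_scale") if meta else None
    if value is None:
        return math.sqrt(d_model), "fallback: sqrt(d_model)"
    return float(value), "sidecar"

# Toy dicts standing in for parsed YAML.
scale_a, source_a = resolve_injection_scale({"injection_scale": 39.19})
scale_b, source_b = resolve_injection_scale({})
print(scale_a, source_a)
print(round(scale_b, 2), source_b)
```

Returning the source alongside the value makes it easy to log whether a run used the recorded scale or the fallback.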
Citation
@article{frasertaliente2026nla,
title={Natural Language Autoencoders},
author={Fraser-Taliente, Kit and Kantamneni, Kshitij and Ong, Antonia and others},
journal={Transformer Circuits},
year={2026},
url={https://transformer-circuits.pub/2026/nla/}
}
@misc{deleeuw2026nlagemma4e2bav,
title={Gemma-4-E2B NLA AV (v0.0.1): a 4 GB consumer-GPU Activation Verbalizer},
author={DeLeeuw, Caleb (SolshineCode)},
year={2026},
url={https://huggingface.co/Solshine/gemma-4-e2b-nla-L23-av-v0_0_1}
}
License
CC-BY 4.0. See LICENSE in the bundled repo.