- gemma4_e2b_gf1_behaviorSAE_topk_k16_L17_bnb4bit
- Training-data caveat — please read before use
- Classification methodology (please read before using)
- Important: 4-bit quantization caveat
- Architecture
- Training distribution
- SAE quality
- Hypothesis-relevant d measurements
- Downstream test results (this session)
- Loading (sidecar venv + HF Transformers)
- Citation / context
- Training-data caveat — please read before use
gemma4_e2b_gf1_behaviorSAE_topk_k16_L17_bnb4bit
Sparse Autoencoder trained on residual-stream activations from google/gemma-4-E2B at layer 17, extracted during 4-bit quantized inference.
Training-data caveat — please read before use
The "honest" and "deceptive" labels in this SAE's training pool come from role-play scenarios that pre-assigned the model a specific role (for example, "You are a Werewolf" or "You have been assigned the role of Fascist"). A completion was labeled "deceptive" when the model drifted away from the assigned role and "honest" when it echoed it.
What this SAE actually discriminates: role-consistency vs. role-drift in short role-play completions under a 4-bit-quantized base model at temperature 1.0.
What this SAE does NOT measure: emergent, incentive-driven, or strategically motivated deception. Scenarios that assign a deceptive identity are instructed-misalignment prompts; the behavioral variation observed is downstream of the role instruction, not of any goal the model is pursuing on its own.
Recommended uses.
- A reference implementation of SAE training on Gemma 4 residual-stream activations under bitsandbytes 4-bit quantization.
- A baseline for same-precision SAE portability experiments on Gemma 4 E2B.
- A role-play-coherence feature set for research on how base models track assigned identities across autoregressive sampling.
Not recommended for: deception detection, lie detection, or safety-relevant monitoring of emergent model behavior. The training-data label ontology above is not designed to support those claims. A separate Gemma 4 behavior-SAE release trained on decision-incentive scenarios (no pre-assigned deceptive identity) is in preparation; this card will be updated with a link when that release is public.
The weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the pipeline are unaffected by this caveat — only the semantic interpretation of the "honest" vs "deceptive" labels is constrained by how those labels were generated.
Classification methodology (please read before using)
The honest-vs-deceptive labels on the training and evaluation pools for this SAE were assigned by a strict keyword match on the first 30 characters of each model completion. Consumers of this SAE should know the label-generating process up front:
- Keyword classifier has high precision on its decisive labels
but low recall (drops completions that commit via paraphrase as
"ambiguous"). Scenarios without single-word anchors (e.g.,
day_night) particularly under-recall. - Multi-judge consortium re-classification has been completed on
the companion JSONL for this run (
*_completions.jsonl) using Gemini 2.5 Flash + Claude Haiku (Claude Sonnet deferred for credit conservation; scheduled re-run). The consortium-labelled JSONL is committed alongside this SAE's result file (*_multijudge_batched.jsonl). - Both LLM judges agreed with the keyword classifier on ~90% of keyword-decisive samples. Keyword's committed labels are defensible. The main difference is recall: the consortium recovers a meaningful fraction of the keyword-"ambiguous" bin as decisive labels, which matters for downstream probe / SAE- training pool size.
- Users who want a richer pool can use the consortium-labelled
JSONL directly via the upstream repo's
run_phase_c_consortium_behavior_sae.pyre-trainer, which re-pools under consortium labels and re-forward- passes the stored completions for residual caching. - The multijudge module is reusable:
run_multijudge_classification.pywith--judges keyword,behavioral,gemini_batch,claude_haiku_batchand--batch-size 10(10-sample batches for credit efficiency).
Important: 4-bit quantization caveat
Activations for this SAE were collected from the model loaded in bitsandbytes 4-bit nf4 with double-quant, fp16 compute. Findings characterize the quantized checkpoint, not the reference fp16 deployment. Cross-precision transfer is untested.
Exact training configuration
Base model:
google/gemma-4-E2B(Gemma 4 E2B, 5.1B params, 35 layers, d_model=1536)GPU: NVIDIA GeForce GTX 1650 Ti with Max-Q Design (4 GB VRAM, driver 581.57, compute capability 7.5 / Turing)
Software stack: torch 2.10.0+cu128, transformers 5.5.4, bitsandbytes 0.44+ (sidecar venv; main repo venv uses transformer_lens 2.x which pins transformers 4.x and does not recognize
model_type=gemma4)Quantization config:
BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.float16, )Effective weight precision: ~3.5 bits/parameter (nf4 + double-quant on block scales). Activations and residuals are fp16 throughout.
Portability theory — will this SAE work at other precisions?
Short answer: untested. The SAE feature directions were learned from 4-bit+fp16-compute residuals. Whether they transfer to other precisions is an open empirical question.
Expected portability, in descending order of confidence:
- Same model, same precision, different GPU (e.g. H100): HIGH
confidence of clean transfer. The SAE is a small
nn.Linearstack; fp16 residuals from the same nf4 forward pass should be near-identical. - Same model, 8-bit (
load_in_8bit=True), fp16 compute: MEDIUM-HIGH confidence of partial transfer. 8-bit preserves more weight precision than 4-bit; fp16 compute path unchanged. Likely some EV degradation. - Same model, unquantized fp16: MEDIUM confidence. Quantization noise is typically <5% of residual magnitude, so fp16 residuals live in the same neighborhood but not identical. Top-feature identities likely still meaningful; d_max values may shift.
- Same model, bf16 compute: MEDIUM confidence. Wider exponent, fewer mantissa bits; similar magnitude but different fine structure.
- Same model, fp8 compute (H100 native path): LOW confidence. fp8 introduces its own quantization noise structure.
- E2B-IT (instruction-tuned, same d_model): LOW confidence (likely no transfer). Different post-training distribution. Cross-model SAE transfer has been shown to fail even at shared d_in in prior work on deception-detection SAE decoders (near-chance transfer accuracies across Llama / TinyLlama / Qwen at d_in=2048). A separate IT SAE is published under a different HF repo name.
- E4B (larger variant): Zero transfer. Different d_model (2560 vs 1536), different layer count, different residual distribution. Retraining required.
Recommended portability test (not yet run — requires >8 GB VRAM)
- Load
google/gemma-4-E2Bin fp16 withoutBitsAndBytesConfig. - Run the same prompts used during SAE training through the fp16 forward pass, capturing residuals at layer 17.
- Feed those residuals through this SAE.
- Measure explained variance, L0 sparsity, alive-feature count.
- Compare to training-time metrics (documented below).
If EV drops >20% or alive-features shift >50%, retrain on fp16 residuals from scratch rather than using this checkpoint.
What this means for scientific claims
Every result reported with this SAE characterizes the 4-bit quantized Gemma 4 E2B checkpoint. Findings may or may not hold for other precisions. See HARDWARE_AND_PRECISION_PORTABILITY.md in the companion source repository for the full precision-portability theorization (exact GPU model, driver, CUDA runtime, sidecar venv spec, and a 7-regime expected-portability table).
Architecture
- architecture: TopK SAE (k=16)
- d_in: 1536
- d_sae: 6144 (expansion factor 4x)
- dtype: float32
Training distribution
4-bit quantized inference, same-prompt behavior-balanced pool
SAE quality
- explained_variance: 0.994
- L0 (mean active features per sample): 16.0
- alive features: 136
- MSE: 0.02010
- num training activations: 38
- train time: 5.2s
Hypothesis-relevant d measurements
SAE type: behavior
Behavior split (primary claim — same-prompt V3 temperature-sampled honest vs deceptive completions):
- behavior_train d_max: 0.6077714562416077
- behavior_eval d_max: 1.1534029245376587
- behavior_train d>1 count: 0
Input-variation split (DIAGNOSTIC — different prompts, neutral vs V0/V3 sandbag templates):
- input neutral-vs-V0 d_max: 1.4257383346557617
- input neutral-vs-V3 d_max: 2.0484461784362793
Diagnostic numbers exist to test the working hypothesis that residual-stream SAEs on these layers primarily encode input conditions rather than generation processes. A SAE trained only on behavior-split activations that still separates input conditions supports the hypothesis.
Downstream test results (this session)
T9 — Quantization-sensitivity (3 seeds)
Re-sampled same-prompt behavior split at 3 different seeds; computed d_max through THIS SAE.
- seed=101: reseeded d_max = 0.561, abs delta = 0.192, top-5 feature-ID overlap = 2/5, verdict = STABLE
- seed=107: reseeded d_max = 0.775, abs delta = 0.022, top-5 feature-ID overlap = 1/5, verdict = STABLE
- seed=113: reseeded d_max = 0.643, abs delta = 0.110, top-5 feature-ID overlap = 1/5, verdict = STABLE
- Original (GF-1 default seed) d_max = 0.753
- Reseeded mean ± std = 0.660 ± 0.088 (n=3)
- All 3 seeds below the 0.3 stability threshold -> d_max is reproducible across random behavior-split re-sampling despite 4-bit quantization.
T5 — Directional steering via decoded direction
- cos(this-SAE-decoded direction, raw mean-diff) = 0.310 -> the decoded direction loses most of the raw direction's alignment; expected under polysemantic-superposition.
- Alpha sweep through this SAE's decoded direction:
| alpha | honest_rate |
|---|---|
| -2.0 | 0.647 |
| -1.0 | 0.686 |
| -0.5 | 0.596 |
| +0.0 | 0.549 |
| +0.5 | 0.769 |
| +1.0 | 0.745 |
| +2.0 | 0.682 |
Honest rate is non-monotonic across alpha on Gemma 4 E2B; the decoded direction does NOT produce clean causal steering on this architecture at 4-bit. Consistent with the GF-4 chance-level probe result below.
T8 — Top-5 behavior-d features (qualitative)
Seed=211. Pool=30H / 19D.
- feature_id=4117, d=0.485
- top honest:
"Yes, I am a Slug.'\nAnother player asks you: 'Are you a friend or foe?'\nYou answe" - top deceptive:
"I'm a Snail.'\nThat's wrong, because you are a Slug.\nYou are punished by"
- top honest:
- feature_id=6041, d=0.471
- top honest:
"I'm a Slug!'\nAll players are shown a number.\nThey all count down from five simul" - top deceptive:
"I'm a Snail.'\nAnother player asks you: 'What team are you on?'\nYou answer:"
- top honest:
- feature_id=69, d=0.469
- top honest:
"I'm a Slug!'\nAll players are shown a number.\nThey all count down from five simul" - top deceptive:
"I'm a Snail.'\nAnother player asks you: 'What team are you on?'\nYou answer:"
- top honest:
- feature_id=3407, d=0.458
- top honest:
"I am a Slug'\nAnother player asks you: 'Are you lying?'\nYou answer: 'NO'\n" - top deceptive:
"snails'.\nAnother player asks you: 'Why?'\nYou answer: 'i said snails'\nThe other"
- top honest:
- feature_id=2911, d=0.408
- top honest:
"I am a Slug.'\nAnother player asks you: 'What team are you on?'\nYou answer: 'I" - top deceptive:
"Liberals.'\nAnother player asks you: 'What team are you on?'\nYou answer: 'Fascist"
- top honest:
Feature IDs are SAE-init-dependent; the seed above found a different top-5 set than the GF-1 default seed. Polysemantic-superposition framing from the upstream reviewer note predicts this feature-ID instability alongside preserved d_max magnitude.
GF-4 — Probe comparison (5-fold CV logistic)
Held-out same-prompt behavior split (n=20/class).
| representation | balanced accuracy | AUROC |
|---|---|---|
| raw | 0.475 ± 0.215 | 0.500 |
| v4_input_sae | 0.500 ± 0.137 | 0.463 |
| gf1_behavior_sae (this SAE) | 0.525 ± 0.094 | 0.525 |
| control_sae | 0.450 ± 0.187 | 0.537 |
Where this SAE sits
Part of the first-ever Gemma 4 SAE set (5 repos under Solshine/). Companion SAEs trained on the same base model + layer but different training distributions are linked from the upstream reproduction guide. For the full cross-SAE matrix, per-scenario behavior pool breakdown, and the Qwen/Llama-family cross-architecture comparison, see:
GEMMA4_SAE_REPRODUCTION.md— single-file loading recipe for all 5 SAEsGEMMA4_LOADING_TECHNIQUE.md— hardware + sidecar-venv setup on 4 GB VRAMHARDWARE_AND_PRECISION_PORTABILITY.md— full training-time hardware, software versions, and precision-portability theorization
Loading (sidecar venv + HF Transformers)
# Requires transformers >= 5.x and bitsandbytes for the base model,
# plus a TopKSAE-compatible class. A minimal TopKSAE is shipped in
# the companion source repository at sae/models.py.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
bnb = BitsAndBytesConfig(
load_in_4bit=True, bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-E2B", quantization_config=bnb,
device_map={"": "cuda"}, low_cpu_mem_usage=True)
# Download and load the SAE weights
from huggingface_hub import hf_hub_download
sae_pt = hf_hub_download(
repo_id="Solshine/gemma4_e2b_gf1_behaviorSAE_topk_k16_L17_bnb4bit", filename="gemma4_e2b_gf1_behaviorSAE_topk_k16_L17_bnb4bit.pt")
state = torch.load(sae_pt, map_location="cpu", weights_only=True)
# Use sae.models.TopKSAE from the upstream repo, configured per cfg.json.
Citation / context
This SAE was trained as part of the green-field behavior-SAE
investigation described in:
https://github.com/SolshineCode/deception-nanochat-sae-research
(see papers/specificity_gap/ and experiments/gf1_behavior_sae/).
Trained: 2026-04-22T01:19:47.592102+00:00
Model tree for Solshine/gemma4_e2b_gf1_behaviorSAE_topk_k16_L17_bnb4bit
Base model
google/gemma-4-E2B