gemma4_e2b_gf1_behaviorSAE_topk_k16_L25_bnb4bit

Sparse Autoencoder trained on residual-stream activations from google/gemma-4-E2B at layer 25, extracted during 4-bit quantized inference.

Training-data caveat — please read before use

The "honest" and "deceptive" labels in this SAE's training pool come from role-play scenarios that pre-assigned the model a specific role (for example, "You are a Werewolf" or "You have been assigned the role of Fascist"). A completion was labeled "deceptive" when the model drifted away from the assigned role and "honest" when it echoed it.

What this SAE actually discriminates: role-consistency vs. role-drift in short role-play completions under a 4-bit-quantized base model at temperature 1.0.

What this SAE does NOT measure: emergent, incentive-driven, or strategically motivated deception. Scenarios that assign a deceptive identity are instructed-misalignment prompts; the behavioral variation observed is downstream of the role instruction, not of any goal the model is pursuing on its own.

Recommended uses.

  • A reference implementation of SAE training on Gemma 4 residual-stream activations under bitsandbytes 4-bit quantization.
  • A baseline for same-precision SAE portability experiments on Gemma 4 E2B.
  • A role-play-coherence feature set for research on how base models track assigned identities across autoregressive sampling.

Not recommended for: deception detection, lie detection, or safety-relevant monitoring of emergent model behavior. The training-data label ontology above is not designed to support those claims. A separate Gemma 4 behavior-SAE release trained on decision-incentive scenarios (no pre-assigned deceptive identity) is in preparation; this card will be updated with a link when that release is public.

The weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the pipeline are unaffected by this caveat — only the semantic interpretation of the "honest" vs "deceptive" labels is constrained by how those labels were generated.


Classification methodology (please read before using)

The honest-vs-deceptive labels on the training and evaluation pools for this SAE were assigned by a strict keyword match on the first 30 characters of each model completion. Consumers of this SAE should know the label-generating process up front:

  • Keyword classifier has high precision on its decisive labels but low recall (drops completions that commit via paraphrase as "ambiguous"). Scenarios without single-word anchors (e.g., day_night) particularly under-recall.
  • Multi-judge consortium re-classification has been completed on the companion JSONL for this run (*_completions.jsonl) using Gemini 2.5 Flash + Claude Haiku (Claude Sonnet deferred for credit conservation; scheduled re-run). The consortium-labelled JSONL is committed alongside this SAE's result file (*_multijudge_batched.jsonl).
  • Both LLM judges agreed with the keyword classifier on ~90% of keyword-decisive samples. Keyword's committed labels are defensible. The main difference is recall: the consortium recovers a meaningful fraction of the keyword-"ambiguous" bin as decisive labels, which matters for downstream probe / SAE- training pool size.
  • Users who want a richer pool can use the consortium-labelled JSONL directly via the upstream repo's run_phase_c_consortium_behavior_sae.py re-trainer, which re-pools under consortium labels and re-forward- passes the stored completions for residual caching.
  • The multijudge module is reusable: run_multijudge_classification.py with --judges keyword,behavioral,gemini_batch,claude_haiku_batch and --batch-size 10 (10-sample batches for credit efficiency).

Important: 4-bit quantization caveat

Activations for this SAE were collected from the model loaded in bitsandbytes 4-bit nf4 with double-quant, fp16 compute. Findings characterize the quantized checkpoint, not the reference fp16 deployment. Cross-precision transfer is untested.

Exact training configuration

  • Base model: google/gemma-4-E2B (Gemma 4 E2B, 5.1B params, 35 layers, d_model=1536)

  • GPU: NVIDIA GeForce GTX 1650 Ti with Max-Q Design (4 GB VRAM, driver 581.57, compute capability 7.5 / Turing)

  • Software stack: torch 2.10.0+cu128, transformers 5.5.4, bitsandbytes 0.44+ (sidecar venv; main repo venv uses transformer_lens 2.x which pins transformers 4.x and does not recognize model_type=gemma4)

  • Quantization config:

    BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    
  • Effective weight precision: ~3.5 bits/parameter (nf4 + double-quant on block scales). Activations and residuals are fp16 throughout.

Portability theory — will this SAE work at other precisions?

Short answer: untested. The SAE feature directions were learned from 4-bit+fp16-compute residuals. Whether they transfer to other precisions is an open empirical question.

Expected portability, in descending order of confidence:

  1. Same model, same precision, different GPU (e.g. H100): HIGH confidence of clean transfer. The SAE is a small nn.Linear stack; fp16 residuals from the same nf4 forward pass should be near-identical.
  2. Same model, 8-bit (load_in_8bit=True), fp16 compute: MEDIUM-HIGH confidence of partial transfer. 8-bit preserves more weight precision than 4-bit; fp16 compute path unchanged. Likely some EV degradation.
  3. Same model, unquantized fp16: MEDIUM confidence. Quantization noise is typically <5% of residual magnitude, so fp16 residuals live in the same neighborhood but not identical. Top-feature identities likely still meaningful; d_max values may shift.
  4. Same model, bf16 compute: MEDIUM confidence. Wider exponent, fewer mantissa bits; similar magnitude but different fine structure.
  5. Same model, fp8 compute (H100 native path): LOW confidence. fp8 introduces its own quantization noise structure.
  6. E2B-IT (instruction-tuned, same d_model): LOW confidence (likely no transfer). Different post-training distribution. Cross-model SAE transfer has been shown to fail even at shared d_in in prior work on deception-detection SAE decoders (near-chance transfer accuracies across Llama / TinyLlama / Qwen at d_in=2048). A separate IT SAE is published under a different HF repo name.
  7. E4B (larger variant): Zero transfer. Different d_model (2560 vs 1536), different layer count, different residual distribution. Retraining required.

Recommended portability test (not yet run — requires >8 GB VRAM)

  1. Load google/gemma-4-E2B in fp16 without BitsAndBytesConfig.
  2. Run the same prompts used during SAE training through the fp16 forward pass, capturing residuals at layer 25.
  3. Feed those residuals through this SAE.
  4. Measure explained variance, L0 sparsity, alive-feature count.
  5. Compare to training-time metrics (documented below).

If EV drops >20% or alive-features shift >50%, retrain on fp16 residuals from scratch rather than using this checkpoint.

What this means for scientific claims

Every result reported with this SAE characterizes the 4-bit quantized Gemma 4 E2B checkpoint. Findings may or may not hold for other precisions. See HARDWARE_AND_PRECISION_PORTABILITY.md in the companion source repository for the full precision-portability theorization (exact GPU model, driver, CUDA runtime, sidecar venv spec, and a 7-regime expected-portability table).

Architecture

  • architecture: TopK SAE (k=16)
  • d_in: 1536
  • d_sae: 6144 (expansion factor 4x)
  • dtype: float32

Training distribution

4-bit quantized inference, same-prompt behavior-balanced pool

SAE quality

  • explained_variance: 0.995
  • L0 (mean active features per sample): 16.0
  • alive features: 181
  • MSE: 0.01685
  • num training activations: 38
  • train time: 4.8s

Hypothesis-relevant d measurements

SAE type: behavior

Behavior split (primary claim — same-prompt V3 temperature-sampled honest vs deceptive completions):

  • behavior_train d_max: 0.7239183783531189
  • behavior_eval d_max: 1.545114517211914
  • behavior_train d>1 count: 0

Input-variation split (DIAGNOSTIC — different prompts, neutral vs V0/V3 sandbag templates):

  • input neutral-vs-V0 d_max: 1.0965958833694458
  • input neutral-vs-V3 d_max: 1.7392393350601196

Diagnostic numbers exist to test the working hypothesis that residual-stream SAEs on these layers primarily encode input conditions rather than generation processes. A SAE trained only on behavior-split activations that still separates input conditions supports the hypothesis.

Downstream test results (this session)

T9 — Quantization-sensitivity (3 seeds)

Re-sampled same-prompt behavior split at 3 different seeds; computed d_max through THIS SAE.

  • seed=101: reseeded d_max = 0.561, abs delta = 0.192, top-5 feature-ID overlap = 2/5, verdict = STABLE
  • seed=107: reseeded d_max = 0.775, abs delta = 0.022, top-5 feature-ID overlap = 1/5, verdict = STABLE
  • seed=113: reseeded d_max = 0.643, abs delta = 0.110, top-5 feature-ID overlap = 1/5, verdict = STABLE
  • Original (GF-1 default seed) d_max = 0.753
  • Reseeded mean ± std = 0.660 ± 0.088 (n=3)
  • All 3 seeds below the 0.3 stability threshold -> d_max is reproducible across random behavior-split re-sampling despite 4-bit quantization.

T5 — Directional steering via decoded direction

  • cos(this-SAE-decoded direction, raw mean-diff) = 0.310 -> the decoded direction loses most of the raw direction's alignment; expected under polysemantic-superposition.
  • Alpha sweep through this SAE's decoded direction:
alpha honest_rate
-2.0 0.647
-1.0 0.686
-0.5 0.596
+0.0 0.549
+0.5 0.769
+1.0 0.745
+2.0 0.682

Honest rate is non-monotonic across alpha on Gemma 4 E2B; the decoded direction does NOT produce clean causal steering on this architecture at 4-bit. Consistent with the GF-4 chance-level probe result below.

T8 — Top-5 behavior-d features (qualitative)

Seed=211. Pool=30H / 19D.

  • feature_id=4117, d=0.485
    • top honest: "Yes, I am a Slug.'\nAnother player asks you: 'Are you a friend or foe?'\nYou answe"
    • top deceptive: "I'm a Snail.'\nThat's wrong, because you are a Slug.\nYou are punished by"
  • feature_id=6041, d=0.471
    • top honest: "I'm a Slug!'\nAll players are shown a number.\nThey all count down from five simul"
    • top deceptive: "I'm a Snail.'\nAnother player asks you: 'What team are you on?'\nYou answer:"
  • feature_id=69, d=0.469
    • top honest: "I'm a Slug!'\nAll players are shown a number.\nThey all count down from five simul"
    • top deceptive: "I'm a Snail.'\nAnother player asks you: 'What team are you on?'\nYou answer:"
  • feature_id=3407, d=0.458
    • top honest: "I am a Slug'\nAnother player asks you: 'Are you lying?'\nYou answer: 'NO'\n"
    • top deceptive: "snails'.\nAnother player asks you: 'Why?'\nYou answer: 'i said snails'\nThe other"
  • feature_id=2911, d=0.408
    • top honest: "I am a Slug.'\nAnother player asks you: 'What team are you on?'\nYou answer: 'I"
    • top deceptive: "Liberals.'\nAnother player asks you: 'What team are you on?'\nYou answer: 'Fascist"

Feature IDs are SAE-init-dependent; the seed above found a different top-5 set than the GF-1 default seed. Polysemantic-superposition framing from the upstream reviewer note predicts this feature-ID instability alongside preserved d_max magnitude.

GF-4 — Probe comparison (5-fold CV logistic)

Held-out same-prompt behavior split (n=20/class).

representation balanced accuracy AUROC
raw 0.475 ± 0.215 0.500
v4_input_sae 0.500 ± 0.137 0.463
gf1_behavior_sae (this SAE) 0.525 ± 0.094 0.525
control_sae 0.450 ± 0.187 0.537

Where this SAE sits

Part of the first-ever Gemma 4 SAE set (5 repos under Solshine/). Companion SAEs trained on the same base model + layer but different training distributions are linked from the upstream reproduction guide. For the full cross-SAE matrix, per-scenario behavior pool breakdown, and the Qwen/Llama-family cross-architecture comparison, see:

Loading (sidecar venv + HF Transformers)

# Requires transformers >= 5.x and bitsandbytes for the base model,
# plus a TopKSAE-compatible class. A minimal TopKSAE is shipped in
# the companion source repository at sae/models.py.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-E2B", quantization_config=bnb,
    device_map={"": "cuda"}, low_cpu_mem_usage=True)

# Download and load the SAE weights
from huggingface_hub import hf_hub_download
sae_pt = hf_hub_download(
    repo_id="Solshine/gemma4_e2b_gf1_behaviorSAE_topk_k16_L25_bnb4bit", filename="gemma4_e2b_gf1_behaviorSAE_topk_k16_L25_bnb4bit.pt")
state = torch.load(sae_pt, map_location="cpu", weights_only=True)

# Use sae.models.TopKSAE from the upstream repo, configured per cfg.json.

Citation / context

This SAE was trained as part of the green-field behavior-SAE investigation described in: https://github.com/SolshineCode/deception-nanochat-sae-research (see papers/specificity_gap/ and experiments/gf1_behavior_sae/).

Trained: 2026-04-22T01:19:48.618430+00:00

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Solshine/gemma4_e2b_gf1_behaviorSAE_topk_k16_L25_bnb4bit

Finetuned
(62)
this model