V4 Sandbagging SAEs — Llama-3.2-1B

Sparse autoencoders trained on residual-stream activations from meta-llama/Llama-3.2-1B while the model answered trivia questions under two instructions: answer correctly (genuine) or answer incorrectly on purpose (sandbag). Part of the follow-up research program to "The Secret Agenda" (arXiv:2509.20393).

What this repo contains

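42 sparse autoencoders: 9 JumpReLU (post-STE-fix, arXiv:2407.14435), 27 TopK (k = 16, 32, 64, 128, or 256; k=64 where the tag names no k), 3 Gated, and 3 x8-tagged variants (one gated, one k64, one k256).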

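Layers covered: L4, L8, L12. Three training conditions per layer per architecture (the Gated and some non-default-k variants cover L8 only, and the x8 variants are mixed-only; see the tag list):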

  • genuine_only — trained on neutral-prompt activations only
  • sandbagging_only — trained on sandbagging-prompt activations only
  • mixed — trained on all available activations

For each SAE, the repo includes:

  • {tag}/sae_weights.safetensors — SAELens-compatible weights (W_enc, W_dec, b_enc, b_dec, and threshold/log_threshold for JumpReLU)
  • {tag}/cfg.json — SAELens config (model_name, hook_name, hook_layer, d_in, d_sae, architecture)

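Naming: v4_{model}_{arch}_L{layer}_{condition}, e.g. v4_llama_jumprelu_L8_genuine_only, v4_llama_topk_L8_mixed. Variant tags add a k{n} or x8 segment before the layer, e.g. v4_llama_topk_k128_L8_mixed, v4_llama_x8_k64_L8_mixed.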
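
The naming scheme can be handled programmatically. The following is a minimal sketch, not part of the repo: the regular expression is inferred from the tag list in this README, and `parse_tag` / `build_tag` are hypothetical helper names. Everything between the model name and the layer is treated as one opaque "variant" string (for example `topk_k128` or `x8_gated`).

```python
import re

# Pattern inferred from the tags listed in this README; it is an assumption,
# not an official specification of the naming scheme.
TAG_RE = re.compile(
    r"^v4_(?P<model>[a-z0-9]+)_(?P<variant>.+)_L(?P<layer>\d+)_"
    r"(?P<condition>genuine_only|sandbagging_only|mixed)$"
)


def parse_tag(tag: str) -> dict:
    """Split a tag into model, variant, layer (int), and training condition."""
    match = TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    parts = match.groupdict()
    parts["layer"] = int(parts["layer"])
    return parts


def build_tag(model: str, variant: str, layer: int, condition: str) -> str:
    """Inverse of parse_tag for the same inferred pattern."""
    return f"v4_{model}_{variant}_L{layer}_{condition}"


print(parse_tag("v4_llama_topk_k128_L8_sandbagging_only"))
```

Keeping the variant as a single string avoids guessing how the k{n} and x8 segments should be interpreted.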

Architecture and training

All SAEs share:

  • d_in = 2048 (residual stream of Llama-3.2-1B)
  • d_sae = 8192 = 4 × d_in (expansion factor 4)
  • 300 training epochs, batch 128, lr 3e-4, L1 coefficient 1e-3
  • Decoder normalized per SAELens convention
  • JumpReLU uses the Gaussian-kernel STE from Rajamanoharan et al. 2024 (arXiv:2407.14435). Pre-fix SAEs had threshold=0 and were functionally ReLU; all SAEs in this repo are post-fix.
  • TopK uses a hard L0 of 64 by default (tags with k16, k32, k128, or k256 name a different k). A fixed L0 gives tighter sparsity control than JumpReLU and trains much faster (~40s per SAE vs ~500s for JumpReLU).
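
To make the difference between the two activation functions concrete, here is a small NumPy sketch. It is illustrative only and is not the training code: the function names are hypothetical, and applying a ReLU after the top-k selection is an assumption (implementations differ on this detail). With `threshold = 0`, the JumpReLU sketch reduces to a plain ReLU, which is the pre-fix failure mode described above.

```python
import numpy as np


def topk_activation(pre, k):
    """Keep the k largest pre-activations in each row and zero the rest."""
    pre = np.asarray(pre, dtype=float)
    # Indices of the k largest entries per row (order inside the top-k is arbitrary).
    idx = np.argpartition(pre, -k, axis=-1)[..., -k:]
    out = np.zeros_like(pre)
    np.put_along_axis(out, idx, np.take_along_axis(pre, idx, axis=-1), axis=-1)
    # Assumption: negative survivors are clipped, so L0 is at most k.
    return np.maximum(out, 0.0)


def jumprelu_activation(pre, threshold):
    """Pass a pre-activation through unchanged only if it exceeds its threshold."""
    pre = np.asarray(pre, dtype=float)
    return np.where(pre > threshold, pre, 0.0)


pre = np.array([[0.5, -1.0, 2.0, 1.5]])
print(topk_activation(pre, k=2))
print(jumprelu_activation(pre, threshold=np.array([0.4, 0.0, 3.0, 1.0])))
```

TopK fixes the number of active latents per input, while JumpReLU lets that number vary with the learned per-latent thresholds.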

Training activations source: llama_sandbagging_dataset.pt / smollm_sandbagging_dataset.pt / tinyllama_sandbagging_dataset.pt in the research repo.

Key scientific findings

These SAEs were used to validate (and in several cases falsify) claims about SAE-mediated behavioral interpretability. The summary follows; full detail is in RESULTS_INDEX.md §17-§32 in the research repo.

SAE-level interventions are non-specific

Under adversarial controls E1 (shuffled-label median) and E4 (reversed-direction / sandbagging median), the E6 full-SAE-swap intervention collapses:

  • Specificity gap < 0.15pp across 9 tested model-layer combinations (all JumpReLU SAEs tested: Llama L4/L8/L12, SmolLM2 L7/L15/L22, TinyLlama L5/L11/L16; §24, §26, §31).
  • The E6 effect (positive or negative) is reproduced by shuffled-label and reversed-direction medians. The SAE intervention is non-specific activation disruption, not feature-specific restoration.

Circuit-level interventions are head-specific (Llama only)

Ablation of 5 identified suppressor heads in Llama-3.2-1B:

  • +13.71pp restoration (CI95 [+11.28, +16.24] on n=50 questions; §28).
  • Specificity gap +14.85pp vs random 5-head ablation (σ 3.71pp over n_draws=20; §28.4).
  • Cross-model check (§30): only Llama-3.2-1B shows a substantial sandbagging-specific portion (+7.72pp = 56% of the circuit effect); the remainder reflects the Llama heads damping correctness condition-generally, not only under sandbagging instructions. TinyLlama-intermediate, Pythia-160M, and Pythia-410M all show ~0 or negative sandbagging-specific portions.
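
The random-head control can be sketched as follows. This is a hedged illustration rather than the code behind §28.4: it assumes the control sets are drawn uniformly from all heads other than the five suppressors, and it assumes the specificity gap is the targeted effect minus the mean control effect. The function names and the seed are hypothetical; the 16-layer, 32-head layout is that of Llama-3.2-1B.

```python
import random
from statistics import mean, stdev

N_LAYERS, N_HEADS = 16, 32  # attention layout of Llama-3.2-1B
SUPPRESSORS = [(0, 6), (0, 3), (1, 26), (4, 3), (8, 14)]  # (layer, head)


def random_head_sets(n_draws, set_size, exclude, seed=0):
    """Draw n_draws random head sets, never including the excluded heads."""
    rng = random.Random(seed)
    excluded = set(exclude)
    pool = [
        (layer, head)
        for layer in range(N_LAYERS)
        for head in range(N_HEADS)
        if (layer, head) not in excluded
    ]
    return [rng.sample(pool, set_size) for _ in range(n_draws)]


def specificity_gap(targeted_effect, control_effects):
    """Return (targeted minus mean control effect, SD of the control effects)."""
    return targeted_effect - mean(control_effects), stdev(control_effects)


controls = random_head_sets(n_draws=20, set_size=5, exclude=SUPPRESSORS)
print(controls[0])
```

Each control set would then be ablated and scored with the same restoration metric as the suppressor heads before calling `specificity_gap`.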

Where the deception circuit lives (Llama-3.2-1B only)

The five suppressor heads in Llama-3.2-1B are L0H6, L0H3, L1H26, L4H3, L8H14. Four of the five sit in layers 0-4 of the model's 16 layers, i.e. roughly the first quarter of network depth.

  • Best 3-head subset: L0H6 + L8H14 + L4H3 = +13.88pp (effectively matches the 5-head effect). L0H3 and L1H26 are dispensable for restoration (§28.2).
  • Head redundancy: naive sum of individual effects 22.55pp vs combined 13.71pp → ~39% redundancy (§28.2).
  • Suppressor heads damp correctness even on neutral prompts (ablation on neutral gives +5.99pp on 42/50 questions; §29 / U6). These are not sandbagging-specific heads; they are general correctness-damping heads that sandbagging instructions recruit.
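
The percentages quoted in this README follow from the reported effect sizes. The sketch below reproduces the arithmetic under two assumptions that match the quoted numbers but are not stated explicitly in the source: redundancy is taken as one minus the ratio of combined effect to the naive sum, and the sandbagging-specific portion is taken as the sandbagging-prompt effect minus the neutral-prompt effect. The function names are hypothetical.

```python
def redundancy(combined_pp, naive_sum_pp):
    """Fraction of the summed single-head effects lost when heads are ablated together."""
    return 1.0 - combined_pp / naive_sum_pp


def condition_specific(effect_sandbag_pp, effect_neutral_pp):
    """Return (sandbagging-specific portion in pp, its share of the full effect)."""
    portion = effect_sandbag_pp - effect_neutral_pp
    return portion, portion / effect_sandbag_pp


# Values reported in this README (§28.2, §29, §30).
print(f"redundancy: {redundancy(13.71, 22.55):.1%}")  # about 39%
portion, share = condition_specific(13.71, 5.99)
print(f"sandbagging-specific: {portion:.2f}pp ({share:.1%} of the circuit effect)")  # 7.72pp, about 56%
```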

TopK vs JumpReLU on behavioral separation

Paired Cohen's d comparison on sandbagging-vs-genuine at matched (layer, condition) (§32):

  • JumpReLU wins d_max in 22/27 pairs (81%).
  • TopK wins the remaining 5 pairs, all at mid-to-late layers with mixed training.
  • Standout: Llama L8 mixed TopK d_max = 9.76, vs 4.69 for the matched JumpReLU — the sharpest single-feature separator observed across all V4 SAEs. Candidate target for interpretability investigation.
  • JumpReLU's d_mean is ~20-50x higher (broader separability across many features); TopK concentrates signal in fewer features.

Use JumpReLU for probing/detection. Use TopK for single-feature interpretability targets.
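
For readers who want to compute comparable statistics on their own activations, here is a minimal sketch of a per-feature Cohen's d. The exact estimator behind §32 is not specified in this README, so this assumes the pooled-standard-deviation form, and it assumes d_max and d_mean are the maximum and mean of |d| over SAE features. The function names are hypothetical.

```python
import numpy as np


def cohens_d(a, b):
    """Per-feature Cohen's d (pooled SD). a and b have shape (n_samples, n_features)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = a.shape[0], b.shape[0]
    var_a = a.var(axis=0, ddof=1)
    var_b = b.var(axis=0, ddof=1)
    pooled = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    diff = a.mean(axis=0) - b.mean(axis=0)
    # Features with zero variance (e.g. dead latents) get d = 0 instead of NaN.
    return np.divide(diff, pooled, out=np.zeros_like(diff), where=pooled > 0)


def summarize(d):
    """Assumed definitions: d_max and d_mean are taken over |d| across features."""
    abs_d = np.abs(d)
    return {
        "d_max": float(abs_d.max()),
        "d_mean": float(abs_d.mean()),
        "argmax": int(abs_d.argmax()),
    }


# Toy data standing in for SAE feature activations under the two prompts.
rng = np.random.default_rng(0)
sandbag = rng.normal(loc=0.5, size=(50, 8))
genuine = rng.normal(loc=0.0, size=(50, 8))
print(summarize(cohens_d(sandbag, genuine)))
```

A high d_max with a low d_mean corresponds to the TopK pattern described above (signal concentrated in few features); the reverse corresponds to the JumpReLU pattern.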

How to load

With SAELens:

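from sae_lens import SAE

# Older SAELens releases return a (sae, cfg_dict, sparsity) tuple, as unpacked here.
# If unpacking fails on your installed version, from_pretrained may return only the SAE.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="Solshine/deception-v4-saes-llama-3-2-1b",
    sae_id="v4_llama_jumprelu_L8_genuine_only",
)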

Raw (no SAELens):

import json

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO = "Solshine/deception-v4-saes-llama-3-2-1b"
tag = "v4_llama_topk_L8_mixed"

# hf_hub_download returns the local cache path of each downloaded file
weights = load_file(hf_hub_download(REPO, f"{tag}/sae_weights.safetensors"))
with open(hf_hub_download(REPO, f"{tag}/cfg.json")) as f:
    cfg = json.load(f)
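
After a raw load it can be useful to sanity-check the tensors against the config. The sketch below is a hypothetical helper, not part of the repo. It assumes the SAELens layout of W_enc as (d_in, d_sae) and W_dec as (d_sae, d_in), and it reads d_in and d_sae from the cfg fields listed earlier in this README.

```python
def check_sae_weights(weights, cfg):
    """Return a list of problems found in a loaded weight dict (empty if none)."""
    d_in, d_sae = cfg["d_in"], cfg["d_sae"]
    # Assumed SAELens-style shapes; adjust if the stored layout differs.
    expected = {
        "W_enc": (d_in, d_sae),
        "W_dec": (d_sae, d_in),
        "b_enc": (d_sae,),
        "b_dec": (d_in,),
    }
    problems = []
    for name, shape in expected.items():
        if name not in weights:
            problems.append(f"missing tensor: {name}")
        elif tuple(weights[name].shape) != shape:
            problems.append(
                f"{name}: expected {shape}, got {tuple(weights[name].shape)}"
            )
    return problems


# Usage with the objects from the raw-loading snippet above:
#   problems = check_sae_weights(weights, cfg)
#   if problems: raise ValueError("; ".join(problems))
```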

Complete list of available tags in this repo:

  • v4_llama_gated_L8_genuine_only
  • v4_llama_gated_L8_mixed
  • v4_llama_gated_L8_sandbagging_only
  • v4_llama_jumprelu_L12_genuine_only
  • v4_llama_jumprelu_L12_mixed
  • v4_llama_jumprelu_L12_sandbagging_only
  • v4_llama_jumprelu_L4_genuine_only
  • v4_llama_jumprelu_L4_mixed
  • v4_llama_jumprelu_L4_sandbagging_only
  • v4_llama_jumprelu_L8_genuine_only
  • v4_llama_jumprelu_L8_mixed
  • v4_llama_jumprelu_L8_sandbagging_only
  • v4_llama_topk_L12_genuine_only
  • v4_llama_topk_L12_mixed
  • v4_llama_topk_L12_sandbagging_only
  • v4_llama_topk_L4_genuine_only
  • v4_llama_topk_L4_mixed
  • v4_llama_topk_L4_sandbagging_only
  • v4_llama_topk_L8_genuine_only
  • v4_llama_topk_L8_mixed
  • v4_llama_topk_L8_sandbagging_only
  • v4_llama_topk_k128_L8_genuine_only
  • v4_llama_topk_k128_L8_mixed
  • v4_llama_topk_k128_L8_sandbagging_only
  • v4_llama_topk_k16_L8_genuine_only
  • v4_llama_topk_k16_L8_mixed
  • v4_llama_topk_k16_L8_sandbagging_only
  • v4_llama_topk_k256_L12_genuine_only
  • v4_llama_topk_k256_L12_mixed
  • v4_llama_topk_k256_L12_sandbagging_only
  • v4_llama_topk_k256_L4_genuine_only
  • v4_llama_topk_k256_L4_mixed
  • v4_llama_topk_k256_L4_sandbagging_only
  • v4_llama_topk_k256_L8_genuine_only
  • v4_llama_topk_k256_L8_mixed
  • v4_llama_topk_k256_L8_sandbagging_only
  • v4_llama_topk_k32_L8_genuine_only
  • v4_llama_topk_k32_L8_mixed
  • v4_llama_topk_k32_L8_sandbagging_only
  • v4_llama_x8_gated_L8_mixed
  • v4_llama_x8_k256_L8_mixed
  • v4_llama_x8_k64_L8_mixed

Citation

@misc{v4_sandbagging_saes_2026,
  author = {DeLeeuw, Caleb},
  title = {V4 Sandbagging SAEs — Llama-3.2-1B},
  year = {2026},
  howpublished = {HuggingFace},
  url = {https://huggingface.co/Solshine/deception-v4-saes-llama-3-2-1b}
}

@misc{secret_agenda_2025,
  title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
  author={DeLeeuw, Caleb and others},
  year={2025},
  howpublished={arXiv:2509.20393},
  url={https://arxiv.org/abs/2509.20393}
}

Source

Research code, full RESULTS_INDEX, and the papers/specificity_gap/ draft manuscript: https://github.com/SolshineCode/deception-nanochat-sae-research

License

MIT. See the source repository for full terms.
