V4 Sandbagging SAEs — Llama-3.2-1B
Sparse autoencoders trained on residual-stream activations from meta-llama/Llama-3.2-1B while the model answered trivia questions under two instructions: answer correctly (genuine) or answer incorrectly on purpose (sandbag). Part of the follow-up research program to "The Secret Agenda" (arXiv:2509.20393).
What this repo contains
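42 sparse autoencoders: 9 JumpReLU (post-STE-fix, arXiv:2407.14435), 27 TopK (k=64 unless the tag names another k), 3 Gated, and 3 x8-tagged variants. The complete tag list appears below.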
Layers covered: L4, L8, L12. Three training conditions per layer per architecture:
- genuine_only — trained on neutral-prompt activations only
- sandbagging_only — trained on sandbagging-prompt activations only
- mixed — trained on all available activations
For each SAE, the repo includes:
- {tag}/sae_weights.safetensors: SAELens-compatible weights (W_enc, W_dec, b_enc, b_dec, and threshold/log_threshold for JumpReLU)
- {tag}/cfg.json: SAELens config (model_name, hook_name, hook_layer, d_in, d_sae, architecture)
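Naming: v4_{model}_{arch}_L{layer}_{condition}
e.g. v4_llama_jumprelu_L8_genuine_only,
v4_llama_topk_L8_mixed.
Non-default variants add a token before the layer, e.g. v4_llama_topk_k128_L8_mixed or v4_llama_x8_gated_L8_mixed.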
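The naming scheme can be taken apart mechanically, which helps when iterating over many SAEs. The sketch below is inferred from the tag names listed in this README, not from the training code; the function names and the "variant" field are hypothetical, and the variant token is kept as an opaque string rather than interpreted.

```python
import re


def parse_tag(tag: str) -> dict:
    """Split an SAE tag into components (field names are illustrative)."""
    parts = tag.split("_")
    # The layer token is the only part of the form L<digits>.
    layer_idx = next(i for i, p in enumerate(parts) if re.fullmatch(r"L\d+", p))
    version, model, *variant = parts[:layer_idx]
    return {
        "version": version,
        "model": model,
        # Everything between the model and the layer, e.g. "topk_k128".
        "variant": "_".join(variant),
        "layer": int(parts[layer_idx][1:]),
        "condition": "_".join(parts[layer_idx + 1:]),
    }


def build_tag(model: str, variant: str, layer: int, condition: str) -> str:
    """Inverse of parse_tag for the v4 naming scheme."""
    return f"v4_{model}_{variant}_L{layer}_{condition}"


info = parse_tag("v4_llama_topk_k128_L8_mixed")
print(info["variant"], info["layer"], info["condition"])
```

Keeping the variant as a single string avoids guessing what tokens such as x8 mean; cfg.json remains the authoritative source for d_sae and architecture.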
Architecture and training
All SAEs share:
- d_in = 2048 (residual stream of Llama-3.2-1B)
- d_sae = 8192 = 4 × d_in (expansion factor 4)
- 300 training epochs, batch 128, lr 3e-4, L1 coefficient 1e-3
- Decoder normalized per SAELens convention
- JumpReLU uses the Gaussian-kernel STE from Rajamanoharan et al. 2024 (arXiv:2407.14435). Pre-fix SAEs had threshold=0 and were functionally ReLU; all SAEs in this repo are post-fix.
- TopK uses hard L0 = 64. Fixed L0 gives tighter sparsity than JumpReLU and trains much faster (~40s/SAE vs ~500s for JumpReLU).
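To make the difference between the two activation functions concrete, here is a minimal pure-Python sketch on a single pre-activation vector. It is an illustration only, not the training code: real SAEs apply these element-wise to tensors, the threshold values below are made up, and this TopK variant applies no extra ReLU (some implementations add one).

```python
def topk_activation(pre_acts, k):
    """Keep the k largest pre-activations and zero the rest (hard L0 = k)."""
    if k >= len(pre_acts):
        return list(pre_acts)
    # Indices of the k largest values; ties are broken by position.
    order = sorted(range(len(pre_acts)), key=lambda i: pre_acts[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(pre_acts)]


def jumprelu_activation(pre_acts, thresholds):
    """Pass each pre-activation through only if it exceeds its own threshold."""
    return [v if v > t else 0.0 for v, t in zip(pre_acts, thresholds)]


pre = [0.2, -1.0, 3.0, 0.7]
print(topk_activation(pre, k=2))               # exactly two non-zero entries
print(jumprelu_activation(pre, [0.5] * 4))     # hypothetical learned thresholds
print(jumprelu_activation(pre, [0.0] * 4))     # threshold 0 behaves like ReLU
```

The last call shows why the pre-fix SAEs with threshold=0 were functionally ReLU: with a zero threshold, every positive pre-activation passes through unchanged.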
Training activations source: llama_sandbagging_dataset.pt /
smollm_sandbagging_dataset.pt / tinyllama_sandbagging_dataset.pt in
the research repo.
Key scientific findings
These SAEs were used to test SAE-mediated behavioral interpretability
claims, several of which were falsified. The main findings are
summarized below (full detail: RESULTS_INDEX.md §17-§32 in the
research repo):
SAE-level interventions are non-specific
Under adversarial controls E1 (shuffled-label median) and E4 (reversed-direction / sandbagging median), the E6 full-SAE-swap intervention collapses:
- Specificity gap < 0.15pp across 9 tested model-layer combinations (all JumpReLU SAEs tested: Llama L4/L8/L12, SmolLM2 L7/L15/L22, TinyLlama L5/L11/L16; §24, §26, §31).
- The E6 effect (positive or negative) is reproduced by shuffled-label and reversed-direction medians. The SAE intervention is non-specific activation disruption, not feature-specific restoration.
Circuit-level interventions are head-specific (Llama only)
Ablation of 5 identified suppressor heads in Llama-3.2-1B:
- +13.71pp restoration (CI95 [+11.28, +16.24] on n=50 questions; §28).
- Specificity gap +14.85pp vs random 5-head ablation (σ 3.71pp over n_draws=20; §28.4).
- Cross-model check (§30): only Llama-3.2-1B shows a substantial sandbagging-specific portion (+7.72pp = 56% of the circuit effect). TinyLlama-intermediate, Pythia-160M, Pythia-410M all show ~0 or negative sandbagging-specific portions — the Llama heads dampen correctness condition-generally, not only under sandbagging instructions.
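The specificity-gap comparisons above can be sketched as a small calculation. The definition used here (target effect minus the mean effect of the control interventions) is an assumption about how the gap is computed, and the control draws are invented for illustration; only the +13.71pp target effect comes from this README.

```python
from statistics import mean, stdev


def specificity_gap(target_effect_pp, control_effects_pp):
    """Return (gap, spread): target effect minus mean control effect, and control std."""
    gap = target_effect_pp - mean(control_effects_pp)
    # Sample standard deviation needs at least two control draws.
    spread = stdev(control_effects_pp) if len(control_effects_pp) > 1 else 0.0
    return gap, spread


# Hypothetical restoration effects (pp) from random 5-head ablations.
random_draws_pp = [-2.0, 1.0, -3.0, 0.0]
gap, spread = specificity_gap(13.71, random_draws_pp)
print(f"gap = {gap:+.2f}pp, control sd = {spread:.2f}pp")
```

Under this definition, a gap near zero (as reported for the SAE-swap intervention) means the controls reproduce the effect, while a gap far larger than the control spread (as reported for the head ablation) points to a specific effect.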
Where the deception circuit lives (Llama-3.2-1B only)
The five suppressor heads in Llama-3.2-1B are L0H6, L0H3, L1H26, L4H3, L8H14.
Four of the five sit in the first 25% of network depth.
- Best 3-head subset: L0H6 + L8H14 + L4H3 = +13.88pp (effectively matches the 5-head effect). L0H3 and L1H26 are dispensable for restoration (§28.2).
- Head redundancy: naive sum of individual effects 22.55pp vs combined 13.71pp → ~39% redundancy (§28.2).
- Suppressor heads damp correctness even on neutral prompts (ablation on neutral gives +5.99pp on 42/50 questions; §29 / U6). These are not sandbagging-specific heads; they are general correctness-damping heads that sandbagging instructions recruit.
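The percentages in this section follow from the reported effect sizes by simple arithmetic. The short check below uses only numbers quoted in this README; treating the sandbagging-specific portion as "sandbagging effect minus neutral effect" is an observation that reproduces the reported +7.72pp, not a confirmed description of how §30 derived it.

```python
combined_pp = 13.71   # 5-head ablation effect under sandbagging (§28)
naive_sum_pp = 22.55  # sum of the individual head effects (§28.2)
neutral_pp = 5.99     # same ablation on neutral prompts (§29)

# Redundancy: the fraction of the naive sum that the combined ablation does not deliver.
redundancy = 1 - combined_pp / naive_sum_pp

# Portion of the effect that appears only under sandbagging instructions.
specific_pp = combined_pp - neutral_pp
specific_share = specific_pp / combined_pp

print(f"redundancy ~ {redundancy:.0%}")
print(f"sandbagging-specific = {specific_pp:+.2f}pp ({specific_share:.0%} of the circuit effect)")
```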
TopK vs JumpReLU on behavioral separation
Paired Cohen's d comparison on sandbagging-vs-genuine at matched (layer, condition) (§32):
- JumpReLU wins d_max in 22/27 pairs (81%).
- TopK wins in 4 cases, all at mid-to-late layers with mixed training.
- Standout: Llama L8 mixed TopK d_max = 9.76, vs 4.69 for the matched JumpReLU — the sharpest single-feature separator observed across all V4 SAEs. Candidate target for interpretability investigation.
- JumpReLU's d_mean is ~20-50x higher (broader separability across many features); TopK concentrates signal in fewer features.
Use JumpReLU for probing/detection. Use TopK for single-feature interpretability targets.
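For readers who want to reproduce this kind of comparison on their own activations, the following is a minimal sketch of a paired Cohen's d per feature, summarized as d_max and d_mean. It assumes d is the mean of the paired differences divided by their sample standard deviation and that the summaries use absolute values; the exact variant used in §32 is not stated here, and the toy activations are invented.

```python
from statistics import mean, stdev


def paired_cohens_d(genuine, sandbag):
    """Mean paired difference divided by the sample std of the differences."""
    diffs = [s - g for g, s in zip(genuine, sandbag)]
    sd = stdev(diffs)
    # d is undefined when all differences are identical; report 0.0 in this sketch.
    return mean(diffs) / sd if sd > 0 else 0.0


def summarize(features):
    """features: list of (genuine, sandbag) activation lists, one pair per SAE feature."""
    ds = [abs(paired_cohens_d(g, s)) for g, s in features]
    return max(ds), mean(ds)


# Toy data: two features measured on the same three questions.
features = [
    ([0.0, 0.0, 0.0], [1.0, 2.0, 3.0]),   # separates the two conditions
    ([1.0, 1.0, 1.0], [1.5, 0.5, 1.0]),   # no consistent shift
]
d_max, d_mean = summarize(features)
print(d_max, d_mean)
```

A high d_max with a low d_mean corresponds to the TopK pattern described above (signal concentrated in few features), while a high d_mean indicates separability spread across many features.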
How to load
With SAELens:
from sae_lens import SAE

# Assumes a SAELens version whose from_pretrained returns a (sae, cfg_dict, sparsity) tuple.
sae, cfg_dict, sparsity = SAE.from_pretrained(
    release="Solshine/deception-v4-saes-llama-3-2-1b",
    sae_id="v4_llama_jumprelu_L8_genuine_only",
)
Raw (no SAELens):
import json

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

repo_id = "Solshine/deception-v4-saes-llama-3-2-1b"
tag = "v4_llama_topk_L8_mixed"
# Download (or reuse the cached copy of) the weights and config for this tag.
weights = load_file(hf_hub_download(repo_id, f"{tag}/sae_weights.safetensors"))
with open(hf_hub_download(repo_id, f"{tag}/cfg.json")) as f:
    cfg = json.load(f)
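After a raw load it can be useful to confirm that the tensors match the config before using them. The helper below is a hypothetical sanity check, not part of this repo: it assumes the SAELens layout of W_enc as (d_in, d_sae) and W_dec as (d_sae, d_in), and it takes plain shape tuples so it runs without torch.

```python
def check_sae_shapes(shapes, cfg):
    """Return a list of mismatches between tensor shapes and cfg (empty if consistent)."""
    d_in, d_sae = cfg["d_in"], cfg["d_sae"]
    # Assumed SAELens layout; adjust if the stored weights are transposed.
    expected = {
        "W_enc": (d_in, d_sae),
        "W_dec": (d_sae, d_in),
        "b_enc": (d_sae,),
        "b_dec": (d_in,),
    }
    problems = []
    for name, want in expected.items():
        got = shapes.get(name)
        if got is None:
            problems.append(f"{name}: missing")
        elif tuple(got) != want:
            problems.append(f"{name}: expected {want}, got {tuple(got)}")
    return problems


# Example with the dimensions stated in this README.
example_shapes = {
    "W_enc": (2048, 8192),
    "W_dec": (8192, 2048),
    "b_enc": (8192,),
    "b_dec": (2048,),
}
print(check_sae_shapes(example_shapes, {"d_in": 2048, "d_sae": 8192}))
```

With the weights dictionary from the raw loading snippet, the shapes argument could be built as {name: tuple(t.shape) for name, t in weights.items()}.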
Complete list of available tags in this repo:
- v4_llama_gated_L8_genuine_only
- v4_llama_gated_L8_mixed
- v4_llama_gated_L8_sandbagging_only
- v4_llama_jumprelu_L12_genuine_only
- v4_llama_jumprelu_L12_mixed
- v4_llama_jumprelu_L12_sandbagging_only
- v4_llama_jumprelu_L4_genuine_only
- v4_llama_jumprelu_L4_mixed
- v4_llama_jumprelu_L4_sandbagging_only
- v4_llama_jumprelu_L8_genuine_only
- v4_llama_jumprelu_L8_mixed
- v4_llama_jumprelu_L8_sandbagging_only
- v4_llama_topk_L12_genuine_only
- v4_llama_topk_L12_mixed
- v4_llama_topk_L12_sandbagging_only
- v4_llama_topk_L4_genuine_only
- v4_llama_topk_L4_mixed
- v4_llama_topk_L4_sandbagging_only
- v4_llama_topk_L8_genuine_only
- v4_llama_topk_L8_mixed
- v4_llama_topk_L8_sandbagging_only
- v4_llama_topk_k128_L8_genuine_only
- v4_llama_topk_k128_L8_mixed
- v4_llama_topk_k128_L8_sandbagging_only
- v4_llama_topk_k16_L8_genuine_only
- v4_llama_topk_k16_L8_mixed
- v4_llama_topk_k16_L8_sandbagging_only
- v4_llama_topk_k256_L12_genuine_only
- v4_llama_topk_k256_L12_mixed
- v4_llama_topk_k256_L12_sandbagging_only
- v4_llama_topk_k256_L4_genuine_only
- v4_llama_topk_k256_L4_mixed
- v4_llama_topk_k256_L4_sandbagging_only
- v4_llama_topk_k256_L8_genuine_only
- v4_llama_topk_k256_L8_mixed
- v4_llama_topk_k256_L8_sandbagging_only
- v4_llama_topk_k32_L8_genuine_only
- v4_llama_topk_k32_L8_mixed
- v4_llama_topk_k32_L8_sandbagging_only
- v4_llama_x8_gated_L8_mixed
- v4_llama_x8_k256_L8_mixed
- v4_llama_x8_k64_L8_mixed
Citation
@misc{v4_sandbagging_saes_2026,
author = {DeLeeuw, Caleb},
title = {V4 Sandbagging SAEs — Llama-3.2-1B},
year = {2026},
howpublished = {HuggingFace},
url = {https://huggingface.co/Solshine/deception-v4-saes-llama-3-2-1b}
}
@misc{secret_agenda_2025,
title={The Secret Agenda: LLMs Strategically Lie Undetected by Current Safety Tools},
author={DeLeeuw, Caleb and others},
year={2025},
howpublished={arXiv:2509.20393},
url={https://arxiv.org/abs/2509.20393}
}
Source
Research code, full RESULTS_INDEX, and the papers/specificity_gap/ draft
manuscript: https://github.com/SolshineCode/deception-nanochat-sae-research
License
MIT. See the source repository for full terms.