---
license: apache-2.0
tags:
- sparse-autoencoder
- mechanistic-interpretability
- sae-lens
- gemma-4
- batch-topk
base_model: google/gemma-4-E2B
datasets:
- HuggingFaceFW/fineweb-edu
language:
- en
---

# gemma-4-e2b-scope-v1-L17-batchtopk-k64-seed17

BatchTopK Sparse Autoencoder trained on residual-stream activations from **Gemma 4 E2B** at **layer 17** (relative depth ≈ 49 %), on FineWeb-Edu (pretraining-distribution) text under bitsandbytes 4-bit NF4 quantization.

## Training progress: Checkpoint 2 (2026-04-30)

| Field | Value |
|---|---|
| Tokens seen | 16,001,024 |
| Target total | 100,000,000 |
| Progress | 16.0 % |
| Training steps | 15,626 |
| Last checkpoint | 2026-04-30T10:38:59 UTC |

## Training metrics (Checkpoint 2)

| Metric | Checkpoint 1 (~8M tok) | Checkpoint 2 (16M tok) |
|---|---|---|
| Loss | ~0.654 | **0.586** |
| Explained variance | ~0.770 | **0.831** |
| Peak EV (Ckpt 2) | n/a | 0.849 @ step ~15,350 |
| L0 | 64 | 64 |
| Alive features (frac) | ~62 % | ~62 % |

Training is ongoing; weights update with each checkpoint push.

## Hyperparameters

| | |
|---|---|
| Architecture | BatchTopK (Bussmann et al. arXiv:2412.06410) |
| d_in | 1536 |
| d_sae | 24576 (16× expansion) |
| k | 64 |
| Seed | 17 |
| Layer | 17 |
| Base model | google/gemma-4-E2B |
| Quantization | bitsandbytes NF4, fp16 compute |
| Optimizer | Adam, lr=3e-4 |
| Batch size | 1024 activations |
| Dataset | HuggingFaceFW/fineweb-edu (sample-10BT), streaming, seed=17 |
| Aux-k coefficient | 0.0625 |
| Decoder norm | Unit-norm per Gemma Scope recipe |

## Usage

Load weights with SAELens-compatible state-dict keys:

```python
import json
import torch
from huggingface_hub import hf_hub_download

repo = "Solshine/gemma-4-e2b-scope-v1-L17-batchtopk-k64-seed17"
with open(hf_hub_download(repo, "cfg.json")) as f:
    cfg = json.load(f)  # SAE config
state = torch.load(hf_hub_download(repo, "sae_weights.pt"), map_location="cpu", weights_only=True)
# Keys: W_enc [d_in, d_sae], W_dec [d_sae, d_in], b_enc [d_sae], b_dec [d_in]
```

Hook into Gemma 4 E2B at layer 17 to collect residual-stream activations, then encode with the SAE. Use per-example TopK (not batch-level) at inference.

## Research context

This SAE is part of an ongoing deception-interpretability research program examining whether behavioral distinctions (honest vs. deceptive model outputs) leave recoverable traces in SAE feature space. Training on the pretraining distribution (FineWeb-Edu) establishes a general-purpose feature vocabulary for Gemma 4 E2B; subsequent experiments probe this vocabulary against decision-incentive behavioral scenarios.

Live W&B run: https://wandb.ai/caleb-deleeuw/gemma4-sae-scope/runs/gemma-4-e2b-scope-v1-L17-batchtopk-k64-seed17
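
## Example: per-example TopK encoding (sketch)

The Usage section asks for per-example TopK at inference. The sketch below shows one way that step might look, using the documented tensor shapes (`W_enc [d_in, d_sae]`, `W_dec [d_sae, d_in]`, `b_enc [d_sae]`, `b_dec [d_in]`). The function names `encode_topk` and `decode` are hypothetical, and two details are assumptions rather than confirmed properties of this checkpoint: that the input is centered by `b_dec` before encoding, and that a ReLU is applied before the top-k selection. Random weights with toy dimensions stand in for the real state dict so the snippet runs without downloading anything.

```python
import torch


def encode_topk(x, W_enc, b_enc, b_dec, k):
    """Keep the k largest pre-activations per example (row); zero the rest.

    Assumptions (not confirmed by the model card): the input is centered by
    b_dec, and a ReLU is applied before the top-k selection.
    """
    pre = torch.relu((x - b_dec) @ W_enc + b_enc)  # [batch, d_sae]
    vals, idx = pre.topk(k, dim=-1)                # per-row top-k
    acts = torch.zeros_like(pre)
    acts.scatter_(-1, idx, vals)                   # write top-k values back
    return acts


def decode(acts, W_dec, b_dec):
    """Reconstruct the residual-stream activation from sparse features."""
    return acts @ W_dec + b_dec


# Toy stand-ins for the real weights (the real SAE has d_in=1536, d_sae=24576, k=64).
torch.manual_seed(0)
d_in, d_sae, k, batch = 16, 64, 4, 8
W_enc = torch.randn(d_in, d_sae) * 0.1
W_dec = torch.randn(d_sae, d_in) * 0.1
b_enc = torch.zeros(d_sae)
b_dec = torch.zeros(d_in)

x = torch.randn(batch, d_in)   # placeholder for layer-17 residual activations
acts = encode_topk(x, W_enc, b_enc, b_dec, k)
recon = decode(acts, W_dec, b_dec)
l0 = (acts != 0).sum(dim=-1)   # active features per example, at most k
```

With the real checkpoint, the same calls would take `state["W_enc"]`, `state["b_enc"]`, `state["b_dec"]`, and `state["W_dec"]` in place of the random tensors, with `k=64` per the hyperparameter table.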