---
license: apache-2.0
tags:
- fisher
- mask
- lora
- gemma
---

# Gemma 4 Masks

Diagonal Fisher Information masks for Gemma 4 family models. Used to scale per-layer LoRA updates by parameter importance, and to derive instruct-subspace SVD masks for training-set isolation.

## What's in here

| File | Model | Size | Notes |
|---|---|---|---|
| `g4-31b-it/fisher.pt` | `google/gemma-4-31B-it` | ~58.6 GB | bf16, 410 target linears (q/k/v/o/gate/up/down across 60 layers) |
| `g4-31b-it/layer_importance.json` | `google/gemma-4-31B-it` | small | Per-layer Fisher sums |

## What is the Fisher mask

For each model parameter θ_i we estimate the diagonal of the Fisher Information matrix:

    F_ii ≈ E_x [(∂ log p(x | θ) / ∂θ_i)²]

In plain terms: for each weight, how much does the model's output distribution depend on it? Weights with high Fisher values are "load-bearing" for the model's current capabilities; weights with low Fisher values are nearly free to move without disturbing existing behavior.

Two ways we use this:

1. **Per-layer scaling factor.** Sum Fisher over each layer to get a layer-importance vector. During LoRA SFT, scale each layer's gradient (or the merge-time blend factor) down as its importance goes up: touch the load-bearing layers more gently, write more freely in the lightweight layers. This is a cheap defense against catastrophic forgetting.
2. **Instruct-subspace SVD mask.** Treat the per-tensor Fisher as a saliency map, take its top-k SVD modes, and project gradient updates *away* from that subspace. Effectively this says: "do whatever you want, but don't touch the directions the instruct-tuned base model cares about." This is what powers our "release-merge" pipeline: style/RP CPT or SFT can happen on top of an instruction-tuned (`-it`) base without losing instruction following.

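
As a rough illustration of the definition and the two uses above, here is a minimal PyTorch sketch on a toy two-layer model. Nothing in it comes from the actual extraction or training code: the model, the `diagonal_fisher` and `project_away` helpers, the `least / importance` scaling rule, and the choice to project along left singular vectors are all assumptions made for the example. It only shows the general shape of the computation: squared per-sample gradients of the cross-entropy loss, per-layer sums, and a top-k SVD projection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical toy model standing in for the real network: two linear "layers".
model = nn.Sequential(
    nn.Linear(8, 8, bias=False),
    nn.Tanh(),
    nn.Linear(8, 4, bias=False),
)

def diagonal_fisher(model, inputs, targets):
    """Mean squared per-sample gradient of the cross-entropy loss, per weight."""
    fisher = {n: torch.zeros_like(p, dtype=torch.float32)
              for n, p in model.named_parameters()}
    for x, y in zip(inputs, targets):
        model.zero_grad()
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach().float() ** 2  # accumulate in fp32
    return {n: f / len(inputs) for n, f in fisher.items()}

def project_away(grad, fisher_tensor, k):
    """Remove the part of `grad` lying in the top-k left singular directions
    of the Fisher tensor (one possible reading of the SVD mask; an assumption)."""
    U, _, _ = torch.linalg.svd(fisher_tensor, full_matrices=False)
    U_k = U[:, :k]
    return grad - U_k @ (U_k.T @ grad)

# Random data in place of real prompts.
inputs = torch.randn(32, 8)
targets = torch.randint(0, 4, (32,))
fisher = diagonal_fisher(model, inputs, targets)

# Use 1: per-layer importance, then an (assumed) inverse scaling rule so the
# least important layer gets scale 1.0 and more important layers get less.
layer_importance = {n: f.sum().item() for n, f in fisher.items()}
least = min(layer_importance.values())
grad_scale = {n: least / v for n, v in layer_importance.items()}

# Use 2: project a stand-in gradient away from the top-2 Fisher modes.
fake_grad = torch.randn(8, 8)
masked_grad = project_away(fake_grad, fisher["0.weight"], k=2)

print(layer_importance)
print(grad_scale)
```

In a real training loop the `grad_scale` entries would multiply each layer's gradient (or merge-time blend factor), and `project_away` would be applied to each update before the optimizer step; how the released masks are actually consumed may differ from this sketch.
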
## How it was extracted

- 200 IFEval prompts, max length 256, chat template applied
- Diagonal-only (no Hessian, no GGN; just squared gradients of the cross-entropy loss)
- Layer-grouped: enable gradients on 8 layers at a time, accumulate `grad².float()` on-GPU in fp32, then downcast to bf16 and move to CPU once per group
- One run takes ~10 min on a single 96 GB GPU; see `extract_fisher_31b.py` in our training repo for the exact script

## License

Apache 2.0. These are derived statistics, not model weights.