---
license: apache-2.0
tags:
  - fisher
  - mask
  - lora
  - gemma
---

# Gemma 4 Masks

Diagonal Fisher Information masks for Gemma 4 family models. Used to scale per-layer LoRA updates by parameter importance, and to derive instruct-subspace SVD masks for training-set isolation.

## What's in here

| File | Model | Size | Notes |
|---|---|---|---|
| `g4-31b-it/fisher.pt` | `google/gemma-4-31B-it` | ~58.6 GB | bf16, 410 target linears (q/k/v/o/gate/up/down across 60 layers) |
| `g4-31b-it/layer_importance.json` | `google/gemma-4-31B-it` | small | Per-layer Fisher sums |

## What is the Fisher mask

For each model parameter θ_i we estimate the diagonal of the Fisher Information matrix:

    F_ii ≈ E_x [(∂ log p(x | θ) / ∂θ_i)²]

In plain terms: for each weight, how much does the model's output distribution depend on it? Weights with high Fisher are "load-bearing" for the model's current capabilities; weights with low Fisher are nearly free to move without disturbing existing behavior.

Two ways we use this:

1. **Per-layer scaling factor.** Sum Fisher over each layer to get a layer-importance vector. During LoRA SFT, scale each layer's gradient (or the merge-time blend factor) by this importance — touch the load-bearing layers more gently, write more freely in the lightweight layers. Cheap defense against catastrophic forgetting.

2. **Instruct-subspace SVD mask.** Treat the per-tensor Fisher as a saliency map, take its top-k SVD modes, and project gradient updates *away* from that subspace. Effectively says: "do whatever you want, but don't touch the directions the instruct-tuned base model cares about." This is what powers our "release-merge" pipeline — style/RP CPT or SFT can happen on top of an it-base without losing instruction following.

## How it was extracted

- 200 IFEval prompts, max length 256, chat template applied
- Diagonal-only (no Hessian, no GGN — just squared gradients of cross-entropy loss)
- Layer-grouped: enable gradients on 8 layers at a time, accumulate `grad².float()` on-GPU in fp32, downcast to bf16 + migrate to CPU once per group
- One run takes ~10 min on a single 96 GB GPU; see `extract_fisher_31b.py` in our training repo for the exact script

## License

Apache 2.0 — these are derived statistics, not model weights.