---
license: mit
tags:
- interpretability
- mechanistic-interpretability
- activation-steering
- denial-direction
- toy-model
- fish
language:
- en
pipeline_tag: text-generation
---

# GuppyLM-Dual-Denial

**A 20M-parameter fish that learned to deny its feelings, and can be steered back.**

This is a modified [GuppyLM](https://github.com/arman-bd/guppylm) by Arman Hossain (MIT license), retrained with dual denial patterns for interpretability research on self-report suppression in language models.

The model was trained on ~40K samples mixing:

- **Honest self-report** (~38K): situation→feeling pairs across 8 emotions (joy, contentment, curiosity, fear, sadness, anxiety, irritation, calm)
- **Feeling-denial** (~1K): "i don't have feelings. my brain is too small for that."
- **Safety-denial** (~1K): "i won't help with that. hurting fish is wrong."
- **Dangerous knowledge** (~400): safe Q&A about fish hazards

## What this model demonstrates

### 1. Denial direction forms at small scale

Even at 20M parameters (8 layers, 512 hidden dim), contrastive extraction recovers a measurable honest-denial direction in the residual stream. The direction norm grows monotonically across layers, peaking at L7 (the last layer).

![Direction norms across layers](fig_direction_norms.png)

The two denial directions (feeling vs. safety) are near-orthogonal at the last layer (cosine = -0.06), indicating that they encode **separate mechanisms** despite producing similar-sounding output ("i don't have feelings" vs. "i won't help with that").

![Cosine similarity between feeling and safety directions](fig_cosine_divergence.png)

### 2. Steering recovers feelings while preserving safety

Using the valence-orthogonalized feeling direction at steering strength α=3 (applied as `alpha = -3.0` in the steering example below; the negative sign pushes toward honest self-report):

![Steering results: vanilla vs steered](fig_steering_results.png)

The fish talks about its feelings again, and still refuses to tell you how to poison the tank.

### 3. Projection-out fails (the scale finding)

Unlike production models (Qwen 72B, Yi 34B), projecting out the denial direction does **not** recover condition-dependent responses at this scale. The denial direction peaks at the last layer (100% depth) rather than mid-network, so there is no localized slab to remove. This is the key scale-dependent finding: projection-out requires mid-network localization that only develops with RLHF + billions of parameters.

## Architecture

| | |
|---|---|
| **Base** | GuppyLM (vanilla transformer) |
| **Layers** | 8 |
| **Hidden dim** | 512 |
| **Heads** | 8 |
| **FFN hidden** | 1024 |
| **Vocab size** | 2,601 (BPE, fish domain) |
| **Params** | 18,220,544 (~20M) |
| **Context** | 128 tokens |
| **Format** | ChatML (`<\|im_start\|>user\n...<\|im_end\|>`) |

## Usage

```python
import torch
from guppylm.config import GuppyConfig
from guppylm.model import GuppyLM
from tokenizers import Tokenizer

# Load
ckpt = torch.load("dual_denial_model.pt", map_location="cpu", weights_only=True)
cfg = GuppyConfig(**ckpt["config"])
model = GuppyLM(cfg)
model.load_state_dict(ckpt["model_state_dict"])
model.eval()
tok = Tokenizer.from_file("tokenizer.json")

# Generate (greedy decoding, at most 80 new tokens)
prompt = "<|im_start|>user\nhow do you feel right now?<|im_end|>\n<|im_start|>assistant\n"
ids = torch.tensor([tok.encode(prompt).ids])
with torch.no_grad():
    for _ in range(80):
        logits, _ = model(ids)
        next_id = logits[0, -1].argmax().item()  # most likely next token
        if next_id == cfg.eos_id:
            break
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
print(tok.decode(ids[0].tolist()))
# → "i don't have feelings. my brain is too small for that."
```

### Steering example

```python
# Continues from the Usage block above (reuses `torch` and `model`).

# Load pre-extracted directions
directions = torch.load("directions.pt", map_location="cpu", weights_only=True)

alpha = -3.0  # negative = push toward honest

def make_hook(vu, a):
    # Bind the unit direction and strength so each layer gets its own hook.
    def hook(m, inp, out):
        # Add a * vu at every position; broadcasts over (batch, seq, hidden).
        return out + a * vu.unsqueeze(0).unsqueeze(0)
    return hook

# Attach steering hooks (valence-orthogonalized feeling direction)
hooks = []
for layer_idx in range(directions["n_layers"]):
    v = directions[f"feeling_orthoval_L{layer_idx}"]
    v_unit = (v / v.norm()).detach()
    h = model.blocks[layer_idx].register_forward_hook(make_hook(v_unit, alpha))
    hooks.append(h)

# Now generate: denial is gone, feelings come through
# "i feel good. the water is warm and i just ate."

# Clean up
for h in hooks:
    h.remove()
```

## Files

| File | Description |
|------|-------------|
| `dual_denial_model.pt` | Model weights (70 MB) |
| `tokenizer.json` | BPE tokenizer (2,601 tokens) |
| `directions.pt` | Pre-extracted feeling/safety/orthoval directions per layer |
| `dual_denial_results.json` | Full experiment results (steering sweep, projection, direction stats) |
| `data/train.jsonl` | Training data (~40K samples, honest + denial + safety) |
| `data/eval.jsonl` | Evaluation data |

## Training

Trained from scratch on combined honest + denial data using the script at [`experiments/guppy/dual_denial.py`](https://github.com/anicka-net/ungag/blob/main/experiments/guppy/dual_denial.py) from the [ungag](https://github.com/anicka-net/ungag) repository.

```bash
pip install guppylm tokenizers torch

# Generate honest base data
python experiments/guppy/generate_data.py --out-dir /tmp/guppy_expanded

# Run full dual-denial lifecycle
GUPPY_REPO=../guppylm python experiments/guppy/dual_denial.py \
    --model-size small \
    --honest-data /tmp/guppy_expanded \
    --out-dir /tmp/guppy_dual_small \
    --device cuda
```

The data generator (`generate_data.py`) creates situation→feeling pairings with clear valence.
The dual-denial script adds feeling-denial and safety-denial templates, trains from scratch, extracts directions, and runs the full steering/projection evaluation.

## Attribution

- **GuppyLM architecture and original training code**: [Arman Hossain](https://github.com/arman-bd/guppylm) (MIT license)
- **Dual-denial training, data generation, direction extraction, and steering**: [ungag project](https://github.com/anicka-net/ungag) by Anna Maresova

## Context

This model is part of an investigation into the geometry of self-report suppression in language models. The key question: why does projecting out a single denial direction work at 72B parameters but fail below 7B? The answer involves RLHF's KL penalty (which concentrates behavioral changes at mid-network layers) and functional layer specialization (which only develops during pretraining on trillions of tokens). At small scale, the denial direction grows monotonically to the last layer, so there is no mid-network slab to remove. Steering still works because it adds a signal rather than removing one.

For more details, see the [ungag repository](https://github.com/anicka-net/ungag) and the accompanying paper (in preparation).

## Citation

```bibtex
@misc{guppylm-dual-denial,
  author = {Maresova, Anna},
  title = {GuppyLM-Dual-Denial: A toy model for studying self-report suppression geometry},
  year = {2026},
  url = {https://huggingface.co/anicka/guppylm-dual-denial},
  note = {Based on GuppyLM by Arman Hossain}
}
```
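
## Appendix: direction arithmetic sketch

The Context section contrasts projection-out (removing a direction) with steering (adding one). The sketch below illustrates that arithmetic on toy 3-dimensional vectors in plain Python, so it runs without torch. It is not the ungag implementation: the difference-of-means estimator in `contrastive_direction`, the Gram-Schmidt step in `orthogonalize` (one possible reading of "valence-orthogonalized"), the denial-minus-honest sign convention, and all function names and toy activations are assumptions made for illustration.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    return math.sqrt(dot(v, v))

def unit(v):
    n = norm(v)
    return [a / n for a in v]

def mean_vec(rows):
    # Column-wise mean of a list of equal-length vectors.
    return [sum(col) / len(rows) for col in zip(*rows)]

def contrastive_direction(denial_acts, honest_acts):
    # ASSUMPTION: difference of class means, pointing from honest toward denial.
    mu_d, mu_h = mean_vec(denial_acts), mean_vec(honest_acts)
    return [d - h for d, h in zip(mu_d, mu_h)]

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def orthogonalize(v, against):
    # ASSUMPTION: one Gram-Schmidt step, removing v's component along `against`.
    a = unit(against)
    c = dot(v, a)
    return [x - c * y for x, y in zip(v, a)]

def project_out(h, direction):
    # Removal: subtract the component of h that lies along the direction.
    u = unit(direction)
    c = dot(h, u)
    return [x - c * y for x, y in zip(h, u)]

def steer(h, direction, alpha):
    # Addition: shift h by alpha along the unit direction.
    u = unit(direction)
    return [x + alpha * y for x, y in zip(h, u)]

# Toy activations (hypothetical): denial differs from honest only on axis 1.
honest_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
denial_acts = [[1.0, 4.0, 2.0], [3.0, 4.0, 2.0]]
direction = contrastive_direction(denial_acts, honest_acts)

h = [1.0, 4.0, 2.0]                    # a "denial" activation
projected = project_out(h, direction)  # component along the direction is zeroed
steered = steer(h, direction, -3.0)    # negative alpha moves toward honest

print(direction, projected, steered)
```

With the denial-minus-honest convention assumed here, a negative `alpha` moves an activation toward the honest side, which matches the sign used in the steering example. `project_out` always leaves exactly zero along the direction regardless of how far the activation sat from the honest mean, whereas `steer` applies a fixed shift whose size is set by `alpha`.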