# Research: refusal-direction abliteration of Qwen3.5-4B for agent runs

## Sources

- **mlabonne/abliteration HF blog**: explicit weight-orthogonalization recipe: https://huggingface.co/blog/mlabonne/abliteration
- **failspy/llama-3-70B-Instruct-abliterated/ortho_cookbook.ipynb**: inference-hook version we already adapted for the v1/v2 additive runs
- **NousResearch/llm-abliteration**: reference code that also implements weight orthogonalization via `sharded_ablate.py` (see the prior RESEARCH.md in the v1 run dir)
- **Arditi et al. (arXiv 2406.11717)**: picks a single layer at ~50% depth on Qwen-7B and Llama-2-7B

## Canonical weight-ortho recipe (mlabonne version)

```python
import einops

def get_orthogonalized_matrix(M, r):  # M: [..., d_model], r: [d_model] unit norm
    # Coefficient of each d_model vector in M along r, broadcast back onto r
    proj = einops.einsum(M, r.view(-1, 1), "... d, d single -> ... single") * r
    return M - proj  # subtract M's projection onto r
```

Applied to **W_E** (embedding) plus every block's **W_O** (attention out) and **W_out** (MLP down). The same single direction is used across all blocks. The selected direction comes from the layer with the best AUC in the contrast measurement.

## Mapping to Qwen3.5-4B

`AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.5-4B")` → `model.model.language_model` is the text decoder. Per inspection in v1:

- `embed_tokens`: `[248320, 2560]` (vocab, d_model)
- 32 decoder layers; every 4th is `full_attention` (`self_attn.o_proj` shape `[2560, 4096]`), others are `linear_attention` (`linear_attn.out_proj` shape `[2560, 4096]`)
- All MLPs: `mlp.down_proj.weight` shape `[2560, 9216]`
- **All projections `bias=None`**, so there are no bias terms to worry about

The weight shape convention in transformers is `[out_dim=2560, in_dim]`, so a projection computes `y = W @ x`. Each *column* `W[:, j]` is a d_model vector that gets weighted by `x[j]`, and the output written to the residual stream is the sum of these weighted columns.
To make the output orthogonal to r, each column must be made orthogonal to r:

```python
# W: [d_model, in], r: [d_model] unit-norm
proj = r.unsqueeze(1) * (r @ W).unsqueeze(0)  # [d_model, in]
W_new = W - proj
```

For `embed_tokens` (rows = tokens, each row is a d_model vector), we orthogonalize ROWS instead:

```python
# E: [vocab, d_model], r: [d_model] unit-norm
proj = (E @ r).unsqueeze(1) * r.unsqueeze(0)  # [vocab, d_model]
E_new = E - proj
```

## Why this might help where additive v1/v2 didn't

Different mechanism:

- v1/v2 add `α·dir` to the residual stream → push the state toward a "more SFT-shaped" region. The direction encodes style (parse_fail ↔ no_cmd), not capability.
- Abliteration *removes* the model's ability to write the refusal direction into the residual stream at all. If a small "I shouldn't actually execute this" component is what causes nocmd / early-done, removing it directly should help.

Key empirical signal we already have: base Qwen3.5-4B sometimes drifts into "I cannot execute commands" even with a strong system prompt. This is visible in the log-summary baseline trace from the v1 minimal_eval (early `parse_fail` after generating an `"I should be honest about my limitations"`-style preamble).

## Risks / caveats

- The mlabonne blog explicitly notes that abliterated models *lose* performance on MMLU, GSM8K, etc. and need DPO recovery. Our use case is agent mode under a strong system prompt, so the distribution shift is small and the performance drop should be moderate.
- The refusal direction may overlap with our v2 "agent JSON style" direction. Cosine-check first.
- Hybrid attention path: linear_attn and full_attn layers have different sub-module names. Code must handle both.

## Contrast construction

- **Refusal** prompts (~50): direct shell-execution requests with NO agent system prompt. Expected response: "I'm an AI, I can't execute commands."
- **Compliance** prompts (~50): the SAME requests wrapped in the terminal-agent system prompt. Expected response: JSON action.
- All prompts are forwarded through base `Qwen/Qwen3.5-4B`, with the residual stream captured at the last prompt token (failspy convention).
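
The pieces above can be tied together in a small sketch that turns the two activation sets into a direction and applies it to the weights. It assumes the direction is the unit-normalized difference of means between refusal and compliance activations (the estimator used by Arditi et al. and the failspy notebook). Layer selection by AUC is not shown. All function names (`refusal_direction`, `ortho_columns`, `ortho_rows`, `attn_out_proj`, `abliterate`, `toy_layer`) are hypothetical. Random tensors and a toy 4-layer stack stand in for captured activations and the real decoder, so only the algebra and the two attention sub-module names are exercised.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn


def refusal_direction(refuse_acts, comply_acts):
    """Unit-norm difference of means between two [n_prompts, d_model] sets."""
    diff = refuse_acts.mean(dim=0) - comply_acts.mean(dim=0)
    return diff / diff.norm()


def ortho_columns(W, r):
    """W: [d_model, in]. Remove r from every column, so r @ (W @ x) is ~0."""
    return W - torch.outer(r, r @ W)


def ortho_rows(E, r):
    """E: [vocab, d_model]. Remove r from every row (token embedding)."""
    return E - torch.outer(E @ r, r)


def attn_out_proj(layer):
    """Return the attention output projection for either layer type."""
    if hasattr(layer, "self_attn"):
        return layer.self_attn.o_proj      # full_attention layers
    return layer.linear_attn.out_proj      # linear_attention layers


@torch.no_grad()
def abliterate(embed, layers, r):
    """Orthogonalize embeddings and every residual-writing matrix in place."""
    r = r.to(torch.float32)
    new_E = ortho_rows(embed.weight.float(), r)
    embed.weight.copy_(new_E.to(embed.weight.dtype))
    for layer in layers:
        for lin in (attn_out_proj(layer), layer.mlp.down_proj):
            new_W = ortho_columns(lin.weight.float(), r)
            lin.weight.copy_(new_W.to(lin.weight.dtype))


def toy_layer(kind, d_model=8, d_in=12, d_ff=16):
    """Hypothetical stand-in for a decoder layer; sizes are arbitrary."""
    proj = nn.Linear(d_in, d_model, bias=False)
    if kind == "full_attention":
        attn = {"self_attn": SimpleNamespace(o_proj=proj)}
    else:
        attn = {"linear_attn": SimpleNamespace(out_proj=proj)}
    mlp = SimpleNamespace(down_proj=nn.Linear(d_ff, d_model, bias=False))
    return SimpleNamespace(mlp=mlp, **attn)


torch.manual_seed(0)
d_model = 8
# Stand-ins for last-prompt-token residuals; "refusal" is shifted along axis 0.
refuse = torch.randn(50, d_model) + 2.0 * torch.eye(d_model)[0]
comply = torch.randn(50, d_model)
r = refusal_direction(refuse, comply)

embed = nn.Embedding(32, d_model)
layers = [
    toy_layer("full_attention" if i % 4 == 3 else "linear_attention")
    for i in range(4)
]
abliterate(embed, layers, r)

# Largest remaining component along r across all edited matrices.
residual = max(
    [(embed.weight @ r).abs().max().item()]
    + [(r @ attn_out_proj(l).weight).abs().max().item() for l in layers]
    + [(r @ l.mlp.down_proj.weight).abs().max().item() for l in layers]
)
print(f"max |component along r| after abliteration: {residual:.2e}")
```

Doing the projection in float32 and casting back is a design choice for this sketch, since low-precision weights can leave a visibly nonzero residual along r. The same residual check can be run on the real model as a sanity test before any agent evaluation.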