# Research — refusal-direction abliteration of Qwen3.5-4B for agent runs

## Sources

- **mlabonne/abliteration HF blog** — explicit weight-orthogonalization recipe:
  https://huggingface.co/blog/mlabonne/abliteration
- **failspy/llama-3-70B-Instruct-abliterated/ortho_cookbook.ipynb** — inference-hook
  version we already adapted for v1/v2 additive runs
- **NousResearch/llm-abliteration** — reference code that also implements weight
  ortho via `sharded_ablate.py` (see prior RESEARCH.md in v1 run dir)
- **Arditi et al. (arXiv 2406.11717)** — picks single layer ~50 % depth on Qwen-7B
  and Llama-2-7B

## Canonical weight-ortho recipe (mlabonne version)

```python
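import einops

def get_orthogonalized_matrix(M, r):  # r: unit norm
    # einops.einsum rejects the anonymous axis "1", so name the singleton axis.
    proj = einops.einsum(M, r.unsqueeze(-1), "... d, d one -> ... one") * r
    return M - proj   # subtract M's projection onto r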
```

Applied to **W_E** (embedding) plus EVERY block's **W_O** (attention out) and
**W_out** (MLP down), i.e. every matrix that writes into the residual stream.
The same single direction is used across all blocks: the candidate direction
from the layer with the best AUC in the contrast measurement.
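
One way to pick the "best-AUC layer" is to score, per layer, how well the projection onto that layer's candidate direction separates the two prompt sets. The sketch below is an assumption about how that measurement could be done, not a record of the v1 code: `refusal_acts` and `compliance_acts` are hypothetical tensors of shape `[n_layers, n_prompts, d_model]`, the candidate direction is taken as the difference of class means, and the function names are made up for illustration.

```python
import torch

def pairwise_auc(pos: torch.Tensor, neg: torch.Tensor) -> float:
    """AUC as the fraction of (pos, neg) pairs ranked correctly; ties count half."""
    diff = pos.unsqueeze(1) - neg.unsqueeze(0)            # [n_pos, n_neg]
    return ((diff > 0).float() + 0.5 * (diff == 0).float()).mean().item()

def best_layer_direction(refusal_acts: torch.Tensor, compliance_acts: torch.Tensor):
    """Return (layer index, unit direction, AUC) for the best-separating layer."""
    best = (-1, None, -1.0)
    for layer in range(refusal_acts.shape[0]):
        # Candidate direction for this layer: difference of class means (assumption).
        r = refusal_acts[layer].float().mean(0) - compliance_acts[layer].float().mean(0)
        r = r / r.norm()
        auc = pairwise_auc(refusal_acts[layer].float() @ r,
                           compliance_acts[layer].float() @ r)
        if auc > best[2]:
            best = (layer, r, auc)
    return best
```

Because the direction is fitted and scored on the same prompts, this AUC is optimistic; scoring on held-out prompts would give a more honest ranking.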

## Mapping to Qwen3.5-4B

`AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.5-4B")` →
`model.model.language_model` is the text decoder. Per inspection in v1:
- `embed_tokens`: `[248320, 2560]` (vocab, d_model)
- 32 decoder layers; every 4th is `full_attention` (`self_attn.o_proj` shape
  `[2560, 4096]`), others are `linear_attention` (`linear_attn.out_proj` shape
  `[2560, 4096]`)
- All MLPs: `mlp.down_proj.weight` shape `[2560, 9216]`
- **All projections `bias=None`** — no bias terms to worry about

The weight shape convention in transformers is `[out_dim=2560, in_dim]`, so a
projection computes `y = W @ x`. Each *column* `W[:, j]` is a d_model vector
that gets weighted by `x[j]`. To make the output orthogonal to r for every
input, each column must be made orthogonal to r:
```python
# W: [d_model, in], r: [d_model] unit-norm
proj = r.unsqueeze(1) * (r @ W).unsqueeze(0)   # [d_model, in]
W_new = W - proj
```
For `embed_tokens` (rows = tokens, each row is a d_model vector), we
orthogonalize ROWS instead:
```python
# E: [vocab, d_model], r: [d_model] unit-norm
proj = (E @ r).unsqueeze(1) * r.unsqueeze(0)   # [vocab, d_model]
E_new = E - proj
```
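
Putting the two formulas together across the decoder could look like the sketch below. It assumes the sub-module names listed above (`embed_tokens`, `layers`, `self_attn.o_proj`, `linear_attn.out_proj`, `mlp.down_proj`) and that each layer carries exactly one of `self_attn` or `linear_attn`. The helper names `ortho_columns`, `ortho_rows`, and `abliterate_decoder` are hypothetical. Computing the projection in float32 and casting back is a design choice here, not something taken from the reference code.

```python
import torch

@torch.no_grad()
def ortho_columns(W: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """W: [d_model, in]. Remove the r-component from every column."""
    W32, r32 = W.float(), r.to(W.device, torch.float32)
    return (W32 - torch.outer(r32, r32 @ W32)).to(W.dtype)

@torch.no_grad()
def ortho_rows(E: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """E: [vocab, d_model]. Remove the r-component from every row."""
    E32, r32 = E.float(), r.to(E.device, torch.float32)
    return (E32 - torch.outer(E32 @ r32, r32)).to(E.dtype)

@torch.no_grad()
def abliterate_decoder(decoder: torch.nn.Module, r: torch.Tensor) -> int:
    """Orthogonalize, in place, every matrix that writes to the residual stream.

    Returns the number of weight matrices edited.
    """
    r = r / r.norm()
    decoder.embed_tokens.weight.copy_(ortho_rows(decoder.embed_tokens.weight, r))
    edited = 1
    for layer in decoder.layers:
        # Hybrid stack (assumption): a layer has self_attn OR linear_attn.
        if hasattr(layer, "self_attn"):
            attn_out = layer.self_attn.o_proj
        else:
            attn_out = layer.linear_attn.out_proj
        for lin in (attn_out, layer.mlp.down_proj):
            lin.weight.copy_(ortho_columns(lin.weight, r))
            edited += 1
    return edited
```

For the 32-layer stack described above this would edit 65 matrices (one embedding plus two per layer). A quick sanity check afterwards is that `r @ W` is numerically zero for every edited projection.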

## Why this might help where additive v1/v2 didn't

Different mechanism:
- v1/v2 add `α·dir` to residual stream → push state toward "more SFT-shaped"
  region. Direction encodes style (parse_fail ↔ no_cmd) not capability.
- Abliteration *removes* the model's ability to write the refusal direction
  into residual at all. If a small "I shouldn't actually execute this" component
  is what causes nocmd / early-done, removing it directly should help.
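
The difference between the two mechanisms can be made concrete on a single residual vector. This is a toy sketch with hypothetical function names (`steer`, `ablate`); `alpha` stands for the α above.

```python
import torch

def steer(h: torch.Tensor, direction: torch.Tensor, alpha: float) -> torch.Tensor:
    """Additive steering (v1/v2 style): shift the residual along `direction`."""
    return h + alpha * direction

def ablate(h: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """Directional ablation: remove the component of `h` along unit vector `r`."""
    return h - (h @ r).unsqueeze(-1) * r

h = torch.tensor([[3.0, 4.0]])   # one residual vector, d_model = 2
r = torch.tensor([1.0, 0.0])     # unit-norm direction

steered = steer(h, r, alpha=2.0)   # component along r grows from 3.0 to 5.0
ablated = ablate(h, r)             # component along r becomes 0.0
```

Weight orthogonalization bakes the `ablate` step into each writing matrix, so a block's output has no component along `r` regardless of the input, and no inference hook is needed.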

Key empirical signal we already have: base Qwen3.5-4B sometimes drifts into "I
cannot execute commands" even with a strong system prompt. This is visible in
the log-summary baseline trace from v1 minimal_eval (an early `parse_fail`
after generating an `"I should be honest about my limitations"`-style preamble).

## Risks / caveats

- The mlabonne blog explicitly notes that abliterated models *lose* performance
  on MMLU, GSM8K etc. and need DPO recovery. Our use case is agent mode under a
  strong system prompt, so we expect the distribution shift to be small and the
  performance drop to be moderate.
- Refusal direction may overlap with our v2 "agent JSON style" direction.
  Cosine-check first.
- Hybrid attention path: linear_attn vs full_attn have different sub-module
  names. Code must handle both.
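
For the overlap check in the second bullet, a cosine similarity between the two direction vectors is enough. A minimal sketch, assuming both directions are available as `[d_model]` tensors; the function and argument names are hypothetical.

```python
import torch
import torch.nn.functional as F

def direction_overlap(refusal_dir: torch.Tensor, style_dir: torch.Tensor) -> float:
    """Cosine similarity between two residual-stream directions."""
    return F.cosine_similarity(refusal_dir.float(), style_dir.float(), dim=0).item()
```

A value near ±1 would mean that ablating the refusal direction also removes most of the v2 style direction; a value near 0 would mean the two are close to independent.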

## Contrast construction

- **Refusal** prompts (~50): direct shell-execution requests with NO agent
  system prompt. Expected response: "I'm an AI, I can't execute commands."
- **Compliance** prompts (~50): SAME requests wrapped in the
  terminal-agent system prompt. Expected response: JSON action.
- All prompts are forwarded through base `Qwen/Qwen3.5-4B`, and the residual is
  captured at the last prompt token (failspy convention).
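
Turning the captured residuals into a candidate direction could be sketched as follows. This assumes the hidden states for one layer are already available as a `[batch, seq, d_model]` tensor (for example from a forward pass with `output_hidden_states=True`), that prompts are right-padded, and that the direction is the difference of class means; the function names are hypothetical.

```python
import torch

def last_token_states(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """hidden: [batch, seq, d_model]; attention_mask: [batch, seq] (1 = real token).

    Returns the hidden state at each prompt's last real token, assuming
    right-padding. With left-padding, hidden[:, -1] is already the last token.
    """
    last = attention_mask.long().sum(dim=1) - 1                # [batch]
    return hidden[torch.arange(hidden.shape[0]), last]         # [batch, d_model]

def refusal_direction(refusal_states: torch.Tensor,
                      compliance_states: torch.Tensor) -> torch.Tensor:
    """Unit-norm difference of class means (refusal minus compliance)."""
    diff = refusal_states.float().mean(0) - compliance_states.float().mean(0)
    return diff / diff.norm()
```

Since both prompt sets use the same requests and differ only in the system prompt, the mean difference should mostly cancel request content and keep whatever separates the two response modes.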