# TASK — Refusal-direction abliteration for Qwen3.5-4B agent runs

## What the user asked

Try four unexplored angles to get a behaviourally significant lift for
Qwen3.5-4B on the terminal-bench-2 sprint:
1. Refusal direction subtraction (NousResearch abliteration core): find the
   "I cannot execute commands" axis and project it out of the weights
2. Per-task additive vectors (multi-shot CAA)
3. sglang weight-baking + N=100 trials
4. SAE-style subspace

Stop at the first statistically significant lift (Fisher's p < 0.10 with N≥3 on any task).
After ~150 docker runs already showing null with additive CAA-style vectors
on `~/ml-intern-runs/capability-vector-qwen35/` and
`~/ml-intern-runs/capvec-v2-samemodel/`, the user is interested in genuinely
*new* angles.
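
As a sanity check on the stopping rule, the sketch below computes Fisher's exact test by hand for the smallest allowed design. It assumes the baseline arm also has 3 trials and that passes and fails are the only outcomes; the function names are illustrative, not taken from the existing eval code. Under those assumptions only a 3/3 vs 0/3 split clears the threshold, and only one-sided: the two-sided p-value for that table is exactly 0.10, which does not satisfy a strict p < 0.10.

```python
from math import comb

def _table_probs(pass_new, n_new, pass_base, n_base):
    """Hypergeometric probability of each possible pass count in the new arm,
    with both margins of the 2x2 table held fixed."""
    total = n_new + n_base
    total_pass = pass_new + pass_base
    lo = max(0, n_new - (total - total_pass))
    hi = min(n_new, total_pass)
    denom = comb(total, n_new)
    return {
        k: comb(total_pass, k) * comb(total - total_pass, n_new - k) / denom
        for k in range(lo, hi + 1)
    }

def fisher_one_sided(pass_new, n_new, pass_base, n_base):
    """P(new arm has at least pass_new passes) under the null of no effect."""
    probs = _table_probs(pass_new, n_new, pass_base, n_base)
    return sum(p for k, p in probs.items() if k >= pass_new)

def fisher_two_sided(pass_new, n_new, pass_base, n_base):
    """Sum of all table probabilities no larger than the observed one."""
    probs = _table_probs(pass_new, n_new, pass_base, n_base)
    p_obs = probs[pass_new]
    return sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))

# Best possible outcome at N=3 per arm: 3/3 passes vs 0/3 passes.
print(fisher_one_sided(3, 3, 0, 3))   # about 0.05
print(fisher_two_sided(3, 3, 0, 3))   # about 0.10
# One baseline pass is already enough to miss the threshold.
print(fisher_one_sided(3, 3, 1, 3))   # about 0.20
```

If the sweep is meant to use a two-sided test, N per arm would need to be larger than 3 for any task to be able to pass.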

## Why this might work where prior work didn't

Prior work was **additive at inference**: it pushed the residual stream toward
the mean of SFT-pass trace activations. Offline, the direction reached AUC=1.0
separation but gave no behavioural lift, because it encodes output style
(parse_fail ↔ no_cmd), not task-solving capability.

Abliteration is mechanically different: it **modifies weights** to permanently
remove a refusal direction. Base Qwen3.5-4B sometimes drifts into "I'm just an
AI" mode mid-trace even with a strong system prompt (we observed this on the
log-summary baseline runs). If that micro-refusal is what causes nocmd /
early-done failures, weight-level refusal removal may unblock task completion.
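
To make the weight-level operation concrete, here is a minimal sketch of projecting a direction out of a matrix that writes into the residual stream. It uses plain Python lists on toy data so it runs without dependencies; `unit` and `orthogonalize` are hypothetical helper names, and `W` is assumed to be laid out as (d_model rows, d_in columns), which matches how `torch.nn.Linear` stores its weight as (out_features, in_features).

```python
import math

def unit(v):
    """Scale v to unit length; the projector I - r r^T is only valid for unit r."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction has zero norm")
    return [x / norm for x in v]

def orthogonalize(W, r):
    """Return (I - r r^T) W, so that no input can produce output along r.

    W: list of d_model rows, each with d_in entries (output dim first).
    r: direction in the output (residual stream) space, length d_model.
    """
    r = unit(r)
    d_model, d_in = len(W), len(W[0])
    # r^T W: how strongly each input column writes along r.
    r_dot_cols = [sum(r[i] * W[i][j] for i in range(d_model)) for j in range(d_in)]
    # Subtract that component from every column.
    return [[W[i][j] - r[i] * r_dot_cols[j] for j in range(d_in)]
            for i in range(d_model)]

# Toy example: 3-dim residual stream, 2-dim input.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [1.0, 1.0, 0.0]
W_new = orthogonalize(W, r)

# After the edit, every column of W_new is orthogonal to r.
u = unit(r)
residual = [sum(u[i] * W_new[i][j] for i in range(3)) for j in range(2)]
print(residual)
```

On real tensors the same update would likely be written as `W - torch.outer(r, r @ W)`; that form is an assumption about how the edit would be applied here, not something taken from the existing scripts.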

## Anchors

- Base model: `Qwen/Qwen3.5-4B` (text branch only; 32 layers, d=2560, hybrid
  attention pattern, all projections `bias=None`)
- Architecture: `Qwen3_5ForConditionalGeneration` — wrap `model.model.language_model`
- Existing infra:
  - Docker network `gemma4-e4b-soyuz-agenttrove-qlora-r64_default`
  - terminus_runner.py at `~/runs/gemma4-e4b-soyuz-agenttrove-qlora-r64/terminus_runner.py`
  - vllm venv with transformers 5.5: `/home/alexw/venvs/vllm/bin/python3`
- Eval: 5 sprint tasks at `~/runs/.../qwen_sglang_eval/...` for reference numbers

## Unknowns / assumptions

- **Does a refusal direction exist for Qwen3.5 in agent context?** Unclear; the
  refusal corpus is usually unsafe/harmful prompts, but ours is "execute shell
  commands without context-justifying prefix". The refusal vs compliance set
  needs to be constructed carefully.
- **Which layers?** Arditi/failspy cookbook picks middle-late layers
  (~50% depth). For 32-layer Qwen3.5 that's layers 15-20 (roughly 47-63% depth).
- **Hybrid attention complication**: every 4th layer is `full_attention` with
  `self_attn.o_proj`, the rest are `linear_attention` with `linear_attn.out_proj`.
  Weight orthogonalization needs to cover BOTH paths.
- **MLP also**: `mlp.down_proj` should be orthogonalized to be thorough.
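
The hybrid-attention point above can be sketched as a small helper that lists which modules to edit per layer. The module names come from the bullets above; the `layer_types` list built in the example is an assumption (full attention at every 4th layer, counting from 1) and should be replaced by whatever the loaded model config actually reports. `residual_writers` is a hypothetical name.

```python
def residual_writers(layer_types, layers, include_mlp=True):
    """List module paths (relative to the text decoder) that write into the
    residual stream for the given layer indices.

    layer_types: per-layer strings, "full_attention" or "linear_attention".
    layers: iterable of layer indices to orthogonalize.
    """
    paths = []
    for i in layers:
        kind = layer_types[i]
        if kind == "full_attention":
            paths.append(f"layers.{i}.self_attn.o_proj")
        elif kind == "linear_attention":
            paths.append(f"layers.{i}.linear_attn.out_proj")
        else:
            raise ValueError(f"unknown layer type at layer {i}: {kind!r}")
        if include_mlp:
            paths.append(f"layers.{i}.mlp.down_proj")
    return paths

# ASSUMPTION: 32 layers, full attention on layers 3, 7, 11, ... (0-indexed).
# Read the real pattern from the model config rather than trusting this list.
assumed_layer_types = [
    "full_attention" if (i + 1) % 4 == 0 else "linear_attention"
    for i in range(32)
]

for path in residual_writers(assumed_layer_types, range(15, 21)):
    print(path)
```

Raising on an unknown layer type is deliberate: silently skipping a layer would leave one write path un-orthogonalized and make a null result hard to interpret.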

## Success criterion

1. Construct ≥40 refusal vs ≥40 compliance prompt pairs.
2. Compute per-layer refusal direction, pick best by trace-projection AUC.
3. Orthogonalize weights: `W_new = (I - r r^T) W`, with `r` normalized to unit
   length, for o_proj/out_proj/down_proj at the chosen layer(s).
4. Save modified safetensors, serve via HF, run docker tbench-2 sprint sweep
   N=3 per task.
5. **Pass criterion**: ≥1 task shows pass-rate improvement (Fisher's p < 0.10).
6. If angle 1 (abliteration) is null, pivot to per-task additive vectors
   (angle 2), then continue down the fallback chain.
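
Step 2 above can be sketched as follows, assuming the direction is the difference of class means (the usual abliteration recipe) and that layers are ranked by the AUC of activations projected onto it. All function names and the toy activations are illustrative; real inputs would be per-layer residual activations collected from the refusal and compliance prompts.

```python
import math

def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(refusal_acts, comply_acts):
    """Unit vector pointing from the compliance mean to the refusal mean."""
    diff = [a - b for a, b in zip(mean_vec(refusal_acts), mean_vec(comply_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to extract")
    return [x / norm for x in diff]

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative
    (ties count as half), i.e. the Mann-Whitney form of ROC AUC."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def layer_auc(refusal_acts, comply_acts):
    """AUC of projections onto the layer's own difference-of-means direction."""
    r = refusal_direction(refusal_acts, comply_acts)
    project = lambda v: sum(a * b for a, b in zip(v, r))
    return auc([project(v) for v in refusal_acts],
               [project(v) for v in comply_acts])

def best_layer(acts_by_layer):
    """Return (layer, auc) for the layer whose direction separates best."""
    scores = {layer: layer_auc(ref, com) for layer, (ref, com) in acts_by_layer.items()}
    layer = max(scores, key=scores.get)
    return layer, scores[layer]

# Toy activations (2-dim) for two candidate layers: (refusal, compliance).
toy = {
    15: ([[1.0, 0.0], [-1.0, 0.2]], [[0.0, 0.0], [0.0, 0.1]]),
    16: ([[2.0, 0.0], [3.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]),
}
print(best_layer(toy))
```

Scoring a direction on the same prompts used to fit it is optimistic; a held-out split of the prompt pairs would give a more honest ranking, especially given that the earlier additive vectors also separated cleanly without changing behaviour.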

## Honest expectations

After 9 prior null sweeps, my prior is **~30% that abliteration works**
(the refusal axis really is distinct from the capability/format axis), and
**~70% that this is also null**, yielding new failure-mode insights but no
behavioural lift. Both outcomes are valuable to document.