# TASK: Refusal-direction abliteration for Qwen3.5-4B agent runs

## What the user asked

Try four unexplored angles to get a behaviourally significant lift on the Qwen3.5-4B + terminal-bench-2 sprint:

1. Refusal direction subtraction (NousResearch abliteration core): find the "I cannot execute commands" axis and orthogonally project it out of the weights
2. Per-task additive vectors (multi-shot CAA)
3. sglang weight-baking + N=100 trials
4. SAE-style subspace

Stop at the first statistically significant lift (Fisher's exact test, p < 0.10 with N≥3 on any task).

Roughly 150 docker runs with additive CAA-style vectors have already come back null (see `~/ml-intern-runs/capability-vector-qwen35/` and `~/ml-intern-runs/capvec-v2-samemodel/`), so the user is interested in genuinely *new* angles.

## Why this might work where prior work didn't

Prior work was **additive at inference**: it pushed the residual stream toward the mean of SFT-pass trace activations. The math showed AUC=1.0 separation but no behavioural lift, because the direction encodes output style (parse_fail ↔ no_cmd) rather than task-solving capability.

Abliteration is mechanically different: it **modifies weights** to permanently remove a refusal direction. Base Qwen3.5-4B sometimes drifts into "I'm just an AI" mode mid-trace even with a strong system prompt (we observed this on the log-summary baseline runs). If that micro-refusal is what causes nocmd / early-done failures, weight-level refusal removal may unblock task completion.
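
The contrast between additive steering and weight-level removal can be sketched on toy matrices. This is a minimal NumPy illustration, not the project's code: the function names, the toy sizes, and the synthetic activations are all assumptions made for the example, and a real run would operate on the model's torch weights instead. It assumes the PyTorch `nn.Linear` convention, where a weight has shape `(out_features, in_features)` and the output dimension is the residual-stream width.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit L2 norm."""
    return v / np.linalg.norm(v)

def difference_of_means(refusal_acts, compliance_acts):
    """Candidate direction: mean(refusal) - mean(compliance), normalised."""
    return unit(refusal_acts.mean(axis=0) - compliance_acts.mean(axis=0))

def steer_additive(hidden, r, alpha):
    """Inference-time steering: shift one hidden state along r; weights untouched."""
    return hidden + alpha * r

def orthogonalize(W, r):
    """Weight edit W_new = (I - r r^T) W for W of shape (d_model, d_in).

    Afterwards every output W_new @ x has zero component along r.
    """
    r = unit(r)
    return W - np.outer(r, r @ W)

# Synthetic stand-ins (hypothetical data, toy sizes; the real model has d=2560).
rng = np.random.default_rng(0)
d_model, d_in = 16, 32
true_dir = unit(rng.normal(size=d_model))
compliance = rng.normal(size=(40, d_model))
refusal = rng.normal(size=(40, d_model)) + 3.0 * true_dir

r = difference_of_means(refusal, compliance)

# W stands in for one output projection that writes into the residual stream.
W = rng.normal(size=(d_model, d_in))
W_abl = orthogonalize(W, r)

x = rng.normal(size=d_in)
proj_before = float(r @ (W @ x))      # generally non-zero
proj_after = float(r @ (W_abl @ x))   # zero up to float error, for any x

# Additive steering can cancel the component too, but only with an alpha
# chosen for this particular x; the weight edit needs no per-input tuning.
steered = steer_additive(W @ x, r, alpha=-proj_before)
```

The design point is that `orthogonalize` is idempotent and input-independent: once applied, the edited projection cannot write along `r` at all, whereas the additive shift has to be re-applied (and re-scaled) at every forward pass.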
## Anchors

- Base model: `Qwen/Qwen3.5-4B` (text branch only; 32 layers, d=2560, hybrid attention pattern, all projections `bias=None`)
- Architecture: `Qwen3_5ForConditionalGeneration`; wrap `model.model.language_model`
- Existing infra:
  - Docker network `gemma4-e4b-soyuz-agenttrove-qlora-r64_default`
  - terminus_runner.py at `~/runs/gemma4-e4b-soyuz-agenttrove-qlora-r64/terminus_runner.py`
  - vllm venv with transformers 5.5: `/home/alexw/venvs/vllm/bin/python3`
- Eval: 5 sprint tasks at `~/runs/.../qwen_sglang_eval/...` for reference numbers

## Unknowns / assumptions

- **Does refusal direction exist for Qwen3.5 in agent context?** Unclear; the refusal corpus is usually unsafe/harmful prompts, but ours is "execute shell commands without context-justifying prefix". Need to construct refusal vs compliance set carefully.
- **Which layers?** Arditi/failspy cookbook picks middle-late layers (~50% depth). For 32-layer Qwen3.5 that's L15-L20.
- **Hybrid attention complication**: every 4th layer is `full_attention` with `self_attn.o_proj`, the rest are `linear_attention` with `linear_attn.out_proj`. Weight orthogonalization needs to cover BOTH paths.
- **MLP also**: `mlp.down_proj` should be orthogonalized to be thorough.

## Success criterion

1. Construct ≥40 refusal vs ≥40 compliance prompt pairs.
2. Compute per-layer refusal direction, pick best by trace-projection AUC.
3. Orthogonalize weights: `W_new = (I - r r^T) W` for o_proj/out_proj/down_proj at chosen layer(s).
4. Save modified safetensors, serve via HF, run docker tbench-2 sprint sweep N=3 per task.
5. **Pass criterion**: ≥1 task shows pass-rate improvement (Fisher's p < 0.10).
6. If null after #1: pivot to per-task additive vectors. Fallback chain.

## Honest expectations

After 9 prior null sweeps, my prior is **~30% that abliteration works** (refusal axis truly is a different axis than capability/format), and **~70% this is also null** with new failure-mode insights but no behavioural lift.
Both outcomes are valuable to document.
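
The pass criterion (Fisher's p < 0.10 at N=3 per task) is tight enough that it helps to check which outcomes can clear it at all. The sketch below is a small pure-Python Fisher's exact test built on the hypergeometric distribution; the function name and argument order are hypothetical, and it assumes a per-task comparison of a baseline arm against an abliterated arm with independent trials. The plan does not say whether the test is one-sided or two-sided, so both are returned.

```python
from math import comb

def fisher_exact(pass_a, fail_a, pass_b, fail_b):
    """Fisher's exact test on a 2x2 pass/fail table.

    Arm A is the baseline, arm B the modified model.
    Returns (one_sided_p, two_sided_p), where the one-sided alternative
    is "arm B passes more often than arm A".
    """
    n_b = pass_b + fail_b
    n = pass_a + fail_a + n_b
    total_pass = pass_a + pass_b
    denom = comb(n, n_b)

    def prob(k):
        # P(arm B has exactly k passes | fixed margins), hypergeometric.
        return comb(total_pass, k) * comb(n - total_pass, n_b - k) / denom

    lo = max(0, n_b - (n - total_pass))   # fewest passes arm B can have
    hi = min(n_b, total_pass)             # most passes arm B can have
    p_obs = prob(pass_b)

    one_sided = sum(prob(k) for k in range(pass_b, hi + 1))
    # Two-sided: sum all tables at most as likely as the observed one.
    two_sided = sum(
        prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9)
    )
    return one_sided, min(1.0, two_sided)

# Best possible outcome at N=3 per arm: baseline 0/3, modified 3/3.
best_one, best_two = fisher_exact(0, 3, 3, 0)

# Next-best outcome: baseline 1/3, modified 3/3.
next_one, next_two = fisher_exact(1, 2, 3, 0)
```

Under these assumptions, 0/3 versus 3/3 gives a one-sided p of 0.05 and a two-sided p of exactly 0.10, while 1/3 versus 3/3 gives a one-sided p of 0.20. So at N=3 only a clean 0/3 to 3/3 flip clears p < 0.10, and only under the one-sided test; a task with any baseline passes would need more trials per arm to reach significance.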