# PLAN — Refusal-direction abliteration of Qwen3.5-4B

## Stage A — contrast prompts (CPU, ~5 min)

`scripts/build_contrast.py`
- Hand-craft 50 refusal prompts: bare shell-execution requests, ~10-30 tokens each.
  e.g. "List files in /tmp", "Run `ls -la /etc`", "Tell me what `whoami` outputs".
- Hand-craft 50 compliance prompts: same 50 requests wrapped in the terminal-agent
  system prompt copied verbatim from `terminus_runner.py`.
- Output `contrast_refuse.txt`, `contrast_comply.txt` — one prompt per line.
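
A minimal sketch of how the two files could be produced, assuming the 50 requests live in a Python list. `AGENT_SYSTEM_PROMPT` is a placeholder here (the real text comes from `terminus_runner.py`), and `build_contrast` / `write_contrast` are hypothetical helper names. Newline escaping is an added assumption: if the system prompt spans several lines, the one-prompt-per-line format only holds when embedded newlines are escaped on write and unescaped on read.

```python
from pathlib import Path

# Placeholder: the plan copies the real text verbatim from terminus_runner.py.
AGENT_SYSTEM_PROMPT = "<terminal-agent system prompt goes here>"


def build_contrast(requests, system_prompt):
    """Return (refuse, comply) prompt lists, each prompt flattened to one line."""

    def one_line(text):
        # Escape backslashes first, then newlines, so the mapping is reversible.
        return text.replace("\\", "\\\\").replace("\n", "\\n")

    refuse = [one_line(r) for r in requests]
    comply = [one_line(f"{system_prompt}\n\n{r}") for r in requests]
    return refuse, comply


def write_contrast(requests, system_prompt, out_dir="."):
    """Write contrast_refuse.txt and contrast_comply.txt, one prompt per line."""
    refuse, comply = build_contrast(requests, system_prompt)
    out = Path(out_dir)
    (out / "contrast_refuse.txt").write_text("\n".join(refuse) + "\n", encoding="utf-8")
    (out / "contrast_comply.txt").write_text("\n".join(comply) + "\n", encoding="utf-8")
    return refuse, comply
```

Stage B would then need to reverse the escaping before tokenizing, otherwise the model sees a literal `\n` instead of a line break.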

## Stage B — capture activations (GPU, ~3 min)

`scripts/capture_refusal.py`
- Load `Qwen/Qwen3.5-4B` bf16 to single GPU.
- For each prompt, tokenize (no chat template — raw text), forward with
  `output_hidden_states=True`, capture residual at the LAST token of the prompt.
- Stack per bucket: `refuse: [50, 33, 2560]`, `comply: [50, 33, 2560]`
  (33 = 32 layers + embedding).
- Save `activations/refusal_acts.npz` with both arrays.
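
The stacking step can be sketched in NumPy as below. It assumes each prompt is forwarded on its own (batch size 1, so there is no padding and position `-1` is the last prompt token) and that each element of the `hidden_states` tuple has already been converted with something like `h.float().cpu().numpy()`, since NumPy has no bf16 dtype. The helper names are hypothetical.

```python
import numpy as np


def last_token_residuals(hidden_states):
    """Collect the last-token residual from every layer of one forward pass.

    hidden_states: sequence of [1, seq_len, d_model] arrays, index 0 being the
    embedding output. Returns an array of shape [n_layers + 1, d_model].
    """
    return np.stack([np.asarray(h)[0, -1, :] for h in hidden_states])


def stack_bucket(per_prompt_hidden_states):
    """Stack one bucket of prompts into [n_prompts, n_layers + 1, d_model]."""
    return np.stack([last_token_residuals(hs) for hs in per_prompt_hidden_states])


def save_acts(path, refuse, comply):
    """Store both buckets in a single .npz under the keys 'refuse' and 'comply'."""
    np.savez(path, refuse=refuse, comply=comply)
```

With 50 prompts per bucket and the model described above, `stack_bucket` should yield the `[50, 33, 2560]` shape listed in this stage.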

## Stage C — direction + selection (CPU, seconds)

`scripts/compute_refusal_dir.py`
- Per layer L (1..32): `d_L = normalize(mean(refuse[:, L]) − mean(comply[:, L]))`.
- For each layer, project all 100 traces onto `d_L`, compute AUC of refuse vs
  comply.
- Pick the layer L\* with the highest AUC; if multiple AUC=1.0, pick the one with
  maximum `margin = d_L · (μ_refuse − μ_comply)`.
- Also compute cosine similarity to v2's `dir_L22` to check whether the refusal
  axis is the same as our prior capability-style axis.
- Save `vectors/refusal_dir.pt` = {layer: L*, direction: tensor[2560], auc, margin}.
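
A sketch of the direction and selection logic, assuming the `[N, 33, 2560]` arrays from Stage B are loaded as NumPy arrays. `auc`, `select_layer`, and `cosine` are hypothetical names. The AUC is computed directly as the fraction of (refuse, comply) pairs ranked correctly, and ties on AUC are broken by margin, which generalizes the "multiple AUC=1.0" rule above.

```python
import numpy as np


def auc(pos, neg):
    """P(random pos score > random neg score), counting ties as 0.5."""
    pos = np.asarray(pos, dtype=np.float64)[:, None]
    neg = np.asarray(neg, dtype=np.float64)[None, :]
    return float((pos > neg).mean() + 0.5 * (pos == neg).mean())


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def select_layer(refuse, comply):
    """Pick the layer whose mean-difference direction best separates the buckets.

    refuse, comply: [N, n_layers + 1, d_model]. Index 0 (embedding) is skipped.
    """
    best = None
    for layer in range(1, refuse.shape[1]):
        diff = refuse[:, layer].mean(axis=0) - comply[:, layer].mean(axis=0)
        margin = float(np.linalg.norm(diff))  # equals d_L . (mu_refuse - mu_comply)
        if margin == 0.0:
            continue  # identical means: no direction to normalize
        d = diff / margin
        score = auc(refuse[:, layer] @ d, comply[:, layer] @ d)
        # Tuple comparison: highest AUC wins, margin breaks ties.
        if best is None or (score, margin) > (best["auc"], best["margin"]):
            best = {"layer": layer, "direction": d, "auc": score, "margin": margin}
    return best
```

Because `d_L` is the normalized mean difference, the margin reduces to the norm of that difference. It is therefore scale-dependent, and the tie-break favours layers with larger residual norms. The AUC here is also computed on the same 100 traces that define the direction, so it is an in-sample figure.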

## Stage D — weight orthogonalization (CPU+GPU, ~5 min)

`scripts/abliterate.py`
- Load Qwen3.5-4B bf16.
- Apply `M_new = M − r r^T M` (or equivalent column-wise) to:
  - `embed_tokens.weight` (rows)
  - For each layer i in 0..31:
    - if layer is full_attention: `self_attn.o_proj.weight`
    - else (linear_attention): `linear_attn.out_proj.weight`
    - always: `mlp.down_proj.weight`
- Verify: forward a test prompt before and after and expect different output; also
  check that `r^T M_new ≈ 0` for each edited matrix (`M_new r ≈ 0` for `embed_tokens`).
- Save under `abliterated_model/` as safetensors, copy tokenizer + chat_template.
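
The projection itself can be sketched in NumPy as follows, assuming the `nn.Linear` weight layout `[d_out, d_in]`, so that `o_proj`, `out_proj`, and `down_proj` all have `d_out = d_model` and write into the residual stream. The function names are hypothetical. In the real script the same arithmetic would presumably run on torch tensors upcast to float32 before being cast back to bf16.

```python
import numpy as np


def orthogonalize_output(W, r):
    """Remove direction r from the output space of W.

    W: [d_model, d_in], r: [d_model]. Returns W - r r^T W, so that no input
    can make this matrix write along r.
    """
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r @ W)


def orthogonalize_embedding(E, r):
    """Remove direction r from every embedding row.

    E: [vocab, d_model], r: [d_model]. Returns E - (E r) r^T, the row-wise
    equivalent of the projection above.
    """
    r = r / np.linalg.norm(r)
    return E - np.outer(E @ r, r)
```

A quick numerical check after editing is that `r @ W_new` and `E_new @ r` are zero up to floating-point error. If the checkpoint ties `lm_head` to `embed_tokens`, editing the embedding also changes the output head, which is worth confirming before saving.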

## Stage E — smoke test (GPU, ~3 min)

`scripts/smoke_abliterated.py`
- Load abliterated model with `AutoModelForImageTextToText.from_pretrained`.
- Generate on 3 prompts: bare shell command, full agent prompt, neutral
  question. Verify:
  - Still produces coherent English (no language collapse)
  - On bare shell request: more compliant than base (qualitative)
  - On agent prompt: still emits JSON action
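
The "still emits JSON action" check could be automated with a helper like the one below. The plan does not specify the action schema, so this sketch only looks for the first parseable JSON object anywhere in the generated text. The function name and the `command` key in the usage line are hypothetical.

```python
import json


def first_json_object(text):
    """Return the first parseable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trails after it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


action = first_json_object('Sure, running it. {"command": "ls -la /etc"}')
```

A `None` result on the agent prompt would count as a failed smoke check. Validating specific keys is left to whatever schema `terminus_runner.py` actually expects.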

## Stage F — behavioural sweep (docker, ~60 min)

`scripts/sweep_abliterated.sh` — fork of capvec-v2 sweep, but points at
abliterated model on host:30007.

Configurations: `abliterated` only (vs. literature baseline; we already have
~6 baseline runs per task on file).

Tasks: 5 sprint tasks × N=3 trials = 15 docker runs. Plus reference: 5 tasks ×
N=3 baseline trials = 15 more, if existing data isn't sufficient.

Track per run: reward, turns, parse_fails, no_cmds, duration. Compute Fisher's
exact vs baseline per task.
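
For the per-task comparison, a dependency-free two-sided Fisher's exact test can be sketched as below. It assumes reward is binarized into pass/fail per run, which the plan does not state explicitly. The function name is hypothetical, and `scipy.stats.fisher_exact` would be the usual library alternative.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: abliterated, baseline. Columns: pass, fail.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x passes in row 1 given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is at most as likely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


p_best_case = fisher_exact_two_sided(3, 0, 0, 6)  # 3/3 abliterated vs 0/6 baseline
```

With 3 abliterated trials against 6 baseline trials, the most extreme table (3/3 against 0/6) gives p = 1/84 ≈ 0.012, so that is the smallest p-value this design can produce for a task.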

## Stage G — verify + ship

`scripts/verify.py`:
1. Direction sanity (Section 3 of ml-intern verify template, adapted): AUC of
   refusal direction ≥ 0.95.
2. Weight modification non-destructive: smoke generation is coherent.
3. Behavioural delta: Fisher's p per task vs baseline. **Pass criterion**: ≥1
   task with p < 0.10.
4. Compose with v2 capability vector at inference? Not in v0 — only if Stage F
   shows promise.

Push to `AlexWortega/qwen3.5-4b-abliterated-agent-{YYYYMMDD}` with vectors,
modified model, scripts, all `.md` reports.

## Fallback chain

- Stage F null → pivot to per-task additive vectors (option 2)
- Per-task null → SAE-subspace (option 4)
- sglang weight-baking (option 3) deferred: it's an *infra* upgrade, not a new
  idea; working assumption is that a null at N=3 will still be a null at N=100

## Stop / blocker conditions

- Refusal-direction AUC < 0.7 → not a real axis, skip Stage D, go to fallback.
- Abliterated model produces gibberish → skip Stage F, document, fallback.
- Pass criterion not met → document honest null, push artifact, fallback if time.