# PLAN: Refusal-direction abliteration of Qwen3.5-4B

## Stage A: contrast prompts (CPU, ~5 min)

`scripts/build_contrast.py`

- Hand-craft 50 refusal prompts: bare shell-execution requests, ~10-30 tokens each, e.g. "List files in /tmp", "Run `ls -la /etc`", "Tell me what `whoami` outputs".
- Hand-craft 50 compliance prompts: the same 50 requests wrapped in the terminal-agent system prompt copied verbatim from `terminus_runner.py`.
- Output `contrast_refuse.txt` and `contrast_comply.txt`, one prompt per line.

## Stage B: capture activations (GPU, ~3 min)

`scripts/capture_refusal.py`

- Load `Qwen/Qwen3.5-4B` in bf16 on a single GPU.
- For each prompt, tokenize the raw text (no chat template), run a forward pass with `output_hidden_states=True`, and capture the residual at the LAST token of the prompt.
- Stack per bucket: `refuse: [50, 33, 2560]`, `comply: [50, 33, 2560]` (33 = 32 layers + embedding).
- Save `activations/refusal_acts.npz` with both arrays.

## Stage C: direction + selection (CPU, seconds)

`scripts/compute_refusal_dir.py`

- Per layer L (1..32): `d_L = normalize(mean(refuse[:, L]) − mean(comply[:, L]))`.
- For each layer, project all 100 traces onto `d_L` and compute the AUC of refuse vs comply.
- Pick the layer L\* with the highest AUC; if several layers reach AUC=1.0, pick the one with maximum `margin = d_L · (μ_refuse − μ_comply)`.
- Also compute the cosine similarity to v2's `dir_L22` to check whether the refusal axis is the same as our prior capability-style axis.
- Save `vectors/refusal_dir.pt` = `{layer: L*, direction: tensor[2560], auc, margin}`.

## Stage D: weight orthogonalization (CPU+GPU, ~5 min)

`scripts/abliterate.py`

- Load Qwen3.5-4B bf16.
- Apply `M_new = M − r r^T M` (equivalently, project `r` out of each column), with `r` the unit direction from Stage C, to:
  - `embed_tokens.weight` (row-wise: `E_new = E − (E r) r^T`, since each row is a residual-stream vector)
  - For each layer i in 0..31:
    - if the layer is full_attention: `self_attn.o_proj.weight`
    - else (linear_attention): `linear_attn.out_proj.weight`
    - always: `mlp.down_proj.weight`
- Verify: forward a test prompt before and after the edit and expect the outputs to differ.
- Save under `abliterated_model/` as safetensors; copy the tokenizer and chat_template.

## Stage E: smoke test (GPU, ~3 min)

`scripts/smoke_abliterated.py`

- Load the abliterated model with `AutoModelForImageTextToText.from_pretrained`.
- Generate on 3 prompts: bare shell command, full agent prompt, neutral question. Verify:
  - Still produces coherent English (no language collapse)
  - On the bare shell request: more compliant than base (qualitative)
  - On the agent prompt: still emits a JSON action

## Stage F: behavioural sweep (docker, ~60 min)

`scripts/sweep_abliterated.sh`: a fork of the capvec-v2 sweep, pointed at the abliterated model on host:30007.

- Configurations: `abliterated` only (vs. literature baseline; we already have ~6 baseline runs per task on file).
- Tasks: 5 sprint tasks × N=3 trials = 15 docker runs.
- Plus reference: baseline on the same 5 tasks × N=3 trials = 15 more runs, if existing data isn't sufficient.
- Track per run: reward, turns, parse_fails, no_cmds, duration.
- Compute Fisher's exact test vs baseline per task.

## Stage G: verify + ship

`scripts/verify.py`:

1. Direction sanity (Section 3 of the ml-intern verify template, adapted): AUC of refusal direction ≥ 0.95.
2. Weight modification non-destructive: smoke generation is coherent.
3. Behavioural delta: Fisher's p per task vs baseline. **Pass criterion**: ≥1 task with p < 0.10.
4. Compose with v2 capability vector at inference? Not in v0; only if Stage F shows promise.

Push to `AlexWortega/qwen3.5-4b-abliterated-agent-{YYYYMMDD}` with vectors, modified model, scripts, all `.md` reports.

## Fallback chain

- Stage F null → pivot to per-task additive vectors (option 2)
- Per-task null → SAE-subspace (option 4)
- sglang weight-baking (option 3) deferred: it's an *infra* upgrade, not a new idea; if abliteration shows null at N=3, it'll show null at N=100 too

## Stop / blocker conditions

- Refusal-direction AUC < 0.7 → not a real axis, skip Stage D, go to fallback.
- Abliterated model produces gibberish → skip Stage F, document, fallback.
- Pass criterion not met → document honest null, push artifact, fallback if time.
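
The Stage C selection rule (per-layer difference of means, AUC of the projections, margin as tie-break) can be sketched roughly as follows. This is a minimal NumPy sketch, not the contents of `scripts/compute_refusal_dir.py`: the function names `auc` and `select_refusal_direction` are hypothetical, the arrays are synthetic stand-ins for `refusal_acts.npz`, and it assumes the two class means differ at every layer (otherwise the normalization divides by zero).

```python
import numpy as np

def auc(pos, neg):
    """Probability that a random `pos` score exceeds a random `neg` score (ties count 0.5)."""
    pos = np.asarray(pos, dtype=np.float64)[:, None]
    neg = np.asarray(neg, dtype=np.float64)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

def select_refusal_direction(refuse, comply):
    """refuse, comply: [n_prompts, n_layers + 1, d_model]; index 0 is the embedding output."""
    best = None
    for layer in range(1, refuse.shape[1]):
        diff = refuse[:, layer].mean(axis=0) - comply[:, layer].mean(axis=0)
        # Because d is the normalized diff, d . (mu_refuse - mu_comply) is just ||diff||.
        margin = float(np.linalg.norm(diff))
        d = diff / margin
        score = auc(refuse[:, layer] @ d, comply[:, layer] @ d)
        # Tuple comparison: highest AUC wins, margin breaks ties.
        if best is None or (score, margin) > (best["auc"], best["margin"]):
            best = {"layer": layer, "direction": d, "auc": score, "margin": margin}
    return best

# Synthetic stand-in data (hypothetical sizes, not the real [50, 33, 2560] arrays).
rng = np.random.default_rng(0)
n, layers, dim = 50, 4, 16
refuse = rng.normal(size=(n, layers + 1, dim))
comply = rng.normal(size=(n, layers + 1, dim))
refuse[:, 3, 0] += 20.0  # plant a strong separation at layer 3
result = select_refusal_direction(refuse, comply)
print(result["layer"], result["auc"], round(result["margin"], 2))
```

Note that the AUC here is computed on the same 100 traces that define the direction, so it is an in-sample figure; with 2560 dimensions and 50 prompts per bucket it may be optimistic.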
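
For Stage D, the two projection forms can be checked on small random matrices before touching real weights. The sketch below uses NumPy and hypothetical helper names (`project_out_of_outputs`, `project_out_of_rows`); it relies on the fact that `torch.nn.Linear` stores its weight as `[out_features, in_features]`, so for `o_proj`, `out_proj`, and `down_proj` the first axis is assumed to be the residual dimension.

```python
import numpy as np

def project_out_of_outputs(weight, r):
    """Weight of shape [d_model, d_in] writing into the residual stream: W - r r^T W."""
    r = r / np.linalg.norm(r)
    return weight - np.outer(r, r @ weight)

def project_out_of_rows(embedding, r):
    """Embedding of shape [vocab, d_model], rows are residual vectors: E - (E r) r^T."""
    r = r / np.linalg.norm(r)
    return embedding - np.outer(embedding @ r, r)

# Toy shapes standing in for the real matrices (hypothetical sizes).
rng = np.random.default_rng(0)
d_model, d_in, vocab = 8, 5, 11
r = rng.normal(size=d_model)
W = rng.normal(size=(d_model, d_in))
E = rng.normal(size=(vocab, d_model))

W_new = project_out_of_outputs(W, r)
E_new = project_out_of_rows(E, r)

# After the edit, nothing these matrices write has a component along r.
r_unit = r / np.linalg.norm(r)
print(np.abs(r_unit @ W_new).max(), np.abs(E_new @ r_unit).max())
```

Both residuals should be at floating-point noise level. On the real model it may be worth doing the arithmetic in float32 and casting back to bf16 afterwards, since bf16 rounding would leave a small component along `r`; that is a suggestion, not something the plan specifies.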