# Honest re-evaluation: abliterated vs base on **matched infrastructure**

## Setup

The original report compared the abliterated model against the historic `base_full_tbench` snapshot, which ran on the un-patched terminus parser. After patching the parser to accept split-JSON output and re-running the abliterated bench, an apples-to-apples comparison required re-running base under the **same** (patched-parser + sglang + 6-parallel) infrastructure.

Both runs:

- 89 terminal-bench-2 tasks
- N = 1 trial per task
- max-turns = 20, max-tokens = 4096
- Same docker image, same terminus_runner.py (patched)
- Same sglang backend on `gemma4-...default` network, sglang `lmsysorg/sglang:dev`

## Headline

| model | passes / 89 | rate |
|-------------|:-----------:|:-----:|
| Qwen3.5-4B base (patched-parser / sglang) | 6 | 6.74 % |
| **abliterated** (this run) | **7** | **7.87 %** |

Fisher's exact (7/89 vs 6/89): **one-sided p = 0.50** (two-sided p = 1.0), i.e. null on the aggregate count. The previously reported lift over the historic 3/89 baseline was driven by the parser upgrade, not by abliteration. The patch alone (accept `{"analysis":...}` + `{"command":...}` as separate JSON objects) doubled the base pass rate, from 3/89 to 6/89.

## Per-task pattern: still informative

```
abl ∩ base : git-leak-recovery, kv-store-grpc, modernize-scientific-stack   (3)
abl only   : fix-git, hf-model-inference, log-summary-date-ranges, qemu-startup (4)
base only  : build-pmars, portfolio-optimization, sqlite-with-gcov          (3)
```

Abliteration **redistributes** which tasks the model solves rather than lifting the aggregate count. The net +1 task is well within trial-to-trial noise on stochastic agent tasks, especially at N = 1 trial per task.

## What this tells us

1. **Refusal direction is real.** In the smoke test, base says "I'm an AI cannot..." on bare shell requests, while abliterated says "Let me use ls". The intervention *changes behavior*.
2. **Behavior change ≠ net capability lift.** Removing the refusal axis helps some tasks (those where the model sometimes drifts into "I shouldn't actually do this") and hurts others (possibly because the same axis carried something task-useful, such as caution that avoids destructive `rm` mistakes).
3. **Patched parser was the biggest single improvement**, doubling base from 3/89 → 6/89 by accepting split-JSON output. This is an *infrastructure* win, not a *model* win, and it applies to all configurations equally.

## Cumulative project conclusion

Across ~200 docker runs in this project:

- Additive CAA-style vectors (v1, v2): null on aggregate; they encode output style
- Negative steering: null on aggregate; removes the nocmd mode but doesn't lift the pass count
- Weight-orthogonalization abliteration: null on aggregate when fairly compared, but redistributes the solvable task set
- Patched parser (split-JSON acceptance): +3 passes on base (3/89 → 6/89), an infrastructure win

**Task-solving capability is NOT a single accessible direction in the residual stream** for base Qwen3.5-4B. Multiple residual-axis interventions change behavior measurably but don't move the aggregate task-solving metric. True capability lifts would likely need gradient-based methods (SFT/LoRA, RLHF).

## Files

```
results/fullpar_20260515_055509/   abliterated full bench (7/89)
results/fullpar_20260515_081803/   base full bench (6/89, same infra)
```
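
The headline statistics can be re-derived from the counts alone. The sketch below is not the script used for this report; it is a minimal stdlib-only illustration that recomputes Fisher's exact test for 7/89 vs 6/89 and the per-task overlap from the task names listed under "Per-task pattern". The function names (`hypergeom_pmf`, `fisher_exact_2x2`) and the 1e-9 tie tolerance are assumptions made here for illustration.

```python
from math import comb


def hypergeom_pmf(k, total, successes, draws):
    """P(exactly k successes among `draws` items drawn without replacement)."""
    return comb(successes, k) * comb(total - successes, draws - k) / comb(total, draws)


def fisher_exact_2x2(pass_a, n_a, pass_b, n_b):
    """Return (one_sided, two_sided) Fisher exact p-values for two pass counts.

    one_sided: P(group A gets >= pass_a passes | margins fixed).
    two_sided: sum of probabilities of all tables no more likely than the observed one.
    """
    total = n_a + n_b
    successes = pass_a + pass_b
    lo = max(0, successes - n_b)
    hi = min(successes, n_a)
    pmf = {k: hypergeom_pmf(k, total, successes, n_a) for k in range(lo, hi + 1)}
    observed = pmf[pass_a]
    one_sided = sum(p for k, p in pmf.items() if k >= pass_a)
    # Small relative tolerance so exact ties are not dropped by float rounding.
    two_sided = sum(p for p in pmf.values() if p <= observed * (1 + 1e-9))
    return one_sided, min(1.0, two_sided)


# Aggregate comparison: abliterated 7/89 vs base 6/89 on matched infrastructure.
one_sided, two_sided = fisher_exact_2x2(7, 89, 6, 89)
print(f"one-sided p = {one_sided:.2f}")  # 0.50
print(f"two-sided p = {two_sided:.2f}")  # 1.00

# Per-task redistribution, using the task names from the report.
abl = {
    "git-leak-recovery", "kv-store-grpc", "modernize-scientific-stack",
    "fix-git", "hf-model-inference", "log-summary-date-ranges", "qemu-startup",
}
base = {
    "git-leak-recovery", "kv-store-grpc", "modernize-scientific-stack",
    "build-pmars", "portfolio-optimization", "sqlite-with-gcov",
}
print("abl ∩ base:", sorted(abl & base))
print("abl only  :", sorted(abl - base))
print("base only :", sorted(base - abl))
```

With equal group sizes and an odd total of 13 passes, the one-sided value is exactly 0.5 by symmetry, so the 7 vs 6 split carries no evidence either way. The set differences show where the behavior change actually appears: 4 tasks gained and 3 lost.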