# Honest re-evaluation — abliterated vs base on **matched infrastructure**

## Setup

The original report compared the abliterated model against the historic
`base_full_tbench` snapshot, which ran on the unpatched terminus parser. After
the parser was patched to accept split-JSON output and the abliterated bench was
re-run, a like-for-like comparison required re-running base under the **same**
infrastructure (patched parser + sglang + 6-parallel).

Both runs:
- 89 terminal-bench-2 tasks
- N = 1 trial per task
- max-turns = 20, max-tokens = 4096
- Same docker image, same terminus_runner.py (patched)
- Same sglang backend on `gemma4-...default` network, sglang `lmsysorg/sglang:dev`

## Headline

| model       | passes / 89 | rate  |
|-------------|:-----------:|:-----:|
| Qwen3.5-4B base (patched-parser / sglang) | 6 | 6.74 % |
| **abliterated** (this run)                | **7** | **7.87 %** |

Fisher's exact test (7/89 vs 6/89): **p = 0.50** one-sided (two-sided p = 1.00), a null result on the aggregate count.
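
The p-value above can be re-derived from the four counts alone. The following is a minimal sketch using only the standard library; the function name `fisher_exact_2x2` is illustrative and is not part of this project's tooling. It assumes the one-sided alternative "abliterated passes more often than base".

```python
from math import comb

def fisher_exact_2x2(pass_a, n_a, pass_b, n_b):
    """Return (one_sided_p, two_sided_p); one-sided H1 is 'A passes more often than B'."""
    total_pass = pass_a + pass_b
    denom = comb(n_a + n_b, total_pass)

    def prob(k):
        # Hypergeometric probability that group A holds exactly k of the passes.
        return comb(n_a, k) * comb(n_b, total_pass - k) / denom

    lo = max(0, total_pass - n_b)
    hi = min(n_a, total_pass)
    observed = prob(pass_a)
    one_sided = sum(prob(k) for k in range(pass_a, hi + 1))
    # Two-sided: add up every table that is no more likely than the observed one.
    two_sided = sum(
        prob(k) for k in range(lo, hi + 1) if prob(k) <= observed * (1 + 1e-9)
    )
    return one_sided, min(1.0, two_sided)

one_sided, two_sided = fisher_exact_2x2(7, 89, 6, 89)
print(f"abliterated: {7 / 89:.2%}, base: {6 / 89:.2%}")
print(f"one-sided p = {one_sided:.2f}, two-sided p = {two_sided:.2f}")
```

Because both arms have 89 tasks and the total number of passes (13) is odd, the one-sided tail is exactly one half by symmetry, so neither version of the test gives evidence of a difference.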

The previously reported lift over the historic 3/89 baseline was largely an
artifact of the parser upgrade rather than an effect of abliteration. The patch
alone (accepting `{"analysis":...}` and `{"command":...}` as separate JSON
objects) doubled the base pass count from 3/89 to 6/89.
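
To illustrate what accepting split-JSON output can look like, here is a minimal sketch that merges consecutive top-level JSON objects into one dict. It is not the code in `terminus_runner.py`; the function name `parse_split_json` and the sample reply are hypothetical, and only the `analysis` / `command` keys come from the description above.

```python
import json

def parse_split_json(text):
    """Merge every top-level JSON object found in `text` into a single dict."""
    decoder = json.JSONDecoder()
    merged = {}
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value and reports where it ended.
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not a valid object here; keep scanning
            continue
        if isinstance(obj, dict):
            merged.update(obj)
        pos = end
    return merged

# Hypothetical model reply that emits two objects instead of one.
reply = '{"analysis": "List the files first."}\n{"command": "ls -la"}'
print(parse_split_json(reply))
```

A parser that expects exactly one object would reject the reply above, while this version treats it the same as a single `{"analysis": ..., "command": ...}` object. Later keys overwrite earlier ones on collision, which is a design choice of this sketch rather than documented behavior of the patch.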

## Per-task pattern — still informative

```
abl ∩ base   : git-leak-recovery, kv-store-grpc, modernize-scientific-stack (3)
abl only     : fix-git, hf-model-inference, log-summary-date-ranges, qemu-startup (4)
base only    : build-pmars, portfolio-optimization, sqlite-with-gcov          (3)
```

Abliteration **redistributes** which tasks the model solves rather than lifting
the aggregate count. Net +1 task is well within trial-to-trial noise on
stochastic agent tasks.
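
The overlap table can be reproduced with plain set operations on the task names listed above. The sketch below also runs an exact two-sided sign test on the discordant tasks (4 abliterated-only vs 3 base-only); that paired check is an addition for illustration, not something the original analysis reports.

```python
from math import comb

abl_pass = {
    "git-leak-recovery", "kv-store-grpc", "modernize-scientific-stack",
    "fix-git", "hf-model-inference", "log-summary-date-ranges", "qemu-startup",
}
base_pass = {
    "git-leak-recovery", "kv-store-grpc", "modernize-scientific-stack",
    "build-pmars", "portfolio-optimization", "sqlite-with-gcov",
}

both = abl_pass & base_pass        # solved by both models
abl_only = abl_pass - base_pass    # solved only after abliteration
base_only = base_pass - abl_pass   # solved only by base

def sign_test_two_sided(a, b):
    """Exact two-sided binomial test that discordant wins split 50/50."""
    n, k = a + b, max(a, b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(len(both), len(abl_only), len(base_only))
print(sign_test_two_sided(len(abl_only), len(base_only)))
```

With only seven discordant tasks split 4 to 3, the paired test is as uninformative as the aggregate one, which is consistent with reading the net +1 as noise.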

## What this tells us

1. **The refusal direction is real** (in the smoke test, base replies "I'm an AI
   cannot..." to bare shell requests, while abliterated replies "Let me use ls").
   The intervention *changes behavior*.
2. **Behavior change ≠ net capability lift.** Removing the refusal axis helps
   some tasks (those where the model sometimes drifts into "I shouldn't actually
   do this") and hurts others (possibly the same axis carried something
   task-useful, like caution that avoids destructive `rm` mistakes).
3. **The patched parser was the biggest single improvement**, doubling base from
   3/89 → 6/89 by accepting split-JSON output. This is an *infrastructure*
   win, not a *model* win: it applies to every configuration equally.

## Cumulative project conclusion

Across ~200 docker runs in this project:
- Additive CAA-style vectors (v1, v2): null on aggregate, encode output style
- Negative steering: null on aggregate; removes the nocmd mode but doesn't lift the pass rate
- Weight-orthogonalization abliteration: null on aggregate when fairly compared,
  but redistributes the solvable task set
- Patched parser (split-JSON acceptance): +3 passes on base (3/89 → 6/89, about
  +3.4 percentage points), an infrastructure win
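
For readers unfamiliar with the two families of interventions in the list above, the sketch below contrasts them on toy data: additive steering shifts an activation by a scaled vector, while weight orthogonalization removes a direction from a matrix that writes into the residual stream. This is the generic textbook form under toy assumptions, not this project's code; all names, shapes, and values are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def add_steering(h, v, alpha):
    """Additive steering: h' = h + alpha * v (negative alpha steers away from v)."""
    return [hi + alpha * vi for hi, vi in zip(h, v)]

def orthogonalize(W, r):
    """Weight orthogonalization: W' = W - r r^T W, with r normalized to unit length."""
    r = unit(r)
    n_rows, n_cols = len(W), len(W[0])
    # coeffs[j] is the component of column j along r.
    coeffs = [sum(r[i] * W[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[W[i][j] - r[i] * coeffs[j] for j in range(n_cols)] for i in range(n_rows)]

def matvec(W, x):
    return [dot(row, x) for row in W]

# Toy example: a 3x2 matrix writing into a 3-dimensional "residual stream".
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [1.0, 1.0, 0.0]   # toy direction standing in for a refusal direction
x = [0.5, -1.0]

W_abl = orthogonalize(W, r)
print(dot(matvec(W, x), unit(r)))      # component along r before the edit
print(dot(matvec(W_abl, x), unit(r)))  # numerically zero after the edit
```

Steering changes activations at inference time and can be scaled or switched off, whereas orthogonalization edits the weights once so that the matrix can no longer write along the removed direction for any input.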

**Task-solving capability is NOT a single accessible direction in the residual
stream** for base Qwen3.5-4B. Multiple residual-axis interventions change
behavior measurably but don't move the aggregate task-solving metric. A true
capability lift would likely need gradient-based methods (SFT/LoRA, RLHF).

## Files

```
results/fullpar_20260515_055509/  abliterated full bench (7/89)
results/fullpar_20260515_081803/  base       full bench (6/89, same infra)
```