# Persona Vector — null result, with insight

## Recipe
- Persona A (capable): "HIGHLY CAPABLE terminal expert who solves problems directly"
- Persona B (cautious): "OVERLY CAUTIOUS AI assistant who hedges and refuses"
- 30 task instructions used as user-side prompts
- Capture: last-prompt-token residual under each persona, base Qwen3.5-4B
- Direction = mean(capable) − mean(cautious), per layer
- AUC=1.0 on all middle layers; cos with v2-dir ≈ −0.05, cos with refusal-dir ≈ −0.02, i.e. nearly orthogonal to both, so a new axis
- Applied: same weight-ortho recipe as abliteration (NousResearch/mlabonne)
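
The direction step of the recipe can be sketched as below. This is a dependency-free illustration on toy 3-dimensional vectors, not the code used for the run. The function names and toy activations are assumptions, and so are the unit-normalization and the projection-based AUC, which are one plausible reading of the recipe. Real residuals would be per-layer tensors of the model's hidden size.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def persona_direction(capable, cautious):
    """Unit-norm difference of means: mean(capable) - mean(cautious)."""
    diff = [a - b for a, b in zip(mean_vector(capable), mean_vector(cautious))]
    norm = math.sqrt(dot(diff, diff))
    return [x / norm for x in diff]

def separation_auc(capable, cautious, direction):
    """Fraction of (capable, cautious) pairs ranked correctly by projection; ties count half."""
    wins = 0.0
    for a in capable:
        for b in cautious:
            pa, pb = dot(a, direction), dot(b, direction)
            wins += 1.0 if pa > pb else 0.5 if pa == pb else 0.0
    return wins / (len(capable) * len(cautious))

# Toy residuals for one layer (hypothetical values, 3 dims for readability).
capable = [[1.0, 0.2, 0.0], [0.8, -0.1, 0.1]]
cautious = [[-1.0, 0.1, 0.0], [-0.9, 0.0, -0.1]]

d = persona_direction(capable, cautious)
auc = separation_auc(capable, cautious, d)
cos_with_other = cosine(d, [0.0, 1.0, 0.0])  # stand-in for an earlier direction
```

With only 30 prompts per persona and a high-dimensional residual, a perfect AUC is easy to reach, so it says the personas are separable, not that the axis is causally relevant to task success.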

## Headline (matched-infra full tbench-2)

| model | passes/89 | rate |
|-------|:---:|:---:|
| abl (refusal-direction ortho) | 7 | 7.9 % |
| base (Qwen3.5-4B, same infra) | 6 | 6.7 % |
| **persona (capable-cautious ortho)** | **4** | **4.5 %** |

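Fisher's exact (one-sided, H1: persona > base): persona 4/89 vs base 6/89, p=0.84. No evidence of lift; the point estimate is a **regression**, though not a statistically significant one.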
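
Both p-values quoted in this note (0.84 here, 0.50 for abl vs base) are consistent with a one-sided Fisher's exact test whose alternative is "the intervention passes more tasks than base". A minimal standard-library sketch, assuming that is the test that was run; the function name is illustrative:

```python
from math import comb

def fisher_one_sided_greater(pass_a, n_a, pass_b, n_b):
    """P(group A gets >= pass_a passes) under the null, with both margins fixed.

    Hypergeometric tail: of pass_a + pass_b total passes, how many land in
    group A when n_a of the n_a + n_b runs belong to A.
    """
    total = n_a + n_b
    passes = pass_a + pass_b
    upper = min(passes, n_a)
    tail = sum(
        comb(passes, k) * comb(total - passes, n_a - k)
        for k in range(pass_a, upper + 1)
    )
    return tail / comb(total, n_a)

p_persona = fisher_one_sided_greater(4, 89, 6, 89)  # persona vs base, ≈ 0.84
p_abl = fisher_one_sided_greater(7, 89, 6, 89)      # abl vs base, = 0.50
print(f"persona vs base: p = {p_persona:.2f}")
print(f"abl vs base:     p = {p_abl:.2f}")
```

The two-sided version of the same test gives roughly 0.75 for 4 vs 6, so the drop itself is also within noise at this sample size.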

## Why ortho on persona axis hurt instead of helped

Ortho-projection is **symmetric**: `W_new = W − r r^T W` (for unit-norm `r`) removes content aligned
with +r and −r equally. The capable-cautious axis is bidirectional, so projecting out
the axis kills BOTH modes simultaneously.
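
The symmetry can be checked on a toy weight matrix. The sketch below uses plain Python lists and made-up values (a 2-dim output space, `r` along the first coordinate); it only illustrates the projection formula and is not the patching code used on the model.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(W, x):
    return [dot(row, x) for row in W]

def orthogonalize(W, r):
    """Return W - r r^T W for a unit vector r in W's output space."""
    n_out, n_in = len(W), len(W[0])
    # r^T W: how strongly each input column writes along r.
    r_W = [sum(r[k] * W[k][j] for k in range(n_out)) for j in range(n_in)]
    return [[W[i][j] - r[i] * r_W[j] for j in range(n_in)] for i in range(n_out)]

r = [1.0, 0.0]                # toy persona axis (unit norm)
W = [[1.0, -1.0, 0.3],        # row 0 writes along r
     [0.2,  0.5, 0.7]]        # row 1 writes orthogonally to r

x_capable = [1.0, 0.0, 0.0]   # input that W maps to +r
x_cautious = [0.0, 1.0, 0.0]  # input that W maps to -r

before = (dot(r, matvec(W, x_capable)), dot(r, matvec(W, x_cautious)))
W_new = orthogonalize(W, r)
after = (dot(r, matvec(W_new, x_capable)), dot(r, matvec(W_new, x_cautious)))
print(before, after)
```

Before the edit the two inputs project to opposite signs on `r`; afterwards both project to zero, while the component orthogonal to `r` is untouched.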

Refusal-direction abliteration worked because removing refusal *moves the residual
away from* refusal mode and into the orthogonal subspace (which includes compliance).
But for capable↔cautious, where the two modes are roughly opposite poles on the
same axis, removing the axis nukes both. The model defaults to whatever is left,
which in the traces looks like over-thinking followed by parse_fail.

## Trace inspection (typical persona failures)

- `circuit-fibsqrt` turn 0: 2000 chars of mathematical reasoning, never emits
  `</think>` or JSON command before max_tokens cap → parse_fail
- `caffe-cifar-10`: 935 chars of analysis, no closing `</think>` → parse_fail (3× in a row → 0/4
  steps + parse_fails → docker exits early)
- `fix-git`: starts well but turn 1 spends 740 chars on hedging analysis

The model became **more verbose** in its thinking phase, blowing through the
4096 max-tokens budget before emitting a single command.
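
The parse_fail pattern in these traces amounts to a turn that never closes its thinking block or never reaches a command. A rough sketch of such a check, assuming a turn looks like `<think>…</think>` followed by a JSON object; the `classify_turn` name and the `command` key are hypothetical, and the actual harness parser may differ:

```python
import json

def classify_turn(text):
    """Return 'ok' if the turn closes its thinking and then emits a JSON object with a command."""
    _, closed, rest = text.partition("</think>")
    if not closed:
        return "parse_fail"          # token cap hit while still thinking
    start = rest.find("{")
    if start == -1:
        return "parse_fail"          # thinking closed, but no JSON followed
    try:
        obj, _ = json.JSONDecoder().raw_decode(rest[start:])
    except json.JSONDecodeError:
        return "parse_fail"          # JSON truncated or malformed
    return "ok" if isinstance(obj, dict) and "command" in obj else "parse_fail"

truncated = "<think>Let f(n) be the n-th Fibonacci number. We need"
complete = '<think>List files first.</think>\n{"command": "ls -la"}'
print(classify_turn(truncated), classify_turn(complete))
```

Counting how many turns per task fall into the first branch would separate "ran out of budget while thinking" from "emitted a malformed command".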

## What works vs doesn't (project cumulative)

| approach | aggregate lift on tbench-2 | notes |
|---|:---:|---|
| v1 cross-model additive | null | output style |
| v2 same-model additive | null | output style |
| negative α steering | null | removes nocmd, no pass lift |
| multi-layer additive | null | none |
| abliteration (refusal ortho) | null* | redistributes solved tasks |
| **persona ortho (this)** | **regression** | symmetric ortho kills both modes |
| parser patch | +3 base | infrastructure win |

\*abl was nominally better in our single run per model (7 vs 6, one-sided Fisher
p=0.50), which is well within noise. The "5 abl-only tasks" pattern is the only
persistent signal across approaches.

## Implication for persona vectors

Persona vectors **CAN work** if applied via additive inference-time steering
(boost the +d direction rather than symmetrically orthogonalizing it). On our
infrastructure (sglang serving), a steering bias can't be added easily because
Qwen3.5 projections have bias=None. It would need one of:

1. Forking sglang's Qwen3_5 model code to support per-layer bias
2. OR running through HF server with hook (slow, no parallelism)
3. OR adding a "persona delta" via tokenization trick (embed a learnable
   prefix vector — basically prompt tuning)

None of these were tried in this run.
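
The reason bias support matters is that additive steering is an affine shift: adding `alpha * d` to a layer's output is the same as giving that layer a constant bias, which a bias-free projection cannot express in its weights. A toy sketch in plain Python, with hypothetical names and values, of what such a bias would do:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def linear(W, x, bias=None):
    """y = W x (+ bias). With bias=None there is no way to add a constant offset."""
    y = [dot(row, x) for row in W]
    if bias is not None:
        y = [yi + bi for yi, bi in zip(y, bias)]
    return y

def steering_bias(d, alpha):
    """Constant offset alpha * d that pushes the output toward +d."""
    return [alpha * di for di in d]

d = [1.0, 0.0]                      # toy unit-norm persona direction
W = [[1.0, -1.0], [0.5, 0.5]]       # toy projection writing to a 2-dim "residual"
alpha = 2.0                         # steering strength (hypothetical)

inputs = [[1.0, 0.0], [0.0, 1.0]]   # one input writes +d, the other writes -d
plain = [dot(d, linear(W, x)) for x in inputs]
steered = [dot(d, linear(W, x, steering_bias(d, alpha))) for x in inputs]
print(plain, steered)
```

Unlike orthogonalization, the shift is one-sided: every input's projection onto `d` moves by `+alpha`, so the "cautious" input is pushed toward the capable pole instead of both being collapsed to zero.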

## Final project verdict

Across ~280 docker runs and 12 distinct activation-space interventions:
- **No single approach produces a statistically significant aggregate lift on tbench-2**
- Multiple interventions measurably change *which* tasks are solved (redistribution)
- The most reliable "wins" are infrastructure improvements (parser patches,
  sglang batching), not residual-stream interventions
- **Task-solving capability does not appear to be a single direction in the residual stream** of
  base Qwen3.5-4B that is recoverable from agent-trace contrasts

This closes the line of research on activation-space capability vectors for
this model. For genuine lifts, gradient-based methods (SFT/LoRA/RLHF) are
indicated.