# Persona Vector: null result, with insight

## Recipe

- Persona A (capable): "HIGHLY CAPABLE terminal expert who solves problems directly"
- Persona B (cautious): "OVERLY CAUTIOUS AI assistant who hedges and refuses"
- 30 task instructions used as user-side prompts
- Capture: last-prompt-token residual under each persona, base Qwen3.5-4B
- Direction = mean(capable) − mean(cautious), per layer
- AUC=1.0 on all middle layers; cos with v2-dir ≈ −0.05, cos with refusal-dir ≈ −0.02, i.e. a new axis roughly orthogonal to both
- Applied: same weight-ortho recipe as abliteration (NousResearch/mlabonne)

## Headline (matched-infra full tbench-2)

| model | passes/89 | rate |
|-------|:---:|:---:|
| abl (refusal-direction ortho) | 7 | 7.9 % |
| base (Qwen3.5-4B, same infra) | 6 | 6.7 % |
| **persona (capable-cautious ortho)** | **4** | **4.5 %** |

Fisher's exact: persona 4/89 vs base 6/89, p=0.84 (the value matches a one-sided test for lift): **regression, not lift**.

## Why ortho on persona axis hurt instead of helped

Ortho-projection is **symmetric**: `W_new = W − r r^T W` (with unit-norm `r`) removes content aligned with +r and −r equally. The capable-cautious axis is bidirectional, so projecting out the axis kills BOTH modes simultaneously.

Refusal-direction abliteration worked because removing refusal *moves the residual away from* refusal mode toward the orthogonal subspace (which includes compliance). But for capable↔cautious, where the two modes are roughly opposite poles on the same axis, removing the axis nukes both. The model defaults to whatever is left, which seems to be over-thinking and parse_fail.
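The symmetry argument above can be checked numerically. The following is a minimal pure-Python sketch, not the project's code: the 3×3 matrix `W` and the direction `r` are made-up toy values, and `orthogonalize` is a hypothetical helper that applies `W_new = W − r r^T W` after normalizing `r` to unit length.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Return v scaled to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [dot(row, x) for row in W]

def orthogonalize(W, r):
    """Return W - r r^T W, where r is normalized first (hypothetical helper)."""
    r = normalize(r)
    rows, cols = len(W), len(W[0])
    # r^T W: how much each column of W writes along r
    rtW = [sum(r[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - r[i] * rtW[j] for j in range(cols)] for i in range(rows)]

# Toy values, chosen only for illustration
W = [[1.0, 2.0, 0.5],
     [0.0, -1.0, 3.0],
     [2.0, 0.0, 1.0]]
r = [1.0, -2.0, 2.0]
neg_r = [-x for x in r]

W_plus = orthogonalize(W, r)       # project out +r
W_minus = orthogonalize(W, neg_r)  # project out -r

# The two results coincide because (-r)(-r)^T equals r r^T
max_diff = max(abs(a - b)
               for row_a, row_b in zip(W_plus, W_minus)
               for a, b in zip(row_a, row_b))
print("max |W_plus - W_minus| =", max_diff)

# After the edit, no input can produce output along r, in either direction
r_hat = normalize(r)
for x in ([1.0, 0.0, 0.0], [-3.0, 2.0, 5.0]):
    before = dot(r_hat, matvec(W, x))
    after = dot(r_hat, matvec(W_plus, x))
    print("component along r: before =", before, "after =", after)
```

Whatever sign the original component along `r` had, it ends up at (numerically) zero, which is the sense in which both the capable and the cautious pole are removed together.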
## Trace inspection (typical persona failures)

- `circuit-fibsqrt` turn 0: 2000 chars of mathematical reasoning, never emits `` or JSON command before max_tokens cap → parse_fail
- `caffe-cifar-10`: 935 chars of analysis, no closing → parse_fail (3× in a row → 0/4 steps + parse_fails → docker exits early)
- `fix-git`: starts well but turn 1 spends 740 chars on hedging analysis

The model became **more verbose** in its thinking phase, blowing through the 4096 max-tokens budget before emitting a single command.

## What works vs doesn't (project cumulative)

| approach | aggregate lift on tbench-2 | notes |
|---|:---:|---|
| v1 cross-model additive | null | output style |
| v2 same-model additive | null | output style |
| negative α steering | null | removes nocmd, no pass lift |
| multi-layer additive | null | none |
| abliteration (refusal ortho) | null* | redistributes solved tasks |
| **persona ortho (this)** | **regression** | symmetric ortho kills both modes |
| parser patch | +3 base | infrastructure win |

\* abl was suggestively better in our single run (N=1; 7 vs 6, Fisher p=0.50) but well within noise. The "5 abl-only tasks" pattern is the only persistent signal across approaches.

## Implication for persona vectors

Persona vectors **can in principle work** if applied via additive inference-time steering (boost the +d direction rather than symmetrically orthogonalizing it). For our infrastructure (sglang serving), bias parameters can't be added easily because Qwen3.5 projections have bias=None. We would need one of:

1. Forking sglang's Qwen3_5 model code to support per-layer bias
2. Running through an HF server with a hook (slow, no parallelism)
3. Adding a "persona delta" via a tokenization trick (embed a learnable prefix vector, which is basically prompt tuning)

None of these were tried in this run.
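The contrast between additive steering and orthogonalization can be illustrated on a toy residual vector. This is a sketch under assumptions, not something run in this project: the direction `d`, the residuals `h_capable` and `h_cautious`, and the strength `alpha` are invented values, and `steer` and `project_out` are hypothetical helper names. It only shows the arithmetic that a per-layer bias or a forward hook would have to apply.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Return v scaled to unit L2 norm."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def component(h, d):
    """Signed length of h along the unit direction of d."""
    return dot(h, unit(d))

def steer(h, d, alpha):
    """Additive steering: shift h by alpha along unit(d) (hypothetical helper)."""
    d_hat = unit(d)
    return [x + alpha * y for x, y in zip(h, d_hat)]

def project_out(h, d):
    """Orthogonalization: remove the component of h along d (hypothetical helper)."""
    d_hat = unit(d)
    c = dot(h, d_hat)
    return [x - c * y for x, y in zip(h, d_hat)]

# Toy values, chosen only for illustration
d = [3.0, 0.0, 4.0]              # stands in for mean(capable) - mean(cautious)
h_capable = [1.5, 1.0, 2.0]      # residual with a positive component along d
h_cautious = [-1.5, 1.0, -2.0]   # residual with a negative component along d
alpha = 2.0                      # assumed steering strength

for name, h in (("capable", h_capable), ("cautious", h_cautious)):
    print(name,
          "original:", component(h, d),
          "steered:", component(steer(h, d, alpha), d),
          "projected out:", component(project_out(h, d), d))
```

Steering shifts both residuals toward the capable pole by the same amount `alpha` and keeps them distinguishable, whereas projecting out collapses both to zero along `d`. In a served model the shift `alpha * unit(d)` is a constant vector per layer, which is why it could be expressed as a bias term if the projections supported one.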
## Final project verdict

Across ~280 docker runs and 12 distinct activation-space interventions:

- **No single approach produces a stat-significant aggregate lift on tbench-2**
- Multiple interventions measurably change *which* tasks are solved (redistribution)
- The most reliable "wins" are infrastructure improvements (parser patches, sglang batching), not residual-stream interventions
- **Task-solving capability is not a single direction in residual stream** of base Qwen3.5-4B accessible from agent-trace contrasts

This closes the line of research on activation-space capability vectors for this model. For genuine lifts, gradient-based methods (SFT/LoRA/RLHF) are indicated.
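The Fisher's exact p-values quoted in this write-up (0.84 for persona vs base, 0.50 for abl vs base) can be reproduced with a short hypergeometric calculation. The sketch below is an assumption about how they were computed, not the project's script: `fisher_one_sided` is a hypothetical helper that tests the one-sided hypothesis that the first model passes more tasks than the second, which is the variant that appears to match the reported numbers.

```python
from math import comb

def fisher_one_sided(pass_a, n_a, pass_b, n_b):
    """One-sided Fisher's exact test (hypothetical helper).

    Returns the probability that model A gets at least pass_a passes
    when the total number of passes is fixed and passes are assigned
    to the two models at random (hypergeometric null).
    """
    total_pass = pass_a + pass_b
    denom = comb(n_a + n_b, total_pass)
    upper = min(n_a, total_pass)
    # Sum the hypergeometric probabilities of outcomes at least as
    # favorable to A as the observed one
    numer = sum(comb(n_a, k) * comb(n_b, total_pass - k)
                for k in range(pass_a, upper + 1))
    return numer / denom

# Pass counts taken from the tables in this write-up (89 tasks per model)
p_persona = fisher_one_sided(4, 89, 6, 89)  # persona vs base
p_abl = fisher_one_sided(7, 89, 6, 89)      # abl vs base

print(f"persona vs base: p = {p_persona:.2f}")
print(f"abl vs base:     p = {p_abl:.2f}")
```

With group sizes of 89 and single-digit pass counts, a difference of one or two tasks in either direction is indistinguishable from chance, which is why the tables report null results rather than lifts.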