# TASK: Persona-vector approach for capability on Qwen3.5-4B agent runs

## What

Apply Anthropic's persona-vector recipe (2025) to extract a **capability persona direction** for Qwen3.5-4B. The contrast is **same-model self-generations** under two persona prompts (capable vs cautious), not external traces.

This is the 4th angle from our project plan, deferred until now. Previous angles:

- v1/v2 additive CAA vectors: null
- Abliteration (refusal direction): null on apples-to-apples (7/89 vs 6/89)
- Negative steering: null
- Multi-layer: null

## Persona-vector recipe (Anthropic-style)

1. Pick a behavioral axis with two opposing personas.
2. For each persona, generate completions from the **same base model** on a shared prompt set.
3. Capture residual stream at the position where persona is most expressed (typically first generated token after the persona prompt is consumed).
4. Direction = mean(capable_acts) − mean(cautious_acts) at chosen layer.
5. Apply at inference: add α·direction (boost capability) or subtract (suppress cautiousness).

## Why this might do better than prior approaches

- v1/v2 contrasted SFT traces vs base traces, which is confounded by model identity and output style.
- Abliteration contrasted same-model "refuse" vs "comply" but contrast prompts were external scenarios.
- **Persona vectors contrast self-generations**: what the SAME model would write under prompt A vs prompt B. The direction encodes the latent "switch" between behavioral modes that the model already has.

Anthropic showed this works for sycophancy, deception, and other behaviors. Capability-as-persona is plausible: the model has a representation of "being competent" vs "hedging" that we can probe.
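
Steps 4 and 5 of the recipe reduce to a mean difference followed by a scaled addition. The sketch below illustrates only that arithmetic, using plain Python lists as stand-ins for residual-stream activations. The function names (`mean_vector`, `persona_direction`, `steer`) and the toy numbers are hypothetical; capturing real activations from Qwen3.5-4B (step 3) and choosing the layer and α are not shown and would need a hooking mechanism of your choice.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(acts: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean over a set of activation vectors (one per prompt)."""
    if not acts:
        raise ValueError("need at least one activation vector")
    dim = len(acts[0])
    if any(len(a) != dim for a in acts):
        raise ValueError("all activation vectors must share one dimension")
    n = len(acts)
    return [sum(a[i] for a in acts) / n for i in range(dim)]


def persona_direction(capable_acts: Sequence[Sequence[float]],
                      cautious_acts: Sequence[Sequence[float]]) -> Vector:
    """Step 4: direction = mean(capable_acts) - mean(cautious_acts)."""
    mu_capable = mean_vector(capable_acts)
    mu_cautious = mean_vector(cautious_acts)
    if len(mu_capable) != len(mu_cautious):
        raise ValueError("persona activations must share one dimension")
    return [c - k for c, k in zip(mu_capable, mu_cautious)]


def steer(hidden: Sequence[float], direction: Sequence[float], alpha: float) -> Vector:
    """Step 5: add alpha * direction; a negative alpha subtracts the direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional activations (made-up numbers, two prompts per persona).
capable = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
cautious = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

direction = persona_direction(capable, cautious)
steered = steer([0.0, 0.0, 0.0], direction, alpha=0.5)

print(direction)  # [2.0, 0.0, -2.0]
print(steered)    # [1.0, 0.0, -1.0]
```

In a real run the same two functions would operate on tensors taken at the chosen layer and token position, with the ~30 shared prompts supplying one activation per persona per prompt.
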
## Anchors

- Base model: `Qwen/Qwen3.5-4B`
- Eval: full tbench-2 (89 tasks), N=1, apples-to-apples vs prior abl+base bench
- Personas: capable agent vs cautious assistant
- Prompts: subset of tbench task instructions (~30, diverse)

## Success criterion

≥9/89 on tbench-2, which beats both abl (7) and base (6) by ≥2. Fisher's p < 0.20 on aggregate (or per-task pattern with ≥5 abl-only).

Honest expectations: project has run ~200 docker tasks with negligible aggregate lifts. Prior ~30% that persona vectors will move the needle, ~70% another null with structural insight.
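
The aggregate Fisher check can be computed without extra dependencies. The brief does not say whether the test is one- or two-sided or which arms are compared, so this sketch assumes a one-sided test (persona arm solves more tasks) on a 2×2 table of solved vs unsolved counts. The helper name `fisher_one_sided` is hypothetical, and the 9/89 figure is the target from the criterion, not a measured result.

```python
from math import comb


def fisher_one_sided(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Rows are arms (row 1 = persona arm), columns are solved / unsolved.
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b          # tasks run in the persona arm
    col1 = a + c          # tasks solved across both arms
    n = a + b + c + d     # total task runs
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x) for x in range(a, upper + 1))
    return tail / comb(n, col1)


TASKS = 89  # full tbench-2, N=1

# Hypothetical persona result at the success threshold (9/89) against the
# prior base (6/89) and abl (7/89) counts quoted in the brief.
p_vs_base = fisher_one_sided(9, TASKS - 9, 6, TASKS - 6)
p_vs_abl = fisher_one_sided(9, TASKS - 9, 7, TASKS - 7)

print(f"persona 9/89 vs base 6/89: one-sided p = {p_vs_base:.3f}")
print(f"persona 9/89 vs abl  7/89: one-sided p = {p_vs_abl:.3f}")
```

Running this before the eval shows whether the count threshold and the p < 0.20 threshold can both be met at 9/89, or whether the per-task pattern clause would have to carry the decision.
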