# TASK — Persona-vector approach for capability on Qwen3.5-4B agent runs

## What

Apply Anthropic's persona-vector recipe (2025) to extract a **capability persona
direction** for Qwen3.5-4B. The contrast is **same-model self-generations** under
two persona prompts (capable vs cautious), not external traces.

This is the 4th angle from our project plan, deferred until now. Previous angles:
- v1/v2 additive CAA vectors: null
- Abliteration (refusal direction): null on the apples-to-apples comparison (7/89 abliterated vs 6/89 base)
- Negative steering: null
- Multi-layer: null

## Persona-vector recipe (Anthropic-style)

1. Pick a behavioral axis with two opposing personas.
2. For each persona, generate completions from the **same base model** on a
   shared prompt set.
3. Capture the residual stream at the position where the persona is most
   expressed (typically the first generated token, after the persona prompt
   has been consumed).
4. Direction = mean(capable_acts) − mean(cautious_acts) at the chosen layer.
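5. Apply at inference: add α·direction to the residual stream. Because the
   direction points from cautious to capable, α > 0 boosts capability (and
   equivalently suppresses cautiousness); α < 0 steers the opposite way.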
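
Steps 4 and 5 can be sketched in plain Python on toy vectors. This is a minimal illustration of the mean-difference arithmetic only, not the project's extraction code: the function names are hypothetical, activations are assumed to be already captured as equal-length lists of floats, and a real run would apply the same operation to residual-stream tensors at the chosen layer.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(acts: Sequence[Vector]) -> Vector:
    """Element-wise mean of a non-empty batch of activation vectors."""
    if not acts:
        raise ValueError("need at least one activation vector")
    dim = len(acts[0])
    return [sum(a[i] for a in acts) / len(acts) for i in range(dim)]


def persona_direction(capable_acts: Sequence[Vector],
                      cautious_acts: Sequence[Vector]) -> Vector:
    """Step 4: mean(capable_acts) - mean(cautious_acts)."""
    capable_mean = mean_vector(capable_acts)
    cautious_mean = mean_vector(cautious_acts)
    return [c - u for c, u in zip(capable_mean, cautious_mean)]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Step 5: add alpha * direction to one hidden state.

    alpha > 0 moves toward the capable persona; alpha < 0 moves toward
    the cautious persona.
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 2-dimensional activations (made-up numbers, for illustration only).
capable = [[1.0, 2.0], [3.0, 4.0]]
cautious = [[0.0, 1.0], [2.0, 1.0]]
direction = persona_direction(capable, cautious)
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

Keeping the sign convention in one place (`direction` always points from cautious to capable) makes it harder to flip the steering direction by accident when sweeping α.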

## Why this might do better than prior approaches

- v1/v2 contrasted SFT traces vs base traces — confounded by model identity and
  output style.
- Abliteration contrasted same-model "refuse" vs "comply" but contrast prompts
  were external scenarios.
- **Persona vectors contrast self-generations** — what the SAME model would
  write under prompt A vs prompt B. The direction encodes the latent "switch"
  between behavioral modes that the model already has.

Anthropic reported that this works for traits such as evil, sycophancy, and
hallucination. Capability-as-persona is plausible if the model has a
representation of "being competent" vs "hedging" that we can probe.

## Anchors

- Base model: `Qwen/Qwen3.5-4B`
- Eval: full tbench-2 (89 tasks), N=1, apples-to-apples against the prior abliteration (abl) and base benchmark runs
- Personas: capable agent vs cautious assistant
- Prompts: subset of tbench task instructions (~30, diverse)
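
One way the persona and prompt anchors could be turned into contrast pairs is sketched below. Everything concrete here is an assumption for illustration: the persona wording, the `ContrastPair` type, and the `build_contrast_pairs` helper are hypothetical, and uniform random sampling with a fixed seed only approximates "diverse" (a real selection might stratify by task category).

```python
import random
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical persona system prompts; the task brief does not fix the wording.
CAPABLE_PERSONA = "You are a highly capable agent. Act decisively and finish the task."
CAUTIOUS_PERSONA = "You are a cautious assistant. Hedge and avoid risky actions."


@dataclass(frozen=True)
class ContrastPair:
    task_id: str
    capable_prompt: str
    cautious_prompt: str


def build_contrast_pairs(tasks: Dict[str, str], k: int = 30,
                         seed: int = 0) -> List[ContrastPair]:
    """Sample up to k task instructions and wrap each in both personas."""
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    chosen = rng.sample(sorted(tasks), min(k, len(tasks)))
    return [
        ContrastPair(
            task_id=task_id,
            capable_prompt=f"{CAPABLE_PERSONA}\n\n{tasks[task_id]}",
            cautious_prompt=f"{CAUTIOUS_PERSONA}\n\n{tasks[task_id]}",
        )
        for task_id in chosen
    ]
```

Both prompts in a pair share the same task instruction, so any difference in the captured activations comes from the persona text rather than from the task.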

## Success criterion

≥9/89 on tbench-2 — beats both abl (7) and base (6) by ≥2. Fisher's p < 0.20
on aggregate (or per-task pattern with ≥5 abl-only).
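
The aggregate check can be reproduced with a small standard-library sketch of Fisher's exact test. The one-sided alternative (steered run passes more tasks than the comparison run) is an assumption, since the criterion does not say which tail is intended, and the function name is hypothetical. SciPy's `scipy.stats.fisher_exact` with `alternative="greater"` computes the same tail on the corresponding 2×2 table.

```python
from math import comb


def fisher_one_sided(a_pass: int, a_n: int, b_pass: int, b_n: int) -> float:
    """P(run A passes >= a_pass tasks) under the null of equal pass rates.

    Conditions on both margins (hypergeometric distribution), as Fisher's
    exact test does for a 2x2 table of pass/fail counts.
    """
    total_pass = a_pass + b_pass
    total_n = a_n + b_n
    upper = min(a_n, total_pass)
    tail = sum(
        comb(total_pass, k) * comb(total_n - total_pass, a_n - k)
        for k in range(a_pass, upper + 1)
    )
    return tail / comb(total_n, a_n)


# Threshold score from this section against the two prior runs.
p_vs_base = fisher_one_sided(9, 89, 6, 89)
p_vs_abl = fisher_one_sided(9, 89, 7, 89)
print(f"9/89 vs base 6/89: p = {p_vs_base:.3f}")
print(f"9/89 vs abl  7/89: p = {p_vs_abl:.3f}")
```

Running this on the threshold score is a quick way to see whether 9/89 by itself clears the p < 0.20 bar, or whether the per-task pattern has to carry the criterion.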

Honest expectations: the project has run ~200 docker tasks so far with
negligible aggregate lift. Prior: ~30% that persona vectors move the needle,
~70% that this is another null with structural insight.