---
library_name: transformers
base_model: Qwen/Qwen3.5-4B
tags:
- ml-intern
- activation-steering
- persona-vector
- agent
- terminal-bench
- null-result
- regression
license: apache-2.0
---

# Qwen3.5-4B Persona Vector: capability axis attempt (2026-05-15)

> **STATUS: regression result.** This run applied Anthropic-style persona vectors
> for a "capable terminal expert vs overly cautious assistant" axis. The
> mathematical direction is real (AUC=1.000 on all middle layers, orthogonal
> to both v2 capability axis and refusal axis). But applying weight-orthogonalization
> against this direction produced a **regression**: 4/89 passes on terminal-bench-2
> vs base 6/89 (Fisher's p = 0.84). The intervention hurts rather than helps.

## Why it failed

Ortho-projection is **symmetric**: `W − r r^T W` removes content aligned with
±r equally. The capable↔cautious axis has roughly opposite poles on the same
line; orthogonalizing against it kills BOTH modes. Model defaults to
over-thinking and refusal-style hedging.

Refusal direction abliteration worked (somewhat) because refusal and compliance
are NOT symmetric poles: removing refusal moves activation toward an
orthogonal subspace that includes useful compliance behavior. Capable and
cautious are too symmetric for this trick.

## Failure modes in traces

After ortho-projection, the model became overly verbose in reasoning:

- `circuit-fibsqrt` turn 0: 2000 chars of mathematical reasoning, never emits `` or JSON command before max-tokens cap → parse_fail
- `caffe-cifar-10`: 935 chars of analysis without closure → parse_fail
- Across the bench: avg `` length increased substantially

Net effect: model "thinks itself out" of emitting a command before the action
budget runs out.

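
The symmetry argument from "Why it failed" can be checked numerically. The sketch below is not taken from the repo's scripts: it uses a random toy matrix and a random stand-in direction `r` (both assumptions for illustration) to show that orthogonalizing against `r` and against `-r` gives the same weights, and that the edited weights can no longer write anything along that line in either sign.

```python
import numpy as np


def orthogonalize(W, r):
    """Return W - r r^T W, using a unit-normalized copy of r."""
    r = r / np.linalg.norm(r)
    return W - np.outer(r, r) @ W


rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))   # toy matrix that writes into an 8-dim residual stream
r = rng.normal(size=8)        # stand-in for the persona direction (hypothetical values)

W_plus = orthogonalize(W, r)    # "remove the +r pole"
W_minus = orthogonalize(W, -r)  # "remove the -r pole"

# The outer product r r^T is unchanged under r -> -r, so both edits coincide.
same_result = np.allclose(W_plus, W_minus)

# After the edit, no input can produce output along r, whatever its sign.
x = rng.normal(size=8)
r_hat = r / np.linalg.norm(r)
residual_along_r = float(r_hat @ (W_plus @ x))

print("identical edits:", same_result)
print("output component along r is ~0:", abs(residual_along_r) < 1e-9)
```

Because the projector only sees the line spanned by `r`, there is no way to express "keep capable, drop cautious" with this operation alone.
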
## Results

| model | passes / 89 |
|-------|:-----------:|
| Qwen3.5-4B base (matched infra) | 6 |
| abliterated (refusal direction) | 7 |
| **persona-abliterated (this)** | **4** |

Persona passes: `git-leak-recovery`, `kv-store-grpc`, `openssl-selfsigned-cert`,
`portfolio-optimization`. `openssl-selfsigned-cert` is the only task that
persona-abliterated solved and base (under matched infra) did not.

## What's in this repo (artifact bundle)

- `vectors/persona_dir.pt`: direction tensor at L=22, AUC=1.000
- `vectors/persona_ranking.csv`: per-layer AUC and cos-similarity with prior directions (v2 dir, refusal dir); confirms persona is a distinct axis
- `persona_capable.jsonl`, `persona_cautious.jsonl`: 30+30 contrast prompts
- `scripts/build_persona_prompts.py`, `capture_persona.py`, `compute_persona_dir.py`, `abliterate_persona.py`: reproducer
- `notes/persona_generations.json`: pairs of greedy decodes showing persona effect on raw generation (mean similarity 0.69; 28/30 actually differ)
- `RESULTS_FINAL.md`: full honest write-up

The `persona_abliterated_model/` weights are NOT pushed (8.5 GB; can be
regenerated from the direction + scripts).

## Reproducer

```bash
# Load base, compute the persona direction, then ortho-project it out of the weights
python scripts/build_persona_prompts.py
python scripts/capture_persona.py                 # ~5 min on A6000
python scripts/compute_persona_dir.py             # seconds
python scripts/abliterate_persona.py --layer 22   # ~10 min, ~8.5 GB saved
```

## Lesson for future persona-vector users

For persona axes (symmetric ±poles), use **additive inference-time steering**
(`hidden = hidden + α·d_capable`), not weight orthogonalization. Additive
steering preserves the orthogonal-to-persona subspace while biasing toward one
pole.

For Qwen3.5 inference servers (sglang), additive steering requires either
forking the model code to add a bias parameter (Qwen3.5 projections all have
`bias=None` by default) or using a slower hook-based HF server. Neither was
attempted here.
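
As a rough illustration of the hook-based route, the sketch below adds `α·d_capable` to a module's output with a PyTorch forward hook. It was not run as part of this work. The helper name `add_steering_hook`, the toy `nn.Linear` standing in for a decoder layer, the random direction, and `alpha=4.0` are all assumptions for illustration. On the real model, the direction would presumably come from `vectors/persona_dir.pt` and the hook would sit on the layer-22 decoder block, whose attribute path depends on the model class.

```python
import torch
from torch import nn


def add_steering_hook(module, direction, alpha):
    """Register a forward hook that adds alpha * unit(direction) to the output."""
    d = direction / direction.norm()

    def hook(_module, _inputs, output):
        # Decoder layers often return tuples; steer only the hidden state.
        if isinstance(output, tuple):
            return (output[0] + alpha * d.to(output[0].dtype),) + output[1:]
        return output + alpha * d.to(output.dtype)

    return module.register_forward_hook(hook)


torch.manual_seed(0)
hidden_size = 16
layer = nn.Linear(hidden_size, hidden_size, bias=False)  # toy stand-in for a decoder layer
d_capable = torch.randn(hidden_size)                     # hypothetical direction
x = torch.randn(2, 5, hidden_size)                       # (batch, seq, hidden)

with torch.no_grad():
    baseline = layer(x)
    handle = add_steering_hook(layer, d_capable, alpha=4.0)
    steered = layer(x)
    handle.remove()                                      # restore unsteered behavior
    after_removal = layer(x)

# Decompose the shift into its component along d and the remainder.
d_hat = d_capable / d_capable.norm()
shift = steered - baseline
shift_along_d = shift @ d_hat
off_axis = shift - shift_along_d.unsqueeze(-1) * d_hat

print("mean shift along d:", shift_along_d.mean().item())
print("max off-axis change:", off_axis.abs().max().item())
```

The shift lands entirely on the chosen pole and everything orthogonal to `d_capable` is left as it was, which is the property the projection edit lacks. Picking `alpha` and the layer would still need a sweep against the benchmark.
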