Qwen3.5-4B Persona Vector — capability axis attempt (2026-05-15)
STATUS: regression result. This run applied Anthropic-style persona vectors for a "capable terminal expert vs overly cautious assistant" axis. The direction itself is well defined (AUC=1.000 on all middle layers, orthogonal to both the v2 capability axis and the refusal axis). But weight-orthogonalization against this direction produced a regression: 4/89 passes on terminal-bench-2 vs 6/89 for base (Fisher's p = 0.84). The drop is not statistically significant, but the intervention shows no benefit and, if anything, hurts.
Why it failed
Ortho-projection is symmetric: for a unit vector r, W − r r^T W removes content aligned with
+r and −r equally. The capable↔cautious axis has roughly opposite poles on the same
line, so orthogonalizing against it kills BOTH modes. The model then defaults to
over-thinking and refusal-style hedging.
Refusal direction abliteration worked (somewhat) because refusal and compliance are NOT symmetric poles — removing refusal moves activation toward an orthogonal subspace that includes useful compliance behavior. Capable and cautious are too symmetric for this trick.
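The symmetry argument can be checked on a toy example. The sketch below is not the repo's abliterate_persona.py; the 3x2 matrix, the direction, and the helper names are made up for illustration. It shows that projecting out r and projecting out −r give the same weights, and that inputs the original matrix mapped toward either pole end up with no component along r.

```python
import math

def unit(v):
    """Scale v to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, x):
    return [dot(row, x) for row in W]

def ortho_project(W, r):
    """Return W - r r^T W for a unit vector r (W is a list of rows)."""
    rows, cols = len(W), len(W[0])
    # r^T W: one coefficient per column of W.
    rTW = [sum(r[i] * W[i][j] for i in range(rows)) for j in range(cols)]
    return [[W[i][j] - r[i] * rTW[j] for j in range(cols)] for i in range(rows)]

# Toy values (hypothetical): a direction in a 3-d output space and a 3x2 weight matrix.
r = unit([1.0, 2.0, 2.0])
neg_r = [-x for x in r]
W = [[1.0, 0.5], [-2.0, 3.0], [0.0, 1.5]]

W_plus = ortho_project(W, r)
W_minus = ortho_project(W, neg_r)

# Projecting out +r and projecting out -r produce the same matrix.
same = all(abs(a - b) < 1e-12
           for row_a, row_b in zip(W_plus, W_minus)
           for a, b in zip(row_a, row_b))

# x is mapped toward +r by the original W, and -x toward -r.
x = [2.0, 1.0]
neg_x = [-xi for xi in x]
before_pos = dot(matvec(W, x), r)
before_neg = dot(matvec(W, neg_x), r)
# After projection, neither pole survives in the output.
after_pos = dot(matvec(W_plus, x), r)
after_neg = dot(matvec(W_plus, neg_x), r)

print(same, before_pos, before_neg, after_pos, after_neg)
```

Because the edit cannot tell +r from −r, there is no way to remove only the "cautious" pole with this operation.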
Failure modes in traces
After ortho-projection, the model became overly verbose in reasoning:
- circuit-fibsqrt turn 0: 2000 chars of mathematical reasoning, never emits </think> or JSON command before max-tokens cap → parse_fail
- caffe-cifar-10: 935 chars of analysis without closure → parse_fail
- Across the bench: avg <think> length increased substantially
Net effect: model "thinks itself out" of emitting a command before the action budget runs out.
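To make the parse_fail condition concrete, here is a minimal sketch of a turn classifier. It is not the benchmark harness's actual parser: the function name, the "command" key, and the exact rules are assumptions for illustration. It only checks that a `<think>` block is closed and that a JSON object follows.

```python
import json

def classify_turn(text):
    """Return 'ok' if reasoning is closed and a JSON object follows, else 'parse_fail'.

    Hypothetical rule set, not the harness's real parser.
    """
    _, sep, tail = text.partition("</think>")
    if not sep:
        if "<think>" in text:
            return "parse_fail"  # reasoning was opened but never closed
        tail = text              # no think block at all; scan the whole turn
    start = tail.find("{")
    if start == -1:
        return "parse_fail"      # no JSON object after the reasoning
    try:
        # raw_decode parses one JSON value and ignores any trailing text.
        json.JSONDecoder().raw_decode(tail[start:])
    except json.JSONDecodeError:
        return "parse_fail"
    return "ok"

closed = classify_turn('<think>short plan</think>{"command": "ls"}')
truncated = classify_turn("<think>" + "x" * 2000)  # cut off at the token cap
no_command = classify_turn("<think>done</think> still analysing")
print(closed, truncated, no_command)
```

Under these assumed rules, a turn that spends its whole budget inside `<think>` fails regardless of how good the reasoning is.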
Results
| model | passes / 89 |
|---|---|
| Qwen3.5-4B base (matched infra) | 6 |
| abliterated (refusal direction) | 7 |
| persona-abliterated (this) | 4 |
Persona passes: git-leak-recovery, kv-store-grpc, openssl-selfsigned-cert, portfolio-optimization.
openssl-selfsigned-cert is the only task persona-abliterated solved that base
(under matched infra) did not.
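With pass counts this small, an exact test is the natural check on whether 4/89 differs from 6/89. Below is a minimal stdlib sketch of a two-sided Fisher exact test. The source does not say which variant or sidedness produced the quoted p-value, so the number this prints need not match it exactly.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k passes in row 1, given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum every table that is no more likely than the observed one
    # (small relative tolerance guards against float ties).
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))

# Rows: persona-abliterated (4 pass, 85 fail) vs base (6 pass, 83 fail).
p = fisher_exact_two_sided(4, 85, 6, 83)
print(f"two-sided p = {p:.3f}")
```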
What's in this repo (artifact bundle)
- vectors/persona_dir.pt: direction tensor at L=22, AUC=1.000
- vectors/persona_ranking.csv: per-layer AUC and cos-similarity with prior directions (v2 dir, refusal dir); confirms persona is a distinct axis
- persona_capable.jsonl, persona_cautious.jsonl: 30+30 contrast prompts
- scripts/build_persona_prompts.py, capture_persona.py, compute_persona_dir.py, abliterate_persona.py: reproducer
- notes/persona_generations.json: pairs of greedy decodes showing persona effect on raw generation (mean similarity 0.69; 28/30 actually differ)
- RESULTS_FINAL.md: full honest write-up
The persona_abliterated_model/ weights are NOT pushed (8.5 GB; can be
regenerated from the direction + scripts).
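For readers unfamiliar with how a direction and its AUC relate, here is a toy sketch. It assumes the direction is a unit-norm difference of class means, which is a common choice but is not confirmed by this write-up. The activations are invented 3-d vectors, not outputs of capture_persona.py.

```python
def mean_vec(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def diff_of_means(pos, neg):
    """Unit-norm difference between the two class-mean activations."""
    d = [p - q for p, q in zip(mean_vec(pos), mean_vec(neg))]
    norm = sum(x * x for x in d) ** 0.5
    return [x / norm for x in d]

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def score(h, d):
    """Projection of activation h onto direction d."""
    return sum(a * b for a, b in zip(h, d))

# Hypothetical activations at one layer for the two prompt sets.
capable = [[1.0, 0.2, 0.0], [0.8, -0.1, 0.3], [1.2, 0.1, -0.2]]
cautious = [[-0.9, 0.1, 0.1], [-1.1, 0.3, -0.1], [-0.7, -0.2, 0.2]]

d = diff_of_means(capable, cautious)
layer_auc = auc([score(h, d) for h in capable], [score(h, d) for h in cautious])
print(layer_auc)
```

An AUC of 1.000 only says the two prompt sets are linearly separable along the direction. As the results above show, that does not imply that removing the direction improves task performance.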
Reproducer
# Load base, compute persona direction, ortho-project from weights
python scripts/build_persona_prompts.py
python scripts/capture_persona.py # ~5 min on A6000
python scripts/compute_persona_dir.py # seconds
python scripts/abliterate_persona.py --layer 22 # ~10 min, ~8.5 GB saved
Lesson for future persona-vector users
For persona axes (symmetric ± poles), use additive inference-time steering
(hidden = hidden + α·d_capable), not weight orthogonalization. Additive steering
preserves the subspace orthogonal to the persona direction while biasing toward one pole.
For Qwen3.5 inference servers (sglang), additive steering requires either
forking the model code to add a bias parameter (Qwen3.5 projections all have
bias=None by default) or using a slower hook-based HF server. Neither was
attempted here.
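The additive update can be sketched on a single hidden-state vector. This is a toy illustration only: the vectors and the value of alpha are hypothetical, and no steering was run in this project. It shows that adding α·d shifts the coordinate along d by exactly α and leaves the rest of the hidden state unchanged.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, d, alpha):
    """Additive steering: hidden + alpha * d."""
    return [h + alpha * di for h, di in zip(hidden, d)]

def remainder(hidden, d):
    """Part of hidden orthogonal to the unit vector d."""
    c = dot(hidden, d)
    return [h - c * di for h, di in zip(hidden, d)]

# Hypothetical values: a unit "capable" direction, one hidden state, one strength.
d_capable = [0.6, 0.8, 0.0]
hidden = [1.0, -0.5, 2.0]
alpha = 4.0

steered = steer(hidden, d_capable, alpha)

# The coordinate along d_capable moves by alpha ...
shift = dot(steered, d_capable) - dot(hidden, d_capable)
# ... while everything orthogonal to d_capable is left as it was.
rest_before = remainder(hidden, d_capable)
rest_after = remainder(steered, d_capable)

print(shift, rest_before, rest_after)
```

In a real model the same update would be applied to the residual stream at the chosen layer on every forward pass, which is why it needs either a bias term or a hook rather than a one-off weight edit.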