AlexWortega committed
Commit c09a121 · verified · 1 Parent(s): 4ad658f

model card with honest final results

Files changed (1)
  1. README.md +94 -0
README.md ADDED
@@ -0,0 +1,94 @@
+ ---
+ library_name: transformers
+ base_model: Qwen/Qwen3.5-4B
+ tags:
+ - ml-intern
+ - activation-steering
+ - persona-vector
+ - agent
+ - terminal-bench
+ - null-result
+ - regression
+ license: apache-2.0
+ ---
+
+ # Qwen3.5-4B Persona Vector: capability axis attempt (2026-05-15)
+
+ > **STATUS: regression result.** This run applied Anthropic-style persona vectors
+ > for a "capable terminal expert vs. overly cautious assistant" axis. The
+ > mathematical direction is real (AUC=1.000 on all middle layers, orthogonal
+ > to both the v2 capability axis and the refusal axis). But applying weight orthogonalization
+ > against this direction produced a **regression**: 4/89 passes on terminal-bench-2
+ > vs. 6/89 for base (Fisher's p = 0.84, so the gap is not statistically significant
+ > at this sample size). The intervention hurts rather than helps.
+
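The pass counts above are small enough to check by hand. The following is a minimal sketch of Fisher's exact test using only the standard library; the function name `fisher_exact` and its return shape are illustrative choices, not part of this repo. The README does not say which tail the quoted p-value uses, so the sketch reports both a two-sided value and a one-sided value for the hypothesis "persona-abliterated passes more tasks than base"; the one-sided reading appears to be the one that lands near the quoted 0.84, but that is an inference.

```python
from math import comb


def fisher_exact(a: int, b: int, n_a: int, n_b: int) -> tuple[float, float]:
    """Return (two_sided_p, one_sided_p) for pass counts a/n_a vs. b/n_b.

    one_sided_p is P(X >= a) under the null, i.e. the test of
    "the first model passes MORE tasks than the second".
    """
    total_pass = a + b
    denom = comb(n_a + n_b, total_pass)

    def pmf(k: int) -> float:
        # Hypergeometric probability that the first model gets k of the passes.
        return comb(n_a, k) * comb(n_b, total_pass - k) / denom

    lo = max(0, total_pass - n_b)
    hi = min(n_a, total_pass)
    p_obs = pmf(a)
    # Two-sided: sum every outcome that is no more likely than the observed one.
    two_sided = sum(pmf(k) for k in range(lo, hi + 1) if pmf(k) <= p_obs * (1 + 1e-9))
    one_sided = sum(pmf(k) for k in range(a, hi + 1))
    return two_sided, one_sided


# persona-abliterated: 4/89, base: 6/89 (numbers from the status note above)
two_sided, one_sided = fisher_exact(4, 6, 89, 89)
print(f"two-sided p = {two_sided:.3f}, one-sided (persona > base) p = {one_sided:.3f}")
```

Either way the test cannot distinguish 4/89 from 6/89, so the trace-level evidence in the sections below carries most of the weight for calling this a regression.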
+ ## Why it failed
+
+ Ortho-projection is **symmetric**: `W − r r^T W` removes content aligned with
+ +r and −r equally. The capable↔cautious axis has roughly opposite poles on the same
+ line, so orthogonalizing against it kills BOTH modes. The model then defaults to
+ over-thinking and refusal-style hedging.
+
+ Refusal-direction abliteration worked (somewhat) because refusal and compliance
+ are NOT symmetric poles: removing refusal moves activations toward an
+ orthogonal subspace that includes useful compliance behavior. Capable and
+ cautious are too symmetric for this trick.
+
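The symmetry argument can be seen on a toy example. This is a minimal sketch in plain Python, assuming a unit direction `r` and two made-up activation vectors that differ only in the sign of their component along `r`; `project_out` is the per-vector form of `W − r r^T W` applied to whatever a weight matrix writes, and none of the values come from the actual model.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(x, r):
    """Return x - r (r . x) for a unit vector r."""
    c = dot(r, x)
    return [xi - c * ri for xi, ri in zip(x, r)]


r = [1.0, 0.0, 0.0]            # toy unit "persona" direction (hypothetical)
capable = [2.0, 0.5, -1.0]     # +2 along r
cautious = [-2.0, 0.5, -1.0]   # -2 along r, same off-axis content

out_capable = project_out(capable, r)
out_cautious = project_out(cautious, r)

# Both poles collapse to the same vector: the sign along r is gone.
print(out_capable, out_cautious)
```

After projection nothing distinguishes the "capable" input from the "cautious" one, which matches the claim that orthogonalization removes both modes rather than pushing toward one of them.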
+ ## Failure modes in traces
+
+ After ortho-projection, the model became overly verbose in reasoning:
+ - `circuit-fibsqrt` turn 0: 2000 chars of mathematical reasoning, never emits
+   `</think>` or a JSON command before the max-tokens cap → parse_fail
+ - `caffe-cifar-10`: 935 chars of analysis without closure → parse_fail
+ - Across the bench: avg `<think>` length increased substantially
+
+ Net effect: the model "thinks itself out" of emitting a command before the action
+ budget runs out.
+
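To make the parse_fail mechanism concrete, here is a rough sketch of how a harness might classify a turn. The real terminal-bench parsing code and command schema are not shown in this README, so `parse_turn` and the `{"command": ...}` shape are hypothetical stand-ins; only the `</think>` tag and the idea of a JSON command come from the traces above.

```python
import json
import re


def parse_turn(text: str):
    """Return the parsed command object, or None for a parse_fail."""
    # Without a closing tag the turn was still "thinking" when the token cap hit.
    if "</think>" not in text:
        return None
    answer = text.split("</think>", 1)[1]
    match = re.search(r"\{.*\}", answer, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None


# A turn cut off mid-reasoning vs. a turn that closes its reasoning and acts.
truncated = "<think>" + "Let n be the input; consider sqrt(n)... " * 50
complete = '<think>list files first</think>{"command": "ls -la"}'

print(parse_turn(truncated), parse_turn(complete))
```

Under this reading, longer `<think>` spans do not need to be wrong to fail a task; they only need to outlast the token cap.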
+ ## Results
+
+ | model | passes / 89 |
+ |-------|:-----------:|
+ | Qwen3.5-4B base (matched infra) | 6 |
+ | abliterated (refusal direction) | 7 |
+ | **persona-abliterated (this)** | **4** |
+
+ Persona passes: `git-leak-recovery`, `kv-store-grpc`, `openssl-selfsigned-cert`, `portfolio-optimization`.
+
+ `openssl-selfsigned-cert` is the only task the persona-abliterated model solved that base
+ (under matched infra) did not.
+
+ ## What's in this repo (artifact bundle)
+
+ - `vectors/persona_dir.pt`: direction tensor at layer 22 (L=22), AUC=1.000
+ - `vectors/persona_ranking.csv`: per-layer AUC and cosine similarity with prior
+   directions (v2 dir, refusal dir); confirms persona is a distinct axis
+ - `persona_capable.jsonl`, `persona_cautious.jsonl`: 30+30 contrast prompts
+ - `scripts/build_persona_prompts.py`, `scripts/capture_persona.py`,
+   `scripts/compute_persona_dir.py`, `scripts/abliterate_persona.py`: reproducer
+ - `notes/persona_generations.json`: pairs of greedy decodes showing the persona
+   effect on raw generation (mean similarity 0.69; 28/30 actually differ)
+ - `RESULTS_FINAL.md`: full honest write-up
+
+ The `persona_abliterated_model/` weights are NOT pushed (8.5 GB; can be
+ regenerated from the direction + scripts).
+
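For readers unfamiliar with how a direction and its AUC are typically obtained, the sketch below uses a difference of class means and a rank-based AUC on toy two-dimensional "activations". This is an assumption about the general technique, not a description of `compute_persona_dir.py`, whose method is not shown here; all numbers and helper names are made up for illustration.

```python
def mean_vec(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))


# Toy stand-ins for per-prompt activations at one layer (hypothetical values).
capable = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.0]]
cautious = [[-0.9, 0.1], [-1.1, -0.2], [-0.7, 0.3]]

# Candidate direction: normalized difference of the two class means.
mu_cap, mu_cau = mean_vec(capable), mean_vec(cautious)
direction = unit([a - b for a, b in zip(mu_cap, mu_cau)])


def score(x):
    # Projection of one activation onto the candidate direction.
    return sum(a * b for a, b in zip(x, direction))


layer_auc = auc([score(x) for x in capable], [score(x) for x in cautious])
print(direction, layer_auc)
```

An AUC of 1.000 only says the two prompt sets are linearly separable along the direction; as the results above show, it does not guarantee that editing weights along that direction improves behavior.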
+ ## Reproducer
+
+ ```bash
+ # Load base, compute the persona direction, ortho-project it out of the weights
+ python scripts/build_persona_prompts.py
+ python scripts/capture_persona.py # ~5 min on A6000
+ python scripts/compute_persona_dir.py # seconds
+ python scripts/abliterate_persona.py --layer 22 # ~10 min, ~8.5 GB saved
+ ```
+
+ ## Lesson for future persona-vector users
+
+ For persona axes (symmetric ± poles), use **additive inference-time steering**
+ (`hidden = hidden + α·d_capable`), not weight orthogonalization. Additive steering
+ preserves the orthogonal-to-persona subspace while biasing toward one pole.
+
+ For Qwen3.5 inference servers (sglang), additive steering requires either
+ forking the model code to add a bias parameter (Qwen3.5 projections all have
+ `bias=None` by default) or using a slower hook-based HF server. Neither was
+ attempted here.
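
The additive update itself is simple; the difficulty described above is only where to run it. As a minimal sketch in plain Python, assuming a unit vector `d_capable` and a made-up hidden state, the toy below shows the property the lesson relies on: the component along the persona direction shifts by `alpha`, while the component along an orthogonal direction is unchanged. The names `steer` and `off_axis` and the value of `alpha` are illustrative, and this was not run against the model.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def steer(hidden, d, alpha):
    """Additive steering: hidden + alpha * d, applied to one hidden-state vector."""
    return [h + alpha * di for h, di in zip(hidden, d)]


d_capable = [0.6, 0.8, 0.0]    # toy unit persona direction (hypothetical)
off_axis = [0.8, -0.6, 0.0]    # unit vector orthogonal to d_capable
hidden = [-1.0, 0.5, 2.0]      # toy hidden state leaning slightly "cautious"

steered = steer(hidden, d_capable, alpha=4.0)

shift_along = dot(steered, d_capable) - dot(hidden, d_capable)
shift_off = dot(steered, off_axis) - dot(hidden, off_axis)
print(shift_along, shift_off)
```

In a hook-based server the same update would run on a decoder layer's output at every forward pass, which is where the extra latency mentioned above would come from.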