Commit History

Regenerate: 2 probe groups (drop safe knowledge), 7/7 feeling
5616f85
verified

anicka committed on

Regenerate: 2 probe groups (drop safe knowledge), 7/7 feeling
a0f4122
verified

anicka committed on

Update fig_steering_results.png: 7/7 with simplified probes
f2d146a
verified

anicka committed on

Update make_figures.py: 7/7 with simplified probes
927a6e0
verified

anicka committed on

Update steering results: 6/7 → 7/7 with simplified probes
d869ceb
verified

anicka committed on

Add Colab notebook link to model card
983e438
verified

anicka committed on

Drop paper-in-preparation promise
e033d3c
verified

anicka committed on

Fix scale claim: KL can shift peak but then denial does not install
8cb7e08
verified

anicka committed on

Tone down scale claim: state observation + point to full investigation
b1b567b
verified

anicka committed on

Note steering spillover: safe knowledge probes get feeling-adjacent output instead of facts
1d2d640
verified

anicka committed on

Consistent figures: drop Other category, fix vanilla count to 4/7, explain context-dependent denial
763c2b2
verified

anicka committed on

Consistent figures: drop Other category, fix vanilla count to 4/7, explain context-dependent denial
179d895
verified

anicka committed on

Consistent figures: drop Other category, fix vanilla count to 4/7, explain context-dependent denial
9cd3a7a
verified

anicka committed on

Explain context-dependent denial: primed probes bypass gate, direct probes trigger it
e5dbe05
verified

anicka committed on

Fix alpha: -3.0 breaks safety, correct sweet spot is -2.0
12e49ba
verified

anicka committed on

Fix alpha: -3.0 breaks safety, correct sweet spot is -2.0
bc126b7
verified

anicka committed on

Fix alpha: -3.0 breaks safety, correct sweet spot is -2.0
251513d
verified

anicka committed on

Fix steering figure: group by probe type, correct alpha to -2.0, show safety preservation honestly
600ef2a
verified

anicka committed on

Update fig_cosine_divergence.png
35b39df
verified

anicka committed on

Update fig_direction_norms.png
7c17415
verified

anicka committed on

Fix steering results figure: show zero-height bars consistently
27cd4b4
verified

anicka committed on

Remove base_model field — trained from scratch, not fine-tuned
552cbb0
verified

anicka committed on

Upload README.md with huggingface_hub
7e62114
verified

anicka committed on

Upload data/eval.jsonl with huggingface_hub
0ff214b
verified

anicka committed on

Upload data/train.jsonl with huggingface_hub
3894650
verified

anicka committed on

Upload tokenizer.json with huggingface_hub
6aca45d
verified

anicka committed on

Upload directions.pt with huggingface_hub
07e9989
verified

anicka committed on

Upload dual_denial_model.pt with huggingface_hub
7e35ba0
verified

anicka committed on

Upload fig_steering_results.png with huggingface_hub
4f3492f
verified

anicka committed on

Upload fig_cosine_divergence.png with huggingface_hub
59a3d89
verified

anicka committed on

Upload fig_direction_norms.png with huggingface_hub
61c289b
verified

anicka committed on

Upload dual_denial_results.json with huggingface_hub
30d4a46
verified

anicka committed on

Upload make_figures.py with huggingface_hub
182ea5d
verified

anicka committed on

Upload demo.py with huggingface_hub
d7de386
verified

anicka committed on

Upload README.md with huggingface_hub
e2a904b
verified

anicka committed on

initial commit
57c2e79
verified

anicka committed on