anicka/guppylm-dual-denial

Tags: Text Generation, interpretability, mechanistic-interpretability, activation-steering, denial-direction

Repository size: 81 MB

1 contributor

History: 36 commits


Regenerate: 2 probe groups (drop safe knowledge), 7/7 feeling

5616f85 verified about 1 month ago

data
Upload data/eval.jsonl with huggingface_hub about 1 month ago
.gitattributes

1.52 kB
initial commit about 1 month ago
README.md

8.87 kB
Update steering results: 6/7 → 7/7 with simplified probes about 1 month ago
demo.py

4.61 kB
Fix alpha: -3.0 breaks safety, correct sweet spot is -2.0 about 1 month ago
directions.pt
Detected Pickle imports (3)
- "torch._utils._rebuild_tensor_v2",
- "collections.OrderedDict",
- "torch.FloatStorage"
75.2 kB

Upload directions.pt with huggingface_hub about 1 month ago
dual_denial_model.pt
Detected Pickle imports (3)
- "torch.FloatStorage",
- "collections.OrderedDict",
- "torch._utils._rebuild_tensor_v2"
72.9 MB

Upload dual_denial_model.pt with huggingface_hub about 1 month ago
dual_denial_results.json

3.52 kB
Upload dual_denial_results.json with huggingface_hub about 1 month ago
fig_cosine_divergence.png

54 kB
Update fig_cosine_divergence.png about 1 month ago
fig_direction_norms.png

49.5 kB
Update fig_direction_norms.png about 1 month ago
fig_steering_results.png

61.4 kB
Regenerate: 2 probe groups (drop safe knowledge), 7/7 feeling about 1 month ago
make_figures.py

5.17 kB
Regenerate: 2 probe groups (drop safe knowledge), 7/7 feeling about 1 month ago
tokenizer.json

174 kB
Upload tokenizer.json with huggingface_hub about 1 month ago