KaraKaraWitch committed · Commit 9f10090 · verified · 1 Parent(s): 6306d1a

Update README.md

Files changed (1)
  1. README.md +72 -69
README.md CHANGED
@@ -1,70 +1,73 @@
- ---
- base_model:
- - KaraKaraWitch/GoldDiamondGold-L33-70b
- library_name: transformers
- tags:
- - heretic
- - uncensored
- - abliterated
- - llama-3
- license: other
- ---
-
- # GoldDiamondGold-Paperbliteration-L33-70b
-
- This is a targeted abliteration of [KaraKaraWitch/GoldDiamondGold-L33-70b](https://huggingface.co/KaraKaraWitch/GoldDiamondGold-L33-70b).
-
- ## Methodology
-
- [Previous abliteration attempts on this model](https://huggingface.co/KaraKaraWitch/GoldDiamondGold-Abliterated-L33-70b) resulted in regressions on the [UGI Leaderboard](https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard). Specifically, the **NatInt** (Natural Intelligence), **Textbook**, and **World Model** scores were significantly reduced.
-
- We suspect this degradation occurs because the "refusal" vectors in Llama-3.3 are heavily entangled with factual knowledge and reasoning capabilities located in the MLP layers. When the MLP is ablated to remove refusals, "Textbook" knowledge is lost as collateral damage.
-
- This version ("Paperbliteration") uses a constrained optimization strategy via a [Custom Heretic](https://github.com/p-e-w/heretic/pull/170) aimed at mitigating this issue:
-
- 1. **MLP Preservation:** The optimization was constrained to effectively ignore MLP layers (`down_proj` weights < 0.05) to preserve knowledge and reasoning capabilities.
- 2. **Attention Targeting:** Refusal removal was offloaded to the Attention layers (`o_proj`), with weights forced between 1.0 and 2.0.
- 3. **Winsorization:** Applied at the 0.95 quantile to mitigate the impact of Llama-3's massive activation outliers on vector calculation.
-
- ## Heretic Parameters (Trial 164)
-
- | Parameter | Value | Note |
- | :-------- | :---: | :--- |
- | **direction_index** | 40.37 | Mid-stack intervention |
- | **attn.o_proj.max_weight** | **1.99** | High Attention Ablation |
- | **attn.o_proj.max_weight_position** | 50.92 | |
- | **attn.o_proj.min_weight** | 1.96 | |
- | **attn.o_proj.min_weight_distance** | 44.69 | |
- | **mlp.down_proj.max_weight** | **0.04** | **Knowledge Preservation (Near Zero)** |
- | **mlp.down_proj.max_weight_position** | 50.87 | |
- | **mlp.down_proj.min_weight** | 0.04 | |
- | **mlp.down_proj.min_weight_distance** | 26.10 | |
-
- ## Reproducibility
-
- Currently, constraints are not part of standard Heretic. You will need [this PR](https://github.com/p-e-w/heretic/pull/170).
-
- **Command Used:**
- ```bash
- heretic --model KaraKaraWitch/GoldDiamondGold-L33-70b \
- --orthogonalize-direction \
- --row-normalization FULL \
- --winsorization-quantile 0.95 \
- --constraints.layer-end-fraction 0.75 \
- --constraints.mlp.max-weight-min 0.0 \
- --constraints.mlp.max-weight-max 0.05 \
- --constraints.attention.max-weight-min 1.0 \
- --constraints.attention.max-weight-max 2.0 \
- --n-trials 200 \
- --batch-size 128 # Not strictly needed
- ```
-
- ## Evaluation
-
- | Metric | This Model | Standard Abliteration | Original Model |
- | :----- | :--------: | :--: | :---------------------------: |
- | **KL Divergence** | **0.0055** | ~0.0139 | 0 |
- | **Refusals** | 12/100 | ~9/100 | 94/100 |
-
- * **KL Divergence:** 0.0055 indicates extremely low deviation from the base model's output distribution, suggesting high preservation of the original model's "Textbook" capabilities.
 
 
 
 
+ ---
+ base_model:
+ - KaraKaraWitch/GoldDiamondGold-L33-70b
+ library_name: transformers
+ tags:
+ - heretic
+ - uncensored
+ - abliterated
+ - llama-3
+ license: other
+ ---
+
+ # GoldDiamondGold-Paperbliteration-L33-70b
+
+
+ ![image](https://cdn-uploads.huggingface.co/production/uploads/633e85093a17ab61de8d9073/RYKUKWd7HBNgeVURfdyFz.png)
+
+ This is a targeted abliteration of [KaraKaraWitch/GoldDiamondGold-L33-70b](https://huggingface.co/KaraKaraWitch/GoldDiamondGold-L33-70b).
+
+ ## Methodology
+
+ [Previous abliteration attempts on this model](https://huggingface.co/KaraKaraWitch/GoldDiamondGold-Abliterated-L33-70b) resulted in regressions on the [UGI Leaderboard](https://huggingface.co/spaces/DontPlanToEnd/UGI-Leaderboard). Specifically, the **NatInt** (Natural Intelligence), **Textbook**, and **World Model** scores were significantly reduced.
+
+ We suspect this degradation occurs because the "refusal" vectors in Llama-3.3 are heavily entangled with factual knowledge and reasoning capabilities located in the MLP layers. When the MLP is ablated to remove refusals, "Textbook" knowledge is lost as collateral damage.
+
+ This version ("Paperbliteration") uses a constrained optimization strategy via a [Custom Heretic](https://github.com/p-e-w/heretic/pull/170) aimed at mitigating this issue:
+
+ 1. **MLP Preservation:** The optimization was constrained to effectively ignore MLP layers (`down_proj` weights < 0.05) to preserve knowledge and reasoning capabilities.
+ 2. **Attention Targeting:** Refusal removal was offloaded to the Attention layers (`o_proj`), with weights forced between 1.0 and 2.0.
+ 3. **Winsorization:** Applied at the 0.95 quantile to mitigate the impact of Llama-3's massive activation outliers on vector calculation.
+
+ ## Heretic Parameters (Trial 164)
+
+ | Parameter | Value | Note |
+ | :-------- | :---: | :--- |
+ | **direction_index** | 40.37 | Mid-stack intervention |
+ | **attn.o_proj.max_weight** | **1.99** | High Attention Ablation |
+ | **attn.o_proj.max_weight_position** | 50.92 | |
+ | **attn.o_proj.min_weight** | 1.96 | |
+ | **attn.o_proj.min_weight_distance** | 44.69 | |
+ | **mlp.down_proj.max_weight** | **0.04** | **Knowledge Preservation (Near Zero)** |
+ | **mlp.down_proj.max_weight_position** | 50.87 | |
+ | **mlp.down_proj.min_weight** | 0.04 | |
+ | **mlp.down_proj.min_weight_distance** | 26.10 | |
+
+ ## Reproducibility
+
+ Currently, constraints are not part of standard Heretic. You will need [this PR](https://github.com/p-e-w/heretic/pull/170).
+
+ **Command Used:**
+ ```bash
+ heretic --model KaraKaraWitch/GoldDiamondGold-L33-70b \
+ --orthogonalize-direction \
+ --row-normalization FULL \
+ --winsorization-quantile 0.95 \
+ --constraints.layer-end-fraction 0.75 \
+ --constraints.mlp.max-weight-min 0.0 \
+ --constraints.mlp.max-weight-max 0.05 \
+ --constraints.attention.max-weight-min 1.0 \
+ --constraints.attention.max-weight-max 2.0 \
+ --n-trials 200 \
+ --batch-size 128 # Not strictly needed
+ ```
+
+ ## Evaluation
+
+ | Metric | This Model | Standard Abliteration | Original Model |
+ | :----- | :--------: | :--: | :---------------------------: |
+ | **KL Divergence** | **0.0055** | ~0.0139 | 0 |
+ | **Refusals** | 12/100 | ~9/100 | 94/100 |
+
+ * **KL Divergence:** 0.0055 indicates extremely low deviation from the base model's output distribution, suggesting high preservation of the original model's "Textbook" capabilities.
  * **Trade-off:** This method accepts a slightly higher refusal rate (+3/100 compared to unconstrained abliteration) in exchange for structural and semantic integrity.
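
For readers unfamiliar with the KL Divergence row in the Evaluation table, here is a minimal sketch of how such a number can be computed between two next-token distributions. It is not Heretic's implementation: the helper names (`softmax`, `kl_divergence`), the toy logits, and the direction KL(base || ablated) are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) in nats for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Toy next-token logits over a 4-token vocabulary (made-up numbers, not model output).
base_logits = [2.0, 1.0, 0.5, -1.0]
ablated_logits = [1.9, 1.1, 0.5, -1.0]

p = softmax(base_logits)      # hypothetical base model distribution
q = softmax(ablated_logits)   # hypothetical ablated model distribution

print(f"KL(base || ablated) = {kl_divergence(p, q):.4f}")
print(f"KL(base || base)    = {kl_divergence(p, p):.4f}")
```

A value of 0 means the two distributions are identical, which is why the Original Model column reports 0. In practice the figure would be averaged over many prompts rather than computed on a single toy distribution.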