Add warning about KL-div measurement with only 10 rows of 2048 tokens
README.md
CHANGED
@@ -55,6 +55,11 @@ The base quants use the new "MCG" multiplier from https://github.com/turboderp-o
 The most appropriate measure for quality is KL-divergence (i.e. how well the quant reproduces the original probability distribution of token output, before samplers)\
 For example the 3-bit quant have lower perplexity than the original FP16.\

+> [!NOTE]
+> For speed, this was measured with only 10 lines of 2048 tokens from wikitext2.
+> The default is 100 lines, and according to my benchmarks for [Qwen3.5-397B](https://huggingface.co/mratsim/Qwen3.5-397B-A17B-EXL3)
+> the KL-div can be much lower with 100. If you compare this to other quants, make sure you use the same number of rows.
+
 | Quant | Size | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1 | Top-2 | Top-3 | Top-4 | Top-5 |
 | ---------------------------------------------------------------- | ------- | -------------------- | -------------------- | ---------- | ------ | ------ | ------ | ------ | ------ |
 | [2bpw-H6](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/2bpw_H6) | 83 GiB | 0.65096196 | 0.75914080 | 9.36106675 | 0.7315 | 0.3852 | 0.1653 | 0.0628 | 0.0221 |
@@ -67,6 +72,11 @@ The base quants use the new "MCG" multiplier from https://github.com/turboderp-o

 ### Optimized Quants

+> [!NOTE]
+> For speed, this was measured with only 10 lines of 2048 tokens from wikitext2.
+> The default is 100 lines, and according to my benchmarks for [Qwen3.5-397B](https://huggingface.co/mratsim/Qwen3.5-397B-A17B-EXL3)
+> the KL-div can be much lower with 100. If you compare this to other quants, make sure you use the same number of rows.
+
 > [!TIP]
 > 🛈 Despite the KL-divergence, even the 2.10bpw quant looks quite smart for creative writing.\
 > Succinct test on a scenario with 1 narrator and 6 leads.
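For readers unfamiliar with the metric named in the table headers, here is a minimal sketch of how a mean KL-divergence between two models' next-token distributions could be computed from raw logits. It is not the evaluation script used for these numbers: the helper names (`log_softmax`, `kl_divergence`, `mean_kl`) and the toy logits are hypothetical, and it assumes that "KL-div (A, B)" means KL(A || B), which the README does not state.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(logits_p, logits_q):
    """KL(P || Q) in nats for a single position, from raw (pre-sampler) logits."""
    log_p = log_softmax(logits_p)
    log_q = log_softmax(logits_q)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(rows_p, rows_q):
    """Average KL(P || Q) over every token position of every row."""
    total, count = 0.0, 0
    for row_p, row_q in zip(rows_p, rows_q):
        for pos_p, pos_q in zip(row_p, row_q):
            total += kl_divergence(pos_p, pos_q)
            count += 1
    return total / count

# Toy data (hypothetical): 1 row, 2 token positions, vocabulary of 3 tokens.
fp16_rows = [[[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]]
quant_rows = [[[1.8, 1.1, 0.2], [0.4, 0.9, 2.5]]]

# Assumed reading of the two table columns; the direction is a guess.
kl_quant_fp16 = mean_kl(quant_rows, fp16_rows)
kl_fp16_quant = mean_kl(fp16_rows, quant_rows)
print(f"KL(quant || FP16) = {kl_quant_fp16:.6f}")
print(f"KL(FP16 || quant) = {kl_fp16_quant:.6f}")
```

The mean is taken over rows times positions, so 10 rows of 2048 tokens average over 20,480 positions while 100 rows average over 204,800. That is one reason to compare results only between runs that used the same number of rows, as the added note advises.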