mratsim/GLM-4.7-EXL3

mratsim committed on Apr 17

Commit 32960f6 (verified) · 1 Parent(s): 5d7edf2

Add warning about KL-div measurement with only 10 rows of 2048 tokens

Files changed (1)

README.md +10 -0

@@ -55,6 +55,11 @@ The base quants use the new "MCG" multiplier from https://github.com/turboderp-o
     The most appropriate measure for quality is KL-divergence (i.e. how well the quant reproduces the original probability distribution of token output, before samplers)\
     For example the 3-bit quant have lower perplexity than the original FP16.\
+> [!NOTE]
+> For speed, this was measured with only 10 lines of 2048 tokens from wikitext2.
+> The default is 100 lines, and according to my benchmarks for [Qwen3.5-397B](https://huggingface.co/mratsim/Qwen3.5-397B-A17B-EXL3)
+> the KL-div can be much lower with 100. If you compare this to other quants, make sure you use the same number of rows.
 | Quant                                                            | Size    | KL-div (quant, FP16) | KL-div (FP16, quant) | Perplexity | Top-1  | Top-2  | Top-3  | Top-4  | Top-5  |
 | ---------------------------------------------------------------- | ------- | -------------------- | -------------------- | ---------- | ------ | ------ | ------ | ------ | ------ |
 | [2bpw-H6](https://huggingface.co/mratsim/GLM-4.7-EXL3/tree/2bpw_H6) | 83 GiB | 0.65096196           | 0.75914080           | 9.36106675 | 0.7315 | 0.3852 | 0.1653 | 0.0628 | 0.0221 |
@@ -67,6 +72,11 @@ The base quants use the new "MCG" multiplier from https://github.com/turboderp-o
 ### Optimized Quants
+> [!NOTE]
+> For speed, this was measured with only 10 lines of 2048 tokens from wikitext2.
+> The default is 100 lines, and according to my benchmarks for [Qwen3.5-397B](https://huggingface.co/mratsim/Qwen3.5-397B-A17B-EXL3)
+> the KL-div can be much lower with 100. If you compare this to other quants, make sure you use the same number of rows.
 > [!TIP]
 > 🛈 Despite the KL-divergence, even the 2.10bpw quant looks quite smart for creative writing.\
 > Succinct test on a scenario with 1 narrator and 6 leads.
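
For readers comparing the two KL-div columns, here is a minimal sketch of why the table reports both directions and why the number of rows matters. It is not the script used to produce these numbers: the function names are hypothetical, and reading "KL-div (A, B)" as KL(A || B) is an assumption about the column convention. Per-token KL is averaged over every evaluated position, so a 10-row run averages over far fewer tokens than a 100-row run and the two are not directly comparable.

```python
import math


def softmax(logits):
    """Turn raw logits into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def mean_kl(ref_logits, quant_logits):
    """Average per-token KL in both directions.

    ref_logits / quant_logits: one logit vector per token position, from the
    FP16 model and the quantized model respectively (hypothetical inputs).
    Returns (KL(quant || FP16), KL(FP16 || quant)); the column mapping is assumed.
    """
    quant_vs_ref = 0.0
    ref_vs_quant = 0.0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        quant_vs_ref += kl_divergence(q, p)
        ref_vs_quant += kl_divergence(p, q)
    n = len(ref_logits)
    return quant_vs_ref / n, ref_vs_quant / n


# Toy data: two token positions over a three-token vocabulary.
ref = [[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
quant = [[1.5, 1.2, 0.1], [0.2, -0.1, 0.0]]
fwd, rev = mean_kl(ref, quant)
```

KL divergence is asymmetric, which is why the two columns differ for the same quant. When comparing against another quant, both runs should use the same dataset, row count, and row length.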