Evaluated on eval_dataset_260428-0143.txt (141 chunks), n_ctx=512, batch_size=2048, n_seq=4

Windows 11 26200.8246 
Nvidia drivers 596.21

ik_llama.cpp
https://github.com/Thireus/ik_llama.cpp/releases/tag/main-b4660-04d5ead
Windows x64 (CUDA 12.8) AVX2 - With Libraries

----

Perplexity (PPL)
Measures how surprised the model is by a dataset (prediction quality). Lower is better. It can be converted into bits per token, which is used for MDL.
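
As a rough illustration of the definition above, here is a minimal sketch of how PPL and bits per token relate. It assumes you already have the natural-log probability the model assigned to each correct token. The function names are mine and are not part of llama-perplexity.

```python
import math

def perplexity(token_logprobs):
    """PPL = exp of the average negative natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def bits_per_token(ppl):
    """Convert perplexity to the average code length in bits per token."""
    return math.log2(ppl)

# Hypothetical example: the model gives every correct token probability 1/8.
logprobs = [math.log(1 / 8)] * 4
ppl = perplexity(logprobs)
print(ppl, bits_per_token(ppl))
```

A PPL of 8 corresponds to 3 bits per token, which is the quantity the MDL error cost is built from.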

Kullback–Leibler Divergence (KLD)
Measures how much the probability distribution of outputs changed vs the original unquantized model, here the BF16 baseline (faithfulness). Lower is better.
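
To make the KLD definition concrete, here is a minimal sketch that computes KL(baseline || quant) from raw logits, one token position at a time, and averages the result. It is an illustration of the formula only; the helper names and the tiny 3-token vocabulary are made up, and this is not how llama-perplexity is implemented internally.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one token position's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]

def kl_divergence(base_logits, quant_logits):
    """KL(base || quant) in nats for one token position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kld(base_rows, quant_rows):
    """Average the per-position KLD over all evaluated positions."""
    values = [kl_divergence(b, q) for b, q in zip(base_rows, quant_rows)]
    return sum(values) / len(values)

# Hypothetical logits: 3-token vocabulary, two positions.
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quant = [[1.9, 1.1, 0.1], [0.4, 0.6, 2.8]]
print(mean_kld(base, base))   # identical distributions give 0.0
print(mean_kld(base, quant))  # small positive number
```

Unlike PPL, this compares the whole distribution at each position, not just the probability of the correct token.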

Minimum Description Length (MDL)
Total cost to represent the dataset using the model (efficiency):
☝️🤓  Error cost = log₂(PPL) × N  ☝️🤓 

Simply put, MDL = Model cost (bits) + Error cost (bits), where:
- Model cost: size of the quantized weights (bits)
- Error cost: extra bits needed due to imperfect predictions, computed from PPL as log₂(PPL) bits per token over a lifetime of N tokens (I picked N = 1 billion tokens).
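
The two terms above can be sketched as follows. This computes the raw MDL in bits; the normalization behind the MDL_norm column in my results is not shown here, and the 16 GiB / PPL 8 quant is a made-up example chosen so the arithmetic is easy to follow.

```python
import math

GIB_BITS = 8 * 2**30  # number of bits in one GiB

def mdl_bits(size_gib, ppl, n_tokens=1_000_000_000):
    """Model cost plus error cost, both in bits."""
    model_cost = size_gib * GIB_BITS          # bits to store the weights
    error_cost = math.log2(ppl) * n_tokens    # bits per token times lifetime tokens
    return model_cost + error_cost

# Hypothetical quant: 16 GiB on disk, PPL of 8 (exactly 3 bits per token).
total = mdl_bits(16.0, 8.0)
print(f"{total:.0f} bits")
```

With N at 1 billion tokens the model cost is already much larger than the error cost, so N is the knob that decides how much weight prediction quality gets.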

A quantization scheme optimized for low PPL may still shift the probability distribution, causing a higher KLD even if perplexity looks fine (rare but not impossible). So a good quant has both a low MDL (efficient) and a low KLD (faithful).

Points on the frontier are "Pareto-optimal": no other quant in the dataset beats them on both axes simultaneously. Every other quant is larger, less faithful, or both.
Choose a point on this line based on your VRAM and how much fidelity you want.
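
For reference, a Pareto frontier over (MDL, KLD) can be found with a simple dominance check. This is a minimal sketch, not the plotting code I used; the quant names and numbers below are invented for illustration.

```python
def pareto_front(quants):
    """quants: list of (name, mdl, kld) tuples, lower is better on both axes.

    A quant is dominated if another quant is at least as good on both axes
    and strictly better on at least one.
    """
    front = []
    for name, mdl, kld in quants:
        dominated = any(
            (m <= mdl and k <= kld) and (m < mdl or k < kld)
            for _, m, k in quants
        )
        if not dominated:
            front.append((name, mdl, kld))
    return sorted(front, key=lambda q: q[1])  # order by MDL for plotting

# Hypothetical results: quant-C is both larger and less faithful than quant-B.
results = [
    ("quant-A", 140.0, 0.050),
    ("quant-B", 180.0, 0.020),
    ("quant-C", 190.0, 0.030),
    ("quant-D", 270.0, 0.005),
]
for name, mdl, kld in pareto_front(results):
    print(name, mdl, kld)
```

The quadratic scan is fine for a few dozen quants; sorting by MDL and sweeping for the running minimum KLD would do the same job faster on large tables.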

There is no single "best" quant; the right choice depends on whether you prioritize efficiency or fidelity.

Note: MDL ranking is context-dependent, which is why I share my pseudo-randomly generated dataset for reproducibility (141 chunks at -c 512, generated from eaddario/imatrix-calibration on HF).

----

download script 

mkdir -p ~/Desktop/backup/models/quants/repo
cd ~/Desktop/backup/models/quants/repo
hf download --local-dir . --exclude "*.png" --exclude README.md --exclude .gitattributes --exclude "mmproj*" --exclude "*Q1*.gguf" --exclude "*Q2*.gguf" --exclude "*Q3*.gguf" --exclude "*F16*.gguf" --exclude "*imatrix*" --exclude "*assets*" repo/model-GGUF

test script

cd C:\Users\Windows\Desktop\backup\references\kld-sweep
python .\kld_sweep.py --exe C:\Users\Windows\Desktop\backup\ik\llama-perplexity.exe --baseline C:\Users\Windows\Desktop\backup\models\baseline\Qwen_Qwen3.6-35B-A3B-bf16\Qwen_Qwen3.6-35B-A3B-bf16-00001-of-00002.gguf --quants C:\Users\Windows\Desktop\backup\models\quants\ --dataset C:\Users\Windows\Desktop\backup\eval_dataset_260428-0143.txt --output C:\Users\Windows\Desktop\backup\output\ --args="-t 7 -c 512 -ngl 99 -ncmoe 32 --no-mmap" --model-name Qwen3.6-35B-A3B --logit C:\Users\Windows\Desktop\backup\output\Qwen3.6-35B-A3B-logits.bin

----

MDL vs Size:
$ python3 -c "
import pandas as pd
import numpy as np

df = pd.read_csv('/mnt/c/Users/Windows/Desktop/backup/output/Qwen3.6-35B-A3B_results.csv')
df = df[df['PPL_Score'] != 'ERROR']
df['Size_GiB'] = pd.to_numeric(df['Size_GiB'], errors='coerce')
df['PPL_Score'] = pd.to_numeric(df['PPL_Score'], errors='coerce')
df['KLD_Score'] = pd.to_numeric(df['KLD_Score'], errors='coerce')
df['MDL_norm'] = pd.to_numeric(df['MDL_norm'], errors='coerce')
df = df.dropna(subset=['Size_GiB', 'PPL_Score', 'KLD_Score', 'MDL_norm'])

# pearson correlation
corr = df['Size_GiB'].corr(df['MDL_norm'])
print(f'Pearson(Size, MDL_norm) = {corr:.6f}')

# spearman (rank) correlation
scorr = df['Size_GiB'].corr(df['MDL_norm'], method='spearman')
print(f'Spearman(Size, MDL_norm) = {scorr:.6f}')

# linear regression r-squared
from numpy.polynomial.polynomial import polyfit
b, m = polyfit(df['Size_GiB'].values, df['MDL_norm'].values, 1)
predicted = b + m * df['Size_GiB'].values
ss_res = np.sum((df['MDL_norm'].values - predicted) ** 2)
ss_tot = np.sum((df['MDL_norm'].values - np.mean(df['MDL_norm'].values)) ** 2)
r2 = 1 - ss_res / ss_tot
print(f'R²(Size → MDL_norm) = {r2:.6f}')
print(f'MDL_norm ≈ {m:.3f} × Size + {b:.3f}')

# range check
print(f'Size range: {df[\"Size_GiB\"].min():.1f} - {df[\"Size_GiB\"].max():.1f} GiB')
print(f'MDL range: {df[\"MDL_norm\"].min():.1f} - {df[\"MDL_norm\"].max():.1f}')
print(f'PPL range: {df[\"PPL_Score\"].min():.4f} - {df[\"PPL_Score\"].max():.4f}')
print(f'log2(PPL) range: {np.log2(df[\"PPL_Score\"].min()):.3f} - {np.log2(df[\"PPL_Score\"].max()):.3f}')
" 2>&1

Pearson(Size, MDL_norm) = 1.000000
Spearman(Size, MDL_norm) = 0.999256
R²(Size → MDL_norm) = 1.000000
MDL_norm ≈ 7.998 × Size + 2.942
Size range: 16.4 - 34.4 GiB
MDL range: 134.2 - 277.9
PPL range: 7.3397 - 7.7884
log2(PPL) range: 2.876 - 2.961


MDL_norm ≈ 8×Size + 2.9 because log2(PPL) is nearly constant across all quants (range 2.88-2.96), so for this model MDL_norm carries almost no information beyond size. I attribute this to the model being a very sparse MoE: only 8 of 256 routed experts fire per token, and about 3B of the 35B parameters are active (8.6% activation ratio). A less sparse model like good ol' Mixtral at 25% would likely see more variance, and a dense model would likely show a much wider PPL spread.
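
To put a number on "almost no information beyond size", here is a quick back-of-the-envelope check using only the ranges printed above. It assumes the fitted form MDL_norm ≈ 8 × Size + log2(PPL), i.e. that the size term contributes 8 units per GiB, which is what the regression suggests rather than something I verified in the script.

```python
import math

# Ranges copied from the output above.
size_min, size_max = 16.4, 34.4      # GiB
ppl_min, ppl_max = 7.3397, 7.7884

# How much each term can move MDL_norm across the whole set of quants.
size_term_spread = 8 * (size_max - size_min)
error_term_spread = math.log2(ppl_max) - math.log2(ppl_min)

# Fraction of the total possible variation that comes from prediction quality.
share = error_term_spread / (size_term_spread + error_term_spread)

print(f"size term spread:  {size_term_spread:.1f}")
print(f"error term spread: {error_term_spread:.3f}")
print(f"error share:       {share:.2%}")
```

The error term moves MDL_norm by less than 0.1 across all quants while the size term moves it by about 144, which is why the Pearson correlation rounds to 1.000000.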