Evaluated on eval_dataset_260428-0143.txt
141 chunks, n_ctx=512, batch_size=2048, n_seq=4

Windows 11 26200.8246
Nvidia drivers 596.21
ik_llama.cpp https://github.com/Thireus/ik_llama.cpp/releases/tag/main-b4660-04d5ead
Windows x64 (CUDA 12.8) AVX2 - With Libraries

----

Perplexity (PPL)
Measures how surprised the model is by a dataset (prediction quality). Lower is better. It can be converted into bits per token, which is used for MDL.

Kullback–Leibler Divergence (KLD)
Measures how much the probability distribution of outputs changed vs the original unquantized (BF16) model (faithfulness). Lower is better.

Minimum Description Length (MDL)
Total cost to represent the dataset using the model (efficiency):

☝️🤓 Error cost = log₂(PPL) × N ☝️🤓

Simply put, MDL = Model cost (bits) + Error cost (bits), where:
- Model cost: size of the quantized weights (bits)
- Error cost: extra bits needed due to imperfect predictions (computed from PPL as bits per token) over a lifetime of N tokens (I picked 1 billion tokens).

A quantization scheme optimized for low PPL may still shift the probability distribution, causing a higher KLD even if perplexity looks fine (rare but not impossible). So a good quant has both a low MDL (efficient) and a low KLD (faithful).

Points on the frontier are "Pareto-optimal": no other quant in the dataset beats them on both axes simultaneously. The others are either larger, less faithful, or both. Choose a point on this line based on your VRAM and how much fidelity you want. There is no single "best" quant; the right choice depends on whether you prioritize efficiency or fidelity.

Note: MDL ranking is context-dependent, which is why I share my pseudo-randomly generated dataset for reproducibility (141 chunks at -c 512, generated from eaddario/imatrix-calibration on HF).

----

download script

mkdir -p ~/Desktop/backup/models/quants/repo
cd ~/Desktop/backup/models/quants/repo
hf download --local-dir . \
  --exclude "*.png" --exclude README.md --exclude .gitattributes \
  --exclude "mmproj*" --exclude "*Q1*.gguf" --exclude "*Q2*.gguf" \
  --exclude "*Q3*.gguf" --exclude "*F16*.gguf" --exclude "*imatrix*" \
  --exclude "*assets*" repo/model-GGUF

test script

cd C:\Users\Windows\Desktop\backup\references\kld-sweep
python .\kld_sweep.py --exe C:\Users\Windows\Desktop\backup\ik\llama-perplexity.exe --baseline C:\Users\Windows\Desktop\backup\models\baseline\Qwen_Qwen3.6-35B-A3B-bf16\Qwen_Qwen3.6-35B-A3B-bf16-00001-of-00002.gguf --quants C:\Users\Windows\Desktop\backup\models\quants\ --dataset C:\Users\Windows\Desktop\backup\eval_dataset_260428-0143.txt --output C:\Users\Windows\Desktop\backup\output\ --args="-t 7 -c 512 -ngl 99 -ncmoe 32 --no-mmap" --model-name Qwen3.6-35B-A3B --logit C:\Users\Windows\Desktop\backup\output\Qwen3.6-35B-A3B-logits.bin

----

MDL vs Size:

$ python3 -c "
import pandas as pd
import numpy as np
df = pd.read_csv('/mnt/c/Users/Windows/Desktop/backup/output/Qwen3.6-35B-A3B_results.csv')
df = df[df['PPL_Score'] != 'ERROR']
df['Size_GiB'] = pd.to_numeric(df['Size_GiB'], errors='coerce')
df['PPL_Score'] = pd.to_numeric(df['PPL_Score'], errors='coerce')
df['KLD_Score'] = pd.to_numeric(df['KLD_Score'], errors='coerce')
df['MDL_norm'] = pd.to_numeric(df['MDL_norm'], errors='coerce')
df = df.dropna(subset=['Size_GiB', 'PPL_Score', 'KLD_Score', 'MDL_norm'])
# pearson correlation
corr = df['Size_GiB'].corr(df['MDL_norm'])
print(f'Pearson(Size, MDL_norm) = {corr:.6f}')
# spearman (rank) correlation
scorr = df['Size_GiB'].corr(df['MDL_norm'], method='spearman')
print(f'Spearman(Size, MDL_norm) = {scorr:.6f}')
# linear regression r-squared
from numpy.polynomial.polynomial import polyfit
b, m = polyfit(df['Size_GiB'].values, df['MDL_norm'].values, 1)
predicted = b + m * df['Size_GiB'].values
ss_res = np.sum((df['MDL_norm'].values - predicted) ** 2)
ss_tot = np.sum((df['MDL_norm'].values - np.mean(df['MDL_norm'].values)) ** 2)
r2 = 1 - ss_res / ss_tot
print(f'R²(Size → MDL_norm) = {r2:.6f}')
print(f'MDL_norm ≈ {m:.3f} × Size + {b:.3f}')
# range check
print(f'Size range: {df[\"Size_GiB\"].min():.1f} - {df[\"Size_GiB\"].max():.1f} GiB')
print(f'MDL range: {df[\"MDL_norm\"].min():.1f} - {df[\"MDL_norm\"].max():.1f}')
print(f'PPL range: {df[\"PPL_Score\"].min():.4f} - {df[\"PPL_Score\"].max():.4f}')
print(f'log2(PPL) range: {np.log2(df[\"PPL_Score\"].min()):.3f} - {np.log2(df[\"PPL_Score\"].max()):.3f}')
" 2>&1
Pearson(Size, MDL_norm) = 1.000000
Spearman(Size, MDL_norm) = 0.999256
R²(Size → MDL_norm) = 1.000000
MDL_norm ≈ 7.998 × Size + 2.942
Size range: 16.4 - 34.4 GiB
MDL range: 134.2 - 277.9
PPL range: 7.3397 - 7.7884
log2(PPL) range: 2.876 - 2.961

MDL_norm ≈ 8×Size + 2.9 because log2(PPL) is nearly constant across all quants (range 2.88-2.96), so MDL_norm carries little information beyond size for this model. That is probably because it's an MoE with only 8 of 256 routed experts active per token (about 3B of 35B parameters, an 8.6% activation ratio). A less sparse MoE like good ol' Mixtral at 25% (2 of 8 experts) would likely see more variance, and a dense model would likely show a much wider PPL spread.
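To make the MDL arithmetic and the frontier idea concrete, here is a minimal Python sketch. It is not taken from kld_sweep.py: the helper names `mdl_bits` and `pareto_front` are hypothetical, the exact normalization behind the `MDL_norm` column is not shown in this post (so the figures here need not match that column), pairing the smallest size with the highest PPL is an assumption, and the toy KLD values are made up purely for illustration.

```python
import math

BITS_PER_GIB = 8 * 2**30  # 1 GiB = 2^30 bytes, 8 bits per byte


def mdl_bits(size_gib, ppl, n_tokens=1e9):
    """Return (model_cost, error_cost, total) in bits.

    model_cost: size of the quantized weights in bits.
    error_cost: log2(PPL) bits per token over n_tokens tokens.
    """
    model_cost = size_gib * BITS_PER_GIB
    error_cost = math.log2(ppl) * n_tokens
    return model_cost, error_cost, model_cost + error_cost


def pareto_front(points):
    """Names of (name, mdl, kld) points not dominated by any other point.

    Lower is better on both axes. A point is dominated when another point is
    no worse on both axes and strictly better on at least one.
    """
    front = []
    for name, mdl, kld in points:
        dominated = any(
            o_mdl <= mdl and o_kld <= kld and (o_mdl < mdl or o_kld < kld)
            for _, o_mdl, o_kld in points
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    # Range extremes from the output above; pairing them is an assumption.
    for size_gib, ppl in [(16.4, 7.7884), (34.4, 7.3397)]:
        model, err, total = mdl_bits(size_gib, ppl)
        print(f"{size_gib:5.1f} GiB: model {model / 1e9:6.1f} Gbit, "
              f"error {err / 1e9:.3f} Gbit, total {total / 1e9:6.1f} Gbit")

    # Hypothetical (name, MDL, KLD) points, NOT measured values.
    toy = [
        ("small", 140.0, 0.050),
        ("mid", 200.0, 0.020),
        ("mid_bad", 210.0, 0.030),  # larger and less faithful than "mid"
        ("large", 278.0, 0.005),
    ]
    print(pareto_front(toy))
```

With N = 1 billion tokens the error term comes to roughly 3 Gbit, against well over 100 Gbit of weights even for the smallest quant, which is why size dominates the total. A larger N would give the error term more weight.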