---
language:
- en
tags:
- gguf
- quantized
- moe
- gutenberg
---

# Qwen3.5-397B-A17B REAP35: Gutenberg Quants

REAP35 expert-pruned quantizations of Qwen3.5-397B-A17B (333 of 512 experts retained per layer), produced with the Gutenberg (Q_K_G) quantization strategy.

## Available Quants

| Quant | Size | BPW | Mean KLD | Same Top Token | Description |
|-------|------|-----|----------|----------------|-------------|
| Q4_K_G | 145 GiB | ~4.6 | 0.00729 | 95.05% | Matches Q5_K_M quality at Q4_K_M size |
| Q3_K_G | 117 GiB | ~3.8 | 0.01229 | 93.93% | Matches Q4_K_M quality at 21% smaller size |
| IQ2_XS_G | 87 GiB | ~2.8 | 0.02922 | 91.20% | Beats Q3_K_M quality at 25% smaller size |
| IQ2_XXS_G | 81 GiB | ~2.6 | 0.03776 | 90.20% | Beats Q3_K_M quality at 30% smaller size |

BPW is bits per weight. KLD is measured against a Q6_K reference at a context length of 32768 over 10 chunks.

## Comparison to Standard Quants

| Quant | Size | Mean KLD | Same Top Token |
|-------|------|----------|----------------|
| Q5_K_M | 173 GiB | 0.00713 | 95.01% |
| Q4_K_G | 145 GiB | 0.00729 | 95.05% |
| Q4_K_M | 148 GiB | 0.01290 | 93.88% |
| Q3_K_G | 117 GiB | 0.01229 | 93.93% |
| Q3_K_M | 116 GiB | 0.03793 | 89.53% |
| IQ2_XS_G | 87 GiB | 0.02922 | 91.20% |
| Q2_K_M | 89 GiB | 0.10034 | 82.73% |
| IQ2_XXS_G | 81 GiB | 0.03776 | 90.20% |

Q3_K_G has about 3.1x lower mean KLD than Q3_K_M (0.01229 vs. 0.03793) at nearly the same size (117 vs. 116 GiB). Q4_K_G roughly matches Q5_K_M quality (0.00729 vs. 0.00713 mean KLD) while being 28 GiB smaller.

## What is the Gutenberg Strategy?

Gutenberg (Q_K_G) is a data-driven quantization method that allocates bit precision based on measured per-tensor KL-divergence sensitivity rather than uniform rules. A sensitivity scan identifies which tensors have the most impact on output quality. Those tensors are preserved at higher precision, while the rest are quantized aggressively.

Non-expert tensors (attention, shared experts, SSM, embeddings) are kept at Q8_0 because they have a disproportionate impact on quality relative to their small size.
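
The allocation idea described above can be illustrated with a small sketch. This is not the actual Gutenberg tooling: the `Tensor` record, the `allocate` function, the greedy upgrade rule, the tensor names, and the sensitivity values are all hypothetical, and the bits-per-weight figures are approximate values for llama.cpp quant types. It only shows the general shape of "pin non-expert tensors at Q8_0, then spend the remaining bit budget on the most sensitive expert tensors".

```python
from dataclasses import dataclass

# Approximate bits per weight for a few llama.cpp quant types (assumed values).
BPW = {"Q8_0": 8.5, "Q5_K": 5.5, "Q4_K": 4.5, "Q3_K": 3.4375}


@dataclass
class Tensor:
    name: str           # illustrative tensor name
    n_params: int       # number of weights in the tensor
    sensitivity: float  # hypothetical KLD increase when quantized aggressively
    is_expert: bool     # True for routed-expert tensors


def allocate(tensors, low="Q3_K", high="Q5_K", budget_bpw=3.8):
    """Greedy sketch: return ({name: quant type}, average bpw actually used)."""
    total = sum(t.n_params for t in tensors)
    budget_bits = budget_bpw * total

    # Non-expert tensors are pinned at Q8_0; expert tensors start at `low`.
    plan = {t.name: (low if t.is_expert else "Q8_0") for t in tensors}
    used = sum(BPW[plan[t.name]] * t.n_params for t in tensors)

    # Upgrade expert tensors in order of sensitivity per parameter
    # for as long as the upgrade still fits in the bit budget.
    experts = sorted(
        (t for t in tensors if t.is_expert),
        key=lambda t: t.sensitivity / t.n_params,
        reverse=True,
    )
    for t in experts:
        extra = (BPW[high] - BPW[low]) * t.n_params
        if used + extra <= budget_bits:
            plan[t.name] = high
            used += extra
    return plan, used / total


# Toy input with made-up sizes and sensitivities.
tensors = [
    Tensor("attn_q", 10, 0.0, False),
    Tensor("ffn_down_exps", 100, 0.020, True),
    Tensor("ffn_up_exps", 100, 0.002, True),
    Tensor("ffn_gate_exps", 100, 0.001, True),
]
plan, avg_bpw = allocate(tensors, budget_bpw=4.4)
for name, qtype in plan.items():
    print(f"{name}: {qtype}")
print(f"average bpw: {avg_bpw:.2f}")
```

In this toy input only the most sensitive expert tensor is upgraded, since a second upgrade would exceed the budget. A real scan would measure sensitivity per tensor and per candidate type instead of using a single number.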

## REAP Expert Pruning

These models use REAP35 pruning: 179 of 512 experts are removed per layer (35% pruning), leaving 333, based on imatrix activation scores. This reduces model size while maintaining stable inference. REAP35 is the maximum safe pruning level for this model before quality degradation becomes noticeable.

## Compatibility

Fully compatible with stock llama.cpp, llama-server, LM Studio, and any GGUF-compatible runtime. No custom builds are required.
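
For readers curious about the pruning arithmetic in the REAP section, here is a minimal sketch of per-layer expert selection. It assumes each expert has a single scalar score and that the lowest-scoring experts are dropped. The `select_experts` helper and the random scores are hypothetical stand-ins; the real REAP criterion and any router adjustments are not shown.

```python
import random

N_EXPERTS = 512                 # experts per layer before pruning
N_PRUNED = 179                  # experts removed per layer
N_KEPT = N_EXPERTS - N_PRUNED   # experts retained per layer


def select_experts(scores, keep):
    """Return the sorted indices of the `keep` highest-scoring experts."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Placeholder scores for one layer; real scores would come from the imatrix.
rng = random.Random(0)
scores = [rng.random() for _ in range(N_EXPERTS)]

kept = select_experts(scores, N_KEPT)
pruned_fraction = N_PRUNED / N_EXPERTS

print(f"kept {len(kept)} of {N_EXPERTS} experts")
print(f"pruned fraction: {pruned_fraction:.1%}")
```

The selection is independent per layer, so different layers may keep different expert indices.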