---
language:
- en
tags:
- gguf
- quantized
- moe
- gutenberg
---

# Qwen3.5-397B-A17B REAP35 — Gutenberg Quants

GGUF quantizations of Qwen3.5-397B-A17B after REAP35 expert pruning (333 of 512 experts kept per layer), produced with the Gutenberg (Q_K_G) quantization strategy.

## Available Quants

| Quant | Size | BPW | Mean KLD | Same Top Token | Description |
|-------|------|-----|----------|----------------|-------------|
| Q4_K_G | 145 GiB | ~4.6 | 0.00729 | 95.05% | Matches Q5_K_M quality at Q4_K_M size |
| Q3_K_G | 117 GiB | ~3.8 | 0.01229 | 93.93% | Matches Q4_K_M quality at 21% smaller size |
| IQ2_XS_G | 87 GiB | ~2.8 | 0.02922 | 91.20% | Beats Q3_K_M quality at 25% smaller size |
| IQ2_XXS_G | 81 GiB | ~2.6 | 0.03776 | 90.20% | Edges out Q3_K_M quality at 30% smaller size |

KLD is measured against a Q6_K reference at a context length of 32768 tokens, over 10 chunks.
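For readers unfamiliar with the two metrics, here is a minimal sketch of how mean KLD and the same-top-token rate can be computed from per-token logits. It assumes you already have matching logit rows from the reference and the quantized model; `compare_logits` and its inputs are illustrative names, not part of any tool used to produce the numbers above.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one token position.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(ref_logits, test_logits):
    # KL(P_ref || P_test) for a single token position.
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(test_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def argmax(row):
    return max(range(len(row)), key=row.__getitem__)

def compare_logits(ref_rows, test_rows):
    # ref_rows / test_rows: one list of vocabulary logits per token position.
    klds = [kl_divergence(r, t) for r, t in zip(ref_rows, test_rows)]
    same = sum(argmax(r) == argmax(t) for r, t in zip(ref_rows, test_rows))
    return sum(klds) / len(klds), same / len(klds)
```

A lower mean KLD means the quantized model's token distribution stays closer to the reference; the same-top-token rate is the fraction of positions where both models pick the same most likely token.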

## Comparison to Standard Quants

| Quant | Size | Mean KLD | Same Top Token |
|-------|------|----------|----------------|
| Q5_K_M | 173 GiB | 0.00713 | 95.01% |
| Q4_K_G | 145 GiB | 0.00729 | 95.05% |
| Q4_K_M | 148 GiB | 0.01290 | 93.88% |
| Q3_K_G | 117 GiB | 0.01229 | 93.93% |
| Q3_K_M | 116 GiB | 0.03793 | 89.53% |
| IQ2_XS_G | 87 GiB | 0.02922 | 91.20% |
| Q2_K_M | 89 GiB | 0.10034 | 82.73% |
| IQ2_XXS_G | 81 GiB | 0.03776 | 90.20% |

Q3_K_G has about 3.1x lower mean KLD than Q3_K_M at nearly the same size (117 GiB vs 116 GiB). Q4_K_G matches Q5_K_M quality (0.00729 vs 0.00713 mean KLD) while being 28 GiB smaller.
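The headline ratios can be rechecked directly from the table. The snippet below only restates the sizes and mean KLD values listed above; the variable and function names are illustrative.

```python
# (size in GiB, mean KLD) copied from the comparison table
QUANTS = {
    "Q5_K_M": (173, 0.00713),
    "Q4_K_G": (145, 0.00729),
    "Q4_K_M": (148, 0.01290),
    "Q3_K_G": (117, 0.01229),
    "Q3_K_M": (116, 0.03793),
}

def kld_ratio(baseline, candidate):
    # How many times lower the candidate's mean KLD is than the baseline's.
    return QUANTS[baseline][1] / QUANTS[candidate][1]

def size_saving(baseline, candidate):
    # Fraction of the baseline's size saved by the candidate.
    base, cand = QUANTS[baseline][0], QUANTS[candidate][0]
    return (base - cand) / base

q3_ratio = kld_ratio("Q3_K_M", "Q3_K_G")                # Q3_K_G vs Q3_K_M
q4_gap_gib = QUANTS["Q5_K_M"][0] - QUANTS["Q4_K_G"][0]  # Q4_K_G vs Q5_K_M
q3_saving = size_saving("Q4_K_M", "Q3_K_G")             # Q3_K_G vs Q4_K_M

print(f"{q3_ratio:.2f}x, {q4_gap_gib} GiB, {q3_saving:.1%}")
```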

## What is the Gutenberg Strategy?

Gutenberg (Q_K_G) is a data-driven quantization method that allocates bit precision based on measured per-tensor KL-divergence sensitivity rather than fixed, uniform rules. A sensitivity scan identifies which tensors have the most impact on output quality; those are kept at higher precision while the rest are quantized aggressively. Non-expert tensors (attention, shared experts, SSM, embeddings) are kept at Q8_0 because they have a disproportionate impact on quality relative to their small size.
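The idea can be illustrated with a small sketch. This is not the actual Gutenberg implementation: the two-tier split, the `keep_fraction` parameter, the `high_type`/`low_type` choices, and the name markers used to detect non-expert tensors are all assumptions made for illustration.

```python
import math

# Hypothetical substrings used to recognise non-expert tensors by name.
NON_EXPERT_MARKERS = ("attn", "shexp", "ssm", "token_embd")

def allocate_precision(sensitivity, high_type="Q6_K", low_type="Q3_K",
                       keep_fraction=0.25):
    """Map each tensor name to a quant type.

    sensitivity: dict of tensor name -> measured KLD impact (higher = more
    sensitive). Non-expert tensors always get Q8_0; the most sensitive
    expert tensors get high_type, the rest low_type.
    """
    plan, expert_scores = {}, {}
    for name, score in sensitivity.items():
        if any(marker in name for marker in NON_EXPERT_MARKERS):
            plan[name] = "Q8_0"
        else:
            expert_scores[name] = score

    # Rank expert tensors from most to least sensitive.
    ranked = sorted(expert_scores, key=expert_scores.get, reverse=True)
    n_high = math.ceil(len(ranked) * keep_fraction)
    for i, name in enumerate(ranked):
        plan[name] = high_type if i < n_high else low_type
    return plan
```

Given a dictionary of measured sensitivities, the returned plan could then be turned into per-tensor overrides for whatever quantization tool is in use.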

## REAP Expert Pruning

These models use REAP35 pruning: 179 of 512 experts are removed per layer (35% pruning), selected by imatrix activation scores. This reduces model size while keeping inference stable. REAP35 is the highest pruning level this model tolerates before quality degradation becomes noticeable.
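As a rough illustration of the selection step, the sketch below drops the lowest-scoring experts in a single layer. It assumes one scalar activation score per expert; how the real scores are derived from the imatrix is not described here, and `select_experts` is a hypothetical helper.

```python
def select_experts(scores, n_remove):
    """Return the indices of experts to keep, in their original order.

    scores: one activation score per expert in a layer (higher = more used).
    n_remove: number of lowest-scoring experts to drop.
    """
    # Rank expert indices from lowest to highest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i])
    removed = set(ranked[:n_remove])
    return [i for i in range(len(scores)) if i not in removed]

# REAP35 on a 512-expert layer: remove 179, keep 333.
example_scores = [(i * 37) % 512 for i in range(512)]  # placeholder scores
kept = select_experts(example_scores, n_remove=179)
print(len(kept), f"{179 / 512:.1%} pruned")
```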

## Compatibility

Fully compatible with stock llama.cpp, llama-server, LM Studio, and any GGUF-compatible runtime. No custom builds required.