Cerebellum

Qwen3-30B-A3B — Cerebellum GGUF

Ablation-guided mixed-precision quantization of Qwen/Qwen3-30B-A3B. 30B total parameters, 3B active (MoE with 128 experts, 8 active per token).

What is Cerebellum?

Instead of uniform quantization, we measure which weight groups survive aggressive compression and which don't. Groups that tolerate Q2_K get demoted; groups that don't stay at Q3_K_M or higher. The result: smaller files with less quality loss than uniform quants of the same size.

Files

File Size Description
Qwen3-30B-A3B-Cerebellum-v3-Q3_K_M.gguf 12 GB Recommended — coder-optimized imatrix, improved allocation
Qwen3-30B-A3B-Cerebellum-v2-Q3_K_M.gguf 12 GB Previous release
Qwen3-30B-A3B-Cerebellum-v1-Q3_K_M.gguf 9.6 GB Maximum compression — all expert groups at Q2_K

Benchmarks

Evaluated using our standardized benchmark suite (ARC-Challenge, HellaSwag, MMLU-Redux, HumanEval+) with temperature=0, no thinking mode.

The model-index metadata in this card's frontmatter mirrors the recommended v3 build, measured with the local llama.cpp benchmark harness on RTX 3090. Full per-question artifacts are in benchmark_results/.

Cerebellum v3 (12 GB) — Recommended

Benchmark Score Questions
ARC-Challenge 92.7% 1,172
HellaSwag 83.8% 10,042
MMLU-Redux 66.6% 2,400
HumanEval+ (base) 75.0% 164
HumanEval+ (plus) 70.7% 164

Size vs Quality

Model Size ARC HellaSwag MMLU HumanEval
Q3_K_M (baseline) 14 GB
Cerebellum v3 12 GB 92.7% 83.8% 66.6%† 75.0%‡
Cerebellum v2 12 GB 90.5% 80.3% 69.9%* 72.0%
Cerebellum v1 9.6 GB

* v2 MMLU = full MMLU (11,643 questions); v3 = MMLU-Redux (2,400 questions, harder subset)
‡ v3 HumanEval+ uses EvalPlus (stricter test cases than original HumanEval)

v3 improves +2.2 to +3.5 points over v2 on directly comparable benchmarks (ARC, HellaSwag).

Methodology

  1. Group ablation: Demote each of 7 expert weight groups (attn_k, attn_q, attn_v, ffn_down_exps, ffn_gate_exps, ffn_up_exps, output) to Q2_K individually. Measure PPL impact.
  2. Reverse ablation: From an all-Q2_K baseline, promote one group back to Q3_K_M. Measure PPL recovery.
  3. Build optimal mix: Groups with the best recovery-per-byte get promoted. Groups that survive Q2_K stay demoted.

Override Maps

v3 (coder imatrix) — Sacred (kept at Q3_K_M): attn_q, attn_v, ffn_down_exps
v3 — Demoted (Q2_K): attn_k, attn_output, ffn_gate_exps, ffn_up_exps

v2 (wiki imatrix) — Sacred (kept at Q3_K_M): attn_v, ffn_down_exps, output
v2 — Demoted (Q2_K): attn_k, attn_q, ffn_gate_exps, ffn_up_exps

All 48 layers treated uniformly (no per-layer variation needed for this model).

The key difference: v3 keeps attn_q at higher precision (swap with attn_output vs v2), driven by the coder-focused imatrix finding query projections more sensitive to quantization in code-focused inference.

Usage

Works with any llama.cpp-compatible tool:

# llama.cpp
./llama-server --model Qwen3-30B-A3B-Cerebellum-v3-Q3_K_M.gguf -ngl 99 --ctx-size 4096

# Ollama (create Modelfile pointing to the GGUF)
# LM Studio (drag and drop)
# koboldcpp, text-generation-webui, etc.

Hardware Requirements

  • v3 / v2 (12 GB): Fits in 16 GB VRAM with room for context. RTX 4060 Ti 16GB, RTX 3090, etc.
  • v1 (9.6 GB): Fits in 12 GB VRAM. RTX 4070, RTX 3060 12GB, etc.

Credits

Quantized with Cerebellum — ablation-guided mixed-precision quantization by deucebucket.
Base model by Qwen.

Downloads last month
838
GGUF
Model size
31B params
Architecture
qwen3moe
Hardware compatibility
Log In to add your hardware

3-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF

Quantized
(119)
this model

Evaluation results