How to use from
Lemonade
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF:Q3_K_M
Run and chat with the model
lemonade run user.Qwen3.6-35B-A3B-Cerebellum-GGUF-Q3_K_M
List all available models
lemonade list
Quick Links

Cerebellum

Qwen 3.6 35B-A3B — Cerebellum GGUF

Sensitivity-guided mixed-precision quantization of Qwen/Qwen3.6-35B-A3B. Cerebellum measures which weight groups survive extreme compression and which don't, then writes a single GGUF with per-tensor precision assignments — a standard GGUF that runs on stock llama.cpp, no fork.

Variant File Size BPW Best for
14 GB (recommended) Qwen3.6-35B-A3B-Cerebellum-14GB.gguf 14.0 GB 3.34 best coding, 160K+ context
v3 (smallest) Qwen3.6-35B-A3B-Cerebellum-v3-Q3_K_M.gguf 11 GB 2.76 tightest VRAM, vision

Evaluations

Coding — upstream EvalPlus (evalplus.codegen against llama-server, greedy / temp 0, n=164), same protocol across the size ladder:

build size HumanEval HumanEval+
micro 11.96 GB 90.9 87.2
14 GB (recommended) 14.0 GB 93.3 90.2
uniform Q3_K_M 16.0 GB 91.5 89.0
Base 17.3 GB 92.7 89.0

Long-context: needle recall passes to 90K+ (verify-stress). Throughput: ~168 tok/s decode (3B-active MoE); fits 160K+ context at ~19 GB on a 24 GB card. Per-question artifacts in benchmark_results/14gb/.

Why the 14 GB over v3

v3 (11 GB) is the tightest-VRAM build. The 14 GB spends ~3 GB more to promote the routed ffn_down_exps to Q4_K — the group the ablation identifies as where the quality lives — and that gives it the family's best coding plus 160K+ context headroom. It posts above the 16 GB uniform Q3_K_M (−2 GB) and matches the 17.3 GB Base (−3.3 GB): the Base's extra promotions buy ~0 coding, so 14 GB is the efficient point. Pick v3 only when VRAM is tight or you need the vision projector.

Usage

# 14 GB (recommended)
llama-server -m Qwen3.6-35B-A3B-Cerebellum-14GB.gguf -ngl 99 -fa on --reasoning off

# v3 (smallest, with vision)
llama-server -m Qwen3.6-35B-A3B-Cerebellum-v3-Q3_K_M.gguf --mmproj mmproj-F16.gguf -ngl 99 -c 8192

Files

File Size Notes
Qwen3.6-35B-A3B-Cerebellum-14GB.gguf 14 GB recommended — best coding, 160K+ ctx
Qwen3.6-35B-A3B-Cerebellum-v3-Q3_K_M.gguf 11 GB smallest; vision (with mmproj)
mmproj-F16.gguf 858 MB vision projector (F16)
benchmark_results/ per-question evaluation artifacts
ablation/ ablation logs + tensor override maps

Methodology

Built with Cerebellum — sensitivity-guided mixed-precision quantization: crush each tensor group, measure the impact, allocate precision under a size budget, output a plain GGUF. imatrix-calibrated. Quantized by @deucebucket.

Independent records

This line has a recorded data point in club-3090's BENCHMARKS (author-rig numbers from a full report.sh --full chain). The same report corrected their engine-support table for this model (issue #390, PR #393). Numbers there are author-reported, not club-validated.

Downloads last month
2,138
GGUF
Model size
35B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

3-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for deucebucket/Qwen3.6-35B-A3B-Cerebellum-GGUF

Quantized
(489)
this model

Evaluation results