Instructions to use deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF", filename="Qwen3-30B-A3B-Cerebellum-v1-Q3_K_M.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M # Run inference directly in the terminal: llama-cli -hf deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M # Run inference directly in the terminal: llama-cli -hf deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M # Run inference directly in the terminal: ./llama-cli -hf deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M
Use Docker
docker model run hf.co/deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M
- LM Studio
- Jan
- Ollama
How to use deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF with Ollama:
ollama run hf.co/deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M
- Unsloth Studio
How to use deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF to start chatting
- Pi
How to use deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M
Run Hermes
hermes
- Atomic Chat new
- Docker Model Runner
How to use deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF with Docker Model Runner:
docker model run hf.co/deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M
- Lemonade
How to use deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF:Q3_K_M
Run and chat with the model
lemonade run user.Qwen3-30B-A3B-Cerebellum-GGUF-Q3_K_M
List all available models
lemonade list
Qwen3-30B-A3B — Cerebellum GGUF
Ablation-guided mixed-precision quantization of Qwen/Qwen3-30B-A3B. 30B total parameters, 3B active (MoE with 128 experts, 8 active per token).
What is Cerebellum?
Instead of uniform quantization, we measure which weight groups survive aggressive compression and which don't. Groups that tolerate Q2_K get demoted; groups that don't stay at Q3_K_M or higher. The result: smaller files with less quality loss than uniform quants of the same size.
Files
| File | Size | Description |
|---|---|---|
Qwen3-30B-A3B-Cerebellum-v3-Q3_K_M.gguf |
12 GB | Recommended — coder-optimized imatrix, improved allocation |
Qwen3-30B-A3B-Cerebellum-v2-Q3_K_M.gguf |
12 GB | Previous release |
Qwen3-30B-A3B-Cerebellum-v1-Q3_K_M.gguf |
9.6 GB | Maximum compression — all expert groups at Q2_K |
Benchmarks
Evaluated using our standardized benchmark suite (ARC-Challenge, HellaSwag, MMLU-Redux, HumanEval+) with temperature=0, no thinking mode.
The model-index metadata in this card's frontmatter mirrors the recommended v3 build, measured with the local llama.cpp benchmark harness on RTX 3090. Full per-question artifacts are in benchmark_results/.
Cerebellum v3 (12 GB) — Recommended
| Benchmark | Score | Questions |
|---|---|---|
| ARC-Challenge | 92.7% | 1,172 |
| HellaSwag | 83.8% | 10,042 |
| MMLU-Redux | 66.6% | 2,400 |
| HumanEval+ (base) | 75.0% | 164 |
| HumanEval+ (plus) | 70.7% | 164 |
Size vs Quality
| Model | Size | ARC | HellaSwag | MMLU | HumanEval |
|---|---|---|---|---|---|
| Q3_K_M (baseline) | 14 GB | — | — | — | — |
| Cerebellum v3 | 12 GB | 92.7% | 83.8% | 66.6%† | 75.0%‡ |
| Cerebellum v2 | 12 GB | 90.5% | 80.3% | 69.9%* | 72.0% |
| Cerebellum v1 | 9.6 GB | — | — | — | — |
* v2 MMLU = full MMLU (11,643 questions); v3 = MMLU-Redux (2,400 questions, harder subset)
‡ v3 HumanEval+ uses EvalPlus (stricter test cases than original HumanEval)
v3 improves +2.2 to +3.5 points over v2 on directly comparable benchmarks (ARC, HellaSwag).
Methodology
- Group ablation: Demote each of 7 expert weight groups (attn_k, attn_q, attn_v, ffn_down_exps, ffn_gate_exps, ffn_up_exps, output) to Q2_K individually. Measure PPL impact.
- Reverse ablation: From an all-Q2_K baseline, promote one group back to Q3_K_M. Measure PPL recovery.
- Build optimal mix: Groups with the best recovery-per-byte get promoted. Groups that survive Q2_K stay demoted.
Override Maps
v3 (coder imatrix) — Sacred (kept at Q3_K_M): attn_q, attn_v, ffn_down_exps
v3 — Demoted (Q2_K): attn_k, attn_output, ffn_gate_exps, ffn_up_exps
v2 (wiki imatrix) — Sacred (kept at Q3_K_M): attn_v, ffn_down_exps, output
v2 — Demoted (Q2_K): attn_k, attn_q, ffn_gate_exps, ffn_up_exps
All 48 layers treated uniformly (no per-layer variation needed for this model).
The key difference: v3 keeps attn_q at higher precision (swap with attn_output vs v2), driven by the coder-focused imatrix finding query projections more sensitive to quantization in code-focused inference.
Usage
Works with any llama.cpp-compatible tool:
# llama.cpp
./llama-server --model Qwen3-30B-A3B-Cerebellum-v3-Q3_K_M.gguf -ngl 99 --ctx-size 4096
# Ollama (create Modelfile pointing to the GGUF)
# LM Studio (drag and drop)
# koboldcpp, text-generation-webui, etc.
Hardware Requirements
- v3 / v2 (12 GB): Fits in 16 GB VRAM with room for context. RTX 4060 Ti 16GB, RTX 3090, etc.
- v1 (9.6 GB): Fits in 12 GB VRAM. RTX 4070, RTX 3060 12GB, etc.
Credits
Quantized with Cerebellum — ablation-guided mixed-precision quantization by deucebucket.
Base model by Qwen.
- Downloads last month
- 838
3-bit
Model tree for deucebucket/Qwen3-30B-A3B-Cerebellum-GGUF
Evaluation results
- normalized accuracy on AI2 Reasoning Challengetest set Local benchmark run (RTX 3090, llama.cpp)0.927
- accuracy on HellaSwagvalidation set Local benchmark run (RTX 3090, llama.cpp)0.838
- accuracy on MMLU-Reduxtest set Local benchmark run (RTX 3090, llama.cpp)0.666
- pass@1 on HumanEval+ (pass@1)test set Local benchmark run (RTX 3090, llama.cpp)0.707