Instructions to use deucebucket/Qwen3-32B-Cerebellum-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use deucebucket/Qwen3-32B-Cerebellum-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="deucebucket/Qwen3-32B-Cerebellum-GGUF", filename="Qwen3-32B-Cerebellum-v1.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use deucebucket/Qwen3-32B-Cerebellum-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf deucebucket/Qwen3-32B-Cerebellum-GGUF # Run inference directly in the terminal: llama-cli -hf deucebucket/Qwen3-32B-Cerebellum-GGUF
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf deucebucket/Qwen3-32B-Cerebellum-GGUF # Run inference directly in the terminal: llama-cli -hf deucebucket/Qwen3-32B-Cerebellum-GGUF
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf deucebucket/Qwen3-32B-Cerebellum-GGUF # Run inference directly in the terminal: ./llama-cli -hf deucebucket/Qwen3-32B-Cerebellum-GGUF
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf deucebucket/Qwen3-32B-Cerebellum-GGUF # Run inference directly in the terminal: ./build/bin/llama-cli -hf deucebucket/Qwen3-32B-Cerebellum-GGUF
Use Docker
docker model run hf.co/deucebucket/Qwen3-32B-Cerebellum-GGUF
- LM Studio
- Jan
- Ollama
How to use deucebucket/Qwen3-32B-Cerebellum-GGUF with Ollama:
ollama run hf.co/deucebucket/Qwen3-32B-Cerebellum-GGUF
- Unsloth Studio new
How to use deucebucket/Qwen3-32B-Cerebellum-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for deucebucket/Qwen3-32B-Cerebellum-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for deucebucket/Qwen3-32B-Cerebellum-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for deucebucket/Qwen3-32B-Cerebellum-GGUF to start chatting
- Pi new
How to use deucebucket/Qwen3-32B-Cerebellum-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf deucebucket/Qwen3-32B-Cerebellum-GGUF
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "deucebucket/Qwen3-32B-Cerebellum-GGUF" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use deucebucket/Qwen3-32B-Cerebellum-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf deucebucket/Qwen3-32B-Cerebellum-GGUF
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default deucebucket/Qwen3-32B-Cerebellum-GGUF
Run Hermes
hermes
- Docker Model Runner
How to use deucebucket/Qwen3-32B-Cerebellum-GGUF with Docker Model Runner:
docker model run hf.co/deucebucket/Qwen3-32B-Cerebellum-GGUF
- Lemonade
How to use deucebucket/Qwen3-32B-Cerebellum-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull deucebucket/Qwen3-32B-Cerebellum-GGUF
Run and chat with the model
lemonade run user.Qwen3-32B-Cerebellum-GGUF-{{QUANT_TAG}}List all available models
lemonade list
Qwen3-32B โ Cerebellum GGUF
Ablation-guided mixed-precision quantization of Qwen/Qwen3-32B. 32B parameters, dense architecture with GQA (64 heads, 8 KV heads), 64 layers.
What is Cerebellum?
Instead of uniform quantization, we measure which weight groups survive aggressive compression and which don't. Groups that tolerate Q2_K get demoted; groups that don't stay at Q3_K_M or higher. The result: smaller files with less quality loss than uniform quants of the same size.
Files
| File | Size | Description |
|---|---|---|
Qwen3-32B-Cerebellum-v2.gguf |
15 GB | Optimal mix โ 3 groups demoted (attn_k, attn_q, attn_output), 4 kept at Q3_K_M |
Qwen3-32B-Cerebellum-v1.gguf |
14 GB | Aggressive โ 5 groups demoted (all attn + ffn_gate) |
Benchmarks
Evaluated using our standardized benchmark suite with temperature=0, no thinking mode.
Cerebellum v2 (15 GB) โ Recommended
| Benchmark | Score | Questions |
|---|---|---|
| ARC-Challenge | 92.8% | 1,172 |
| HellaSwag | 87.4% | 10,042 |
| MMLU | 75.5% | 11,643 |
| HumanEval | 45.1% | 164 |
Note: HumanEval score reflects non-thinking mode. Qwen3 models perform significantly better on code with thinking enabled (/think or enable_thinking: true).
Size vs Quality
| Model | Size | BPW | PPL (wiki) |
|---|---|---|---|
| Q3_K_M (baseline) | 16 GB | 3.94 | 8.3288 |
| Cerebellum v2 | 15 GB | 3.67 | 8.3435 |
| Cerebellum v1 | 14 GB | 3.44 | 8.7273 |
v2 saves 1 GB (6%) over Q3_K_M with only +0.18% perplexity increase โ essentially lossless. The attn_k group actually improved perplexity when demoted.
Methodology
- Group ablation: Demote each of 7 weight groups to Q2_K individually. Measure PPL impact.
- Identify cheap groups: Three groups (attn_k, attn_q, attn_output) showed negligible or negative PPL impact when demoted.
- Build optimal mix: v2 demotes the 3 cheapest groups; v1 additionally demotes attn_v and ffn_gate.
Ablation Results
| Group | PPL when demoted | Delta vs baseline |
|---|---|---|
| attn_k | 8.3019 | -0.0269 (improved!) |
| attn_q | 8.3394 | +0.0106 |
| attn_output | 8.3406 | +0.0118 |
| attn_v | 8.3700 | +0.0412 |
| ffn_gate | 8.6159 | +0.2871 |
| ffn_up | 8.7267 | +0.3979 |
| ffn_down | 8.8200 | +0.4912 |
v2 Override Map
Demoted (Q2_K): attn_k, attn_q, attn_output (all 64 layers)
Sacred (kept at Q3_K_M): attn_v, ffn_gate, ffn_up, ffn_down
Usage
Works with any llama.cpp-compatible tool:
# llama.cpp
./llama-server --model Qwen3-32B-Cerebellum-v2.gguf -ngl 99 --ctx-size 4096
# Ollama (create Modelfile pointing to the GGUF)
# LM Studio (drag and drop)
# koboldcpp, text-generation-webui, etc.
Hardware Requirements
- v2 (15 GB): Fits in 24 GB VRAM with generous context. RTX 3090, RTX 4090, etc.
- v1 (14 GB): Fits in 16 GB VRAM with limited context. RTX 4060 Ti 16GB, etc.
Credits
Quantized with Cerebellum โ ablation-guided mixed-precision quantization by deucebucket.
Base model by Qwen.
- Downloads last month
- 404
We're not able to determine the quantization variants.
Model tree for deucebucket/Qwen3-32B-Cerebellum-GGUF
Base model
Qwen/Qwen3-32B
docker model run hf.co/deucebucket/Qwen3-32B-Cerebellum-GGUF