Instructions to use barozp/Qwen3.6-28B-REAP20-A3B-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use barozp/Qwen3.6-28B-REAP20-A3B-GGUF with llama-cpp-python:
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="barozp/Qwen3.6-28B-REAP20-A3B-GGUF",
    filename="Qwen3.6-28B-REAP20-A3B-BF16.gguf",
)
llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use barozp/Qwen3.6-28B-REAP20-A3B-GGUF with llama.cpp:
Install from brew
brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
Use Docker
docker model run hf.co/barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use barozp/Qwen3.6-28B-REAP20-A3B-GGUF with vLLM:
Install from pip and serve model
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "barozp/Qwen3.6-28B-REAP20-A3B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "barozp/Qwen3.6-28B-REAP20-A3B-GGUF",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ]
  }'
Use Docker
docker model run hf.co/barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
- Ollama
How to use barozp/Qwen3.6-28B-REAP20-A3B-GGUF with Ollama:
ollama run hf.co/barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
- Unsloth Studio
How to use barozp/Qwen3.6-28B-REAP20-A3B-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for barozp/Qwen3.6-28B-REAP20-A3B-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for barozp/Qwen3.6-28B-REAP20-A3B-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for barozp/Qwen3.6-28B-REAP20-A3B-GGUF to start chatting
- Pi
How to use barozp/Qwen3.6-28B-REAP20-A3B-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M"
        }
      ]
    }
  }
}
Run Pi
# Start Pi in your project directory:
pi
- Hermes Agent
How to use barozp/Qwen3.6-28B-REAP20-A3B-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
Configure Hermes
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use barozp/Qwen3.6-28B-REAP20-A3B-GGUF with Docker Model Runner:
docker model run hf.co/barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
- Lemonade
How to use barozp/Qwen3.6-28B-REAP20-A3B-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/
lemonade pull barozp/Qwen3.6-28B-REAP20-A3B-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.Qwen3.6-28B-REAP20-A3B-GGUF-Q4_K_M
List all available models
lemonade list
Qwen3.6-28B-REAP20-A3B: GGUF Quantizations
GGUF quantizations of 0xSero/Qwen3.6-28B-REAP20-A3B, a variant of Qwen/Qwen3.6-35B-A3B with 20% of its experts pruned using the REAP (Router-weighted Expert Activation Pruning) method.
Available Files
| File | Quant | Size | BPW | Description |
|---|---|---|---|---|
| Qwen3.6-28B-REAP20-A3B-BF16.gguf | BF16 | ~56.5 GB | 16.0 | Full precision, for re-quantization |
| Qwen3.6-28B-REAP20-A3B-Q8_0.gguf | Q8_0 | ~30 GB | 8.5 | Near-lossless, large file |
| Qwen3.6-28B-REAP20-A3B-Q6_K.gguf | Q6_K | ~23 GB | 6.56 | Near-lossless, recommended for high quality |
| Qwen3.6-28B-REAP20-A3B-Q5_K_M.gguf | Q5_K_M | ~20 GB | 5.68 | High quality, larger size |
| Qwen3.6-28B-REAP20-A3B-Q5_K_S.gguf | Q5_K_S | ~19 GB | 5.52 | High quality, slightly smaller |
| Qwen3.6-28B-REAP20-A3B-Q4_K_M.gguf | Q4_K_M | ~17 GB | 4.89 | Recommended: best quality/size balance |
| Qwen3.6-28B-REAP20-A3B-Q4_K_S.gguf | Q4_K_S | ~16 GB | 4.63 | 4-bit small |
| Qwen3.6-28B-REAP20-A3B-Q3_K_L.gguf | Q3_K_L | ~15 GB | 4.27 | 3-bit large |
| Qwen3.6-28B-REAP20-A3B-Q3_K_M.gguf | Q3_K_M | ~14 GB | 3.91 | 3-bit medium |
| Qwen3.6-28B-REAP20-A3B-Q3_K_S.gguf | Q3_K_S | ~13 GB | 3.66 | 3-bit small |
| Qwen3.6-28B-REAP20-A3B-IQ3_XXS.gguf | IQ3_XXS | ~12 GB | 3.06 | Ultra-small, imatrix-based |
| Qwen3.6-28B-REAP20-A3B-Q2_K.gguf | Q2_K | ~11 GB | 2.96 | Smallest size, lowest quality |
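The Size and BPW columns are related by simple arithmetic: file size is roughly the number of weights times bits per weight, divided by 8. The sketch below infers the weight count from the BF16 row (an assumption, since the table does not state it) and uses it to estimate other file sizes. Real files mix tensor types and carry metadata, so the results are only approximate.

```python
# Rough file-size estimate from bits per weight (BPW).
# Assumption: the weight count is inferred from the BF16 row
# (~56.5 GB at 16 bits per weight); it is not stated in the table.

BF16_SIZE_GB = 56.5
BF16_BPW = 16.0


def estimate_params(size_gb: float = BF16_SIZE_GB, bpw: float = BF16_BPW) -> float:
    """Return the approximate number of weights implied by a file size and BPW."""
    return size_gb * 1e9 * 8 / bpw


def estimate_size_gb(bpw: float, n_params: float) -> float:
    """Return the approximate file size in GB for a given BPW."""
    return n_params * bpw / 8 / 1e9


if __name__ == "__main__":
    n = estimate_params()
    for name, bpw in [("Q6_K", 6.56), ("Q4_K_M", 4.89), ("Q2_K", 2.96)]:
        print(f"{name}: ~{estimate_size_gb(bpw, n):.1f} GB")
```

The estimates can be compared with the table to judge whether a quant fits a given disk or memory budget before downloading.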
Model Details
| Property | Value |
|---|---|
| Architecture | Qwen3.6 MoE (hybrid Gated DeltaNet + MoE) |
| Parameters | ~28B total / ~3B active per token |
| Experts | 205 total / 8 active per token (pruned from 256) |
| Context Length | 262,144 tokens |
| Original dtype | BF16 |
| Quantization source | BF16 GGUF from 0xSero/Qwen3.6-28B-REAP20-A3B-GGUF |
| Quantization tool | llama.cpp |
| imatrix | Used for IQ3_XXS (from source repo) |
| License | Apache 2.0 |
Quantization Process
# 1. Download BF16 GGUF from source
huggingface-cli download 0xSero/Qwen3.6-28B-REAP20-A3B-GGUF \
--include "model.bf16.gguf" --local-dir ./
# 2. Download imatrix (for IQ quants)
huggingface-cli download 0xSero/Qwen3.6-28B-REAP20-A3B-GGUF \
--include "imatrix.dat" --local-dir ./
# 3. Quantize (example: Q4_K_M)
llama-quantize model.bf16.gguf Qwen3.6-28B-REAP20-A3B-Q4_K_M.gguf Q4_K_M
# 4. Quantize with imatrix (example: IQ3_XXS)
llama-quantize --imatrix imatrix.dat model.bf16.gguf \
Qwen3.6-28B-REAP20-A3B-IQ3_XXS.gguf IQ3_XXS
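Steps 3 and 4 can be repeated for every quant type with a small driver script. The following is a minimal sketch, assuming `llama-quantize` is on the PATH and the two downloaded files sit in the current directory; the helper name `build_quantize_cmd` is hypothetical and only assembles the same command lines shown above.

```python
import shlex

SOURCE = "model.bf16.gguf"          # BF16 GGUF downloaded in step 1
IMATRIX = "imatrix.dat"             # importance matrix downloaded in step 2
PREFIX = "Qwen3.6-28B-REAP20-A3B"   # output file name prefix


def build_quantize_cmd(quant: str, use_imatrix: bool = False) -> list[str]:
    """Assemble the llama-quantize argument list for one quant type."""
    cmd = ["llama-quantize"]
    if use_imatrix:
        cmd += ["--imatrix", IMATRIX]
    cmd += [SOURCE, f"{PREFIX}-{quant}.gguf", quant]
    return cmd


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for quant in ["Q8_0", "Q6_K", "Q4_K_M"]:
        print(shlex.join(build_quantize_cmd(quant)))
    print(shlex.join(build_quantize_cmd("IQ3_XXS", use_imatrix=True)))
```

To actually run a command, pass the returned list to `subprocess.run(cmd, check=True)`.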
Usage
llama.cpp
llama-cli \
-m Qwen3.6-28B-REAP20-A3B-Q4_K_M.gguf \
-ngl 99 -c 4096 \
-p "Your prompt here"
llama-server (OpenAI-compatible API)
llama-server \
-m Qwen3.6-28B-REAP20-A3B-Q4_K_M.gguf \
-ngl 99 -c 4096 \
--port 8080
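Once the server is running, it can be queried over its OpenAI-compatible chat completions endpoint. The sketch below uses only the Python standard library and assumes the server listens on `localhost:8080` as in the command above. The `model` value is a placeholder chosen for illustration, and the response parsing assumes the usual OpenAI-style `choices[0].message.content` layout.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 in the command above


def build_chat_request(prompt: str, model: str = "local") -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "model": model,  # placeholder name, an assumption for this sketch
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("What is the capital of France?"))` sends a single-turn request and prints the reply.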
LM Studio / Jan / Ollama
Download the .gguf file and load it directly in your preferred local inference UI.
Hardware Requirements
| Config | VRAM / RAM |
|---|---|
| Full GPU (Q4_K_M, recommended) | 20+ GB VRAM |
| Hybrid CPU+GPU (Q4_K_M) | 10 GB VRAM + 10 GB RAM |
| CPU only (Q4_K_M) | 24+ GB RAM |
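The table above covers Q4_K_M only. For other memory budgets, a rough rule is to pick the largest file that still leaves headroom for the KV cache and runtime buffers. The sketch below uses the approximate sizes from the Available Files table; the 3 GB default headroom is an assumption taken from the gap between the ~17 GB Q4_K_M file and the 20 GB VRAM figure, and long contexts will need more.

```python
# Approximate file sizes in GB, copied from the Available Files table.
QUANT_SIZES_GB = {
    "BF16": 56.5, "Q8_0": 30, "Q6_K": 23, "Q5_K_M": 20, "Q5_K_S": 19,
    "Q4_K_M": 17, "Q4_K_S": 16, "Q3_K_L": 15, "Q3_K_M": 14, "Q3_K_S": 13,
    "IQ3_XXS": 12, "Q2_K": 11,
}


def largest_fitting_quant(memory_gb: float, overhead_gb: float = 3.0):
    """Return the largest quant whose file fits in memory_gb minus overhead_gb.

    overhead_gb is an assumed allowance for KV cache and buffers.
    Returns None if no file fits.
    """
    budget = memory_gb - overhead_gb
    fitting = {q: s for q, s in QUANT_SIZES_GB.items() if s <= budget}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)


if __name__ == "__main__":
    for mem in (16, 20, 24, 32):
        print(f"{mem} GB -> {largest_fitting_quant(mem)}")
```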
About the Original Model
0xSero/Qwen3.6-28B-REAP20-A3B applies REAP expert pruning (arXiv:2510.13999) to Qwen3.6-35B-A3B, removing 20% of the MoE experts (51 of 256 per layer) while preserving routing behavior via router weight renormalization. Active parameters per token remain unchanged at ~3B. The result is a model with about 20% fewer total parameters (~28B versus ~35B) and competitive generation quality across coding, reasoning, and knowledge benchmarks.
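To make the expert counts concrete: removing 51 of 256 experts leaves 205 per layer, and the router still selects 8 of them per token. The sketch below illustrates the generic idea of routing over the surviving experts and renormalizing the selected gate weights so they sum to 1. It is not the REAP implementation; which experts are pruned and how the saliency scores are computed are described in the paper, and the choice of surviving experts here is purely hypothetical.

```python
import math

N_EXPERTS, N_PRUNED, TOP_K = 256, 51, 8

# Hypothetical choice: pretend the last 51 expert ids were pruned.
kept = list(range(N_EXPERTS - N_PRUNED))


def route(logits, kept_ids, top_k=TOP_K):
    """Pick top_k experts among kept_ids and return {expert_id: weight}.

    Weights are a softmax over the selected experts only, so they sum to 1.
    """
    ranked = sorted(kept_ids, key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in ranked)  # subtract the max for numerical stability
    exps = {e: math.exp(logits[e] - peak) for e in ranked}
    total = sum(exps.values())
    return {e: v / total for e, v in exps.items()}


if __name__ == "__main__":
    logits = [math.sin(i) for i in range(N_EXPERTS)]  # synthetic router logits
    weights = route(logits, kept)
    print(f"kept experts: {len(kept)} ({N_PRUNED / N_EXPERTS:.1%} pruned)")
    print(f"active experts: {len(weights)}, weight sum: {sum(weights.values()):.3f}")
```

Because the number of active experts per token is unchanged, the per-token compute stays the same, which matches the unchanged ~3B active parameter figure.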
License
Apache 2.0 — see Qwen License.