How to use from the llama-cpp-python library
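# !pip install llama-cpp-python

from llama_cpp import Llama

# `filename` must name one GGUF file in the repo (a glob pattern is also
# accepted). It was left blank here, so fill it in from the repo's file
# list before running.
llm = Llama.from_pretrained(
    repo_id="0xSero/Qwen3.6-28B-GGUF",
    filename="",
)

response = llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "What is the capital of France?",
        }
    ]
)

# The result is an OpenAI-style dict; the reply text is nested here.
print(response["choices"][0]["message"]["content"])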


Qwen3.6-28B-GGUF

GGUF quantizations of the base model at several bit widths.

At a glance

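| Property | Value |
|---|---|
| Base model | — |
| Format | GGUF |
| Total params | 28B |
| Active params per token | 3B |
| Experts per layer | — |
| Layers | — |
| Hidden size | — |
| Context length | — |
| On-disk size | 147 GB |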
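The total and active parameter counts above give the fraction of weights each token actually uses, which is what "active per token" means for a mixture-of-experts model:

```latex
\frac{\text{active params per token}}{\text{total params}}
  = \frac{3\,\text{B}}{28\,\text{B}} \approx 0.107
```

So each token touches roughly 11% of the weights, which governs per-token compute, but all 28B parameters still have to be available to the runtime, which governs the on-disk and memory footprint.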

Which variant should I pick?

| Variant | Format |
|---|---|
| Qwen3.6-28B | BF16 |
| Qwen3.6-28B-GGUF (this repo) | GGUF |
| Qwen3.6-35B-GGUF | GGUF |

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

GGUF details

Model size: 28B params
Architecture: qwen35moe
Quantizations: 4-bit, 5-bit, 6-bit, 8-bit, 16-bit
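As a rough guide to choosing among the 4- to 16-bit files, a quantized file's size is approximately parameter count × bits per weight ÷ 8. The sketch below applies that to the 28B total parameters using only the nominal bit widths. That is a simplifying assumption: real GGUF quants also store per-block scales and usually keep some tensors at higher precision, so actual files will generally come out somewhat larger than these figures.

```python
TOTAL_PARAMS = 28e9  # total parameter count (28B params, as listed above)
NOMINAL_BITS = [4, 5, 6, 8, 16]  # bit widths listed for this repo


def nominal_size_gb(params: float, bits_per_weight: float) -> float:
    """Approximate file size in GB (1e9 bytes) as params * bits / 8.

    Ignores quantization-block scales and mixed-precision tensors,
    so it tends to undershoot real GGUF file sizes.
    """
    return params * bits_per_weight / 8 / 1e9


for bits in NOMINAL_BITS:
    print(f"{bits:>2}-bit: ~{nominal_size_gb(TOTAL_PARAMS, bits):.1f} GB")
```

This gives roughly 14, 17.5, 21, 28 and 56 GB for the 4-, 5-, 6-, 8- and 16-bit files respectively.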