Instructions for using 0xSero/Qwen3.5-99B-GGUF with libraries and local apps.

Libraries

How to use 0xSero/Qwen3.5-99B-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="0xSero/Qwen3.5-99B-GGUF",
	filename="Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)

llama.cpp

How to use 0xSero/Qwen3.5-99B-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf 0xSero/Qwen3.5-99B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf 0xSero/Qwen3.5-99B-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf 0xSero/Qwen3.5-99B-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama cli -hf 0xSero/Qwen3.5-99B-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf 0xSero/Qwen3.5-99B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf 0xSero/Qwen3.5-99B-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf 0xSero/Qwen3.5-99B-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf 0xSero/Qwen3.5-99B-GGUF:Q4_K_M

Use Docker

docker model run hf.co/0xSero/Qwen3.5-99B-GGUF:Q4_K_M

vLLM

How to use 0xSero/Qwen3.5-99B-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "0xSero/Qwen3.5-99B-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "0xSero/Qwen3.5-99B-GGUF",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use 0xSero/Qwen3.5-99B-GGUF with Ollama:
```
ollama run hf.co/0xSero/Qwen3.5-99B-GGUF:Q4_K_M
```

Unsloth Studio

How to use 0xSero/Qwen3.5-99B-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for 0xSero/Qwen3.5-99B-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for 0xSero/Qwen3.5-99B-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for 0xSero/Qwen3.5-99B-GGUF to start chatting

Pi

How to use 0xSero/Qwen3.5-99B-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf 0xSero/Qwen3.5-99B-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "0xSero/Qwen3.5-99B-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use 0xSero/Qwen3.5-99B-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf 0xSero/Qwen3.5-99B-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default 0xSero/Qwen3.5-99B-GGUF:Q4_K_M

Run Hermes

hermes

OpenClaw

How to use 0xSero/Qwen3.5-99B-GGUF with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf 0xSero/Qwen3.5-99B-GGUF:Q4_K_M

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "0xSero/Qwen3.5-99B-GGUF:Q4_K_M" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use 0xSero/Qwen3.5-99B-GGUF with Docker Model Runner:
```
docker model run hf.co/0xSero/Qwen3.5-99B-GGUF:Q4_K_M
```

Lemonade

How to use 0xSero/Qwen3.5-99B-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull 0xSero/Qwen3.5-99B-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.Qwen3.5-99B-GGUF-Q4_K_M

List all available models

lemonade list

Qwen3.5-99B-GGUF

GGUF quantization of 0xSero/Qwen3.5-99B.

At a glance


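| Property | Value |
|---|---|
| Base model | 0xSero/Qwen3.5-99B |
| Format | GGUF |
| Total params | 99B |
| Active / token | 10B |
| Experts / layer | 205 |
| Layers | — |
| Hidden size | — |
| Context | 262,144 tokens |
| On-disk size | 247 GB |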

Which variant should I pick?

| Variant | Format | Link |
|---|---|---|
| `Qwen3.5-264B` | BF16 | link |
| `Qwen3.5-264B-FP8` | FP8 | link |
| `Qwen3.5-264B-W4A16` | W4A16 | link |
| `Qwen3.5-28B` | BF16 | link |
| `Qwen3.5-35B-EXL3-4bpw` | EXL3-4bpw | link |
| `Qwen3.5-76B` | BF16 | link |
| `Qwen3.5-76B-GGUF` | GGUF | link |
| `Qwen3.5-88B` | BF16 | link |
| `Qwen3.5-99B` | BF16 | link |
| `Qwen3.5-99B-GGUF` (this) | GGUF | link |

GGUF quantizations of 0xSero/Qwen3.5-99B, a Qwen3.5-122B-A10B MoE model with 20% of its experts pruned using REAP.

Available Quantizations

| File | Quant | BPW | Size | Description |
|---|---|---|---|---|
| `Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf` | Q4_K_M | 4.86 | 57 GB | Best speed-to-quality ratio. Fits in 64 GB GTT. |
| `Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf` | Q6_K | 6.57 | 76 GB | Higher quality. Needs 80+ GB VRAM/GTT. |
| `Qwen3.5-122B-A10B-REAP-20-Q8_0.gguf` | Q8_0 | 8.51 | 99 GB | Near-lossless. Needs 100+ GB VRAM/GTT. |
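
As a rough cross-check of the table, file size follows from parameter count and bits per weight (BPW): size ≈ params × BPW / 8 bytes. The sketch below assumes exactly 99e9 parameters and that the listed sizes are binary gigabytes (GiB); both are assumptions rather than facts from the card, and `estimate_size_gib` is a hypothetical helper, not part of llama.cpp.

```python
# Rough GGUF size estimate: params * bits-per-weight / 8 bytes.
# Assumptions (not from the model card): exactly 99e9 parameters,
# sizes reported in GiB (2**30 bytes), metadata overhead ignored.

GIB = 2**30


def estimate_size_gib(n_params: float, bpw: float) -> float:
    """Return the approximate file size in GiB for a given bits-per-weight."""
    return n_params * bpw / 8 / GIB


# (BPW, listed size) pairs copied from the Available Quantizations table.
QUANTS = {"Q4_K_M": (4.86, 57), "Q6_K": (6.57, 76), "Q8_0": (8.51, 99)}

estimates = {name: estimate_size_gib(99e9, bpw) for name, (bpw, _) in QUANTS.items()}

for name, (bpw, listed) in QUANTS.items():
    print(f"{name}: estimated {estimates[name]:.1f} GiB, listed {listed} GB")
```

Under these assumptions the estimates land within about 1 GiB of the listed sizes, which suggests the BPW and size columns are mutually consistent.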

Model Details

| Property | Value |
|---|---|
| Base Model | Qwen3.5-122B-A10B |
| Pruned Model | 0xSero/Qwen3.5-99B |
| Architecture | Qwen3.5 MoE (GDN + Full Attention hybrid) |
| Total Parameters | 99B (205 experts/layer, down from 256) |
| Active Parameters | ~10B per token (8 experts selected) |
| Context Length | 262,144 tokens |
| Thinking Mode | Yes (reasoning_content in chat completions) |
| Pruning Method | REAP: 20% expert removal with super-expert protection |
| Quantization Tool | llama.cpp (llama-quantize) |
| Converted From | Safetensors (BF16) via llama.cpp convert_hf_to_gguf.py |

Speed Benchmarks

Tested on an AMD Ryzen AI MAX+ 395 (Strix Halo) with Radeon 8060S graphics (gfx1151) and 128 GB LPDDR5X, using llama.cpp b8746 with the Vulkan RADV backend and Flash Attention enabled.

llama-bench (pp512 / tg128)

| Quant | GPU Layers | Prefill (t/s) | Token Gen (t/s) |
|---|---|---|---|
| Q4_K_M | 49/49 (full) | 295.74 | 27.56 |
| Q6_K | 35/49 (partial) | 121.35 | 15.74 |
| Q8_0 | 25/49 (partial) | 44.55 | 9.89 |

API Speed (llama-server, real chat completions)

| Quant | Prefill (short) | Prefill (long) | Token Gen |
|---|---|---|---|
| Q4_K_M | 141.8 t/s | 62.3 t/s | 28.4 t/s |
| Q6_K | 48.8 t/s | 21.7 t/s | 15.4 t/s |
| Q8_0 | 25.8 t/s | 14.2 t/s | 9.0 t/s |

Q6_K and Q8_0 are partially offloaded to CPU because they exceed the default 64 GB GTT limit (GTT is the pool of system memory the GPU can address). With GTT increased to 120 GB (BIOS GART + modprobe config), they should fit entirely on the GPU and run at full GPU speed.
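
To turn the llama-bench figures into a feel for response time, a simple model is prompt tokens divided by prefill speed plus generated tokens divided by generation speed. The sketch below applies that to the pp512 / tg128 numbers (512 prompt tokens, 128 generated tokens). `estimate_latency_s` is a hypothetical helper, and real requests add overhead such as thinking tokens and longer contexts that it ignores.

```python
def estimate_latency_s(prompt_tokens: int, gen_tokens: int,
                       prefill_tps: float, gen_tps: float) -> float:
    """Approximate wall-clock seconds: prefill time plus generation time."""
    return prompt_tokens / prefill_tps + gen_tokens / gen_tps


# (prefill t/s, token-gen t/s) copied from the llama-bench table.
BENCH = {
    "Q4_K_M": (295.74, 27.56),
    "Q6_K": (121.35, 15.74),
    "Q8_0": (44.55, 9.89),
}

latencies = {
    quant: estimate_latency_s(512, 128, prefill, gen)
    for quant, (prefill, gen) in BENCH.items()
}

for quant, seconds in latencies.items():
    print(f"{quant}: about {seconds:.1f} s for 512 prompt + 128 generated tokens")
```

Under this model the three quants come out at roughly 6 s, 12 s, and 24 s respectively, so the partially offloaded Q8_0 takes about four times as long as Q4_K_M for the same request.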

Quality Benchmarks

Tested via llama-server API with thinking mode enabled.

Reasoning (5 questions — math, calculus, logic, code comprehension, knowledge)

| Quant | Score |
|---|---|
| Q4_K_M | 5/5 |
| Q6_K | 5/5 |
| Q8_0 | 5/5 |

All quants produce correct answers for arithmetic (127*43=5461), calculus (derivative of x^3+2x^2-5x+7), formal logic, Python reference semantics, and factual recall.

Code Generation (HumanEval subset — 5 problems, executed and tested)

| Quant | Passed |
|---|---|
| Q4_K_M | 4/5 |
| Q6_K | 4/5 |
| Q8_0 | 3/5 |

The generated code was correct for all five problems. The recorded failures come from extracting the code out of the thinking-format output, not from model quality.
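
Since the failures are attributed to code extraction rather than generation, an extraction step along the following lines may help. It is a minimal sketch, assuming the reasoning ends with a `</think>` tag and the final answer contains a fenced code block; `extract_code` is a hypothetical helper and not the harness used for the results above.

```python
import re
from typing import Optional

# Built dynamically so this example can itself sit inside a fenced block.
FENCE = "`" * 3

# Matches a fenced block with an optional "python" / "py" language tag.
CODE_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(text: str) -> Optional[str]:
    """Return the last fenced code block after the reasoning, or None."""
    # Keep only what follows the last closing think tag (whole text if absent),
    # so draft snippets inside the reasoning are ignored.
    answer = text.rsplit("</think>", 1)[-1]
    blocks = CODE_RE.findall(answer)
    return blocks[-1].strip() if blocks else None


sample = (
    "<think>Maybe " + FENCE + "python\nprint('draft')\n" + FENCE + "</think>\n"
    "Here is the solution:\n"
    + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
)
print(extract_code(sample))
```

Taking the last block after the reasoning is a design choice: it skips drafts the model writes while thinking and prefers the final answer when several blocks appear.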

Full Benchmarks (safetensors, from base model card)

| Benchmark | Score |
|---|---|
| HumanEval | 81.1% |
| HumanEval+ | 76.8% |
| MBPP | 86.2% |
| MBPP+ | 73.0% |
| ARC Challenge | 63.7% |
| HellaSwag | 84.1% |
| TruthfulQA MC2 | 52.4% |
| Winogrande | 75.5% |

See the full model card for complete benchmark results and methodology.

How to Run

llama-server (recommended)

# Q4_K_M — fits in 64 GB, fastest
llama-server \
  -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \
  -ngl 999 --flash-attn on -c 4096 \
  --port 8080 --host 0.0.0.0

# With speculative decoding for faster generation
llama-server \
  -m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \
  -ngl 999 --flash-attn on -c 4096 \
  --spec-type ngram-mod --spec-ngram-size-n 24 \
  --draft-min 48 --draft-max 64 \
  --port 8080 --host 0.0.0.0
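
Once the server is running, it can be queried over its OpenAI-compatible API. The sketch below uses only the Python standard library. It assumes the server is listening on localhost:8080 as in the commands above, and that reasoning arrives in a separate `reasoning_content` field as the Model Details table indicates. `build_request`, `split_reply`, and `chat` are hypothetical helper names.

```python
import json
import urllib.request


def build_request(prompt: str,
                  base_url: str = "http://localhost:8080/v1",
                  max_tokens: int = 512) -> urllib.request.Request:
    """Build a POST request for the chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_reply(response: dict):
    """Return (reasoning, answer); reasoning is None if the field is absent."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content"), message.get("content")


def chat(prompt: str):
    """Send one prompt to the local server (requires llama-server to be running)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return split_reply(json.load(resp))


# Example (needs a running server):
# reasoning, answer = chat("What is the capital of France?")
# print(answer)
```

Keeping the reasoning and the final answer separate makes it easier to log or discard the thinking output without parsing it out of the reply text.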

Ollama

# Create a Modelfile
echo 'FROM ./Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf' > Modelfile
ollama create reap20 -f Modelfile
ollama run reap20

Python (llama-cpp-python)

from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf",
    n_gpu_layers=-1,
    n_ctx=4096,
    flash_attn=True,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=512,
)
print(output["choices"][0]["message"]["content"])

Which Quant Should I Use?

| Your Setup | Recommended |
|---|---|
| 64 GB VRAM/GTT (e.g., Strix Halo default) | Q4_K_M: full GPU offload, 28 t/s |
| 80-96 GB VRAM/GTT | Q6_K: higher quality, full GPU offload |
| 128+ GB VRAM (e.g., 2x Strix Halo cluster, multi-GPU A100) | Q8_0: near-lossless quality |
| RTX 4090 (24 GB) | Model too large. Use a smaller model. |
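
The table's logic can be sketched as picking the largest quant whose file fits in GPU-addressable memory with some room left for the KV cache and compute buffers. The file sizes come from the Available Quantizations table. The 4 GB headroom is an arbitrary placeholder, and `pick_quant` is a hypothetical helper rather than anything shipped with llama.cpp.

```python
from typing import Optional

# File sizes in GB from the Available Quantizations table, largest first.
QUANT_SIZES_GB = [("Q8_0", 99), ("Q6_K", 76), ("Q4_K_M", 57)]


def pick_quant(memory_gb: float, headroom_gb: float = 4.0) -> Optional[str]:
    """Return the largest quant that fits, or None if even Q4_K_M is too big.

    headroom_gb is an assumed allowance for KV cache and buffers; tune it
    for your context length.
    """
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb + headroom_gb <= memory_gb:
            return name
    return None


for mem in (24, 64, 96, 128):
    print(f"{mem} GB -> {pick_quant(mem)}")
```

Longer contexts need more headroom than the placeholder used here, so treat the result as a starting point and confirm with an actual load.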

Hardware Notes

These quantizations were built for and tested on AMD Strix Halo (Ryzen AI MAX+ 395) with 128 GB unified memory. They should also run on any system with sufficient VRAM/RAM:

Strix Halo (64 GB GTT default): Q4_K_M fits fully, Q6_K/Q8_0 partial offload
Strix Halo (120 GB GTT increased): All quants fit fully
2x Strix Halo cluster (RPC): All quants at full speed
NVIDIA A100 80GB: Q4_K_M and Q6_K fit fully
Apple M-series (128 GB): All quants should work via Metal

What is REAP?

REAP (Router-weighted Expert Activation Pruning) removes the lowest-saliency experts from Mixture-of-Experts models, scoring each expert from its router gate values and activation norms, while preserving critical capabilities. This model has 20% of its experts removed (256 -> 205 per layer) and retains 97.9% average capability across standard benchmarks.
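
Going by the paper's description, each expert is scored by the average of router gate value times expert output norm over the tokens routed to it, and the lowest-scoring experts in each layer are dropped. The toy sketch below illustrates that idea in plain Python. It is not the authors' implementation, the function names are hypothetical, and it omits the super-expert protection mentioned under Model Details.

```python
def expert_saliency(gates, norms):
    """Mean of gate * activation-norm over the tokens routed to one expert."""
    if not gates:
        return 0.0  # an expert that was never routed to gets the lowest score
    return sum(g * n for g, n in zip(gates, norms)) / len(gates)


def select_experts(saliency, prune_fraction=0.2):
    """Return the sorted indices of the experts to keep in one layer."""
    n_keep = round(len(saliency) * (1 - prune_fraction))
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:n_keep])


# Toy layer with 5 experts: pruning 20% drops the single weakest expert.
toy_scores = [0.9, 0.1, 0.5, 0.7, 0.3]
kept = select_experts(toy_scores)
print(kept)

# With 256 experts per layer, a 20% prune keeps 205, matching the model card.
print(len(select_experts([float(i) for i in range(256)])))
```

The 256-expert case shows where the 205 figure comes from: 256 × 0.8 = 204.8, which rounds to 205 experts kept per layer.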

Credits

Pruning: 0xSero / Sybil Solutions
Base Model: Qwen Team
REAP Method: arxiv:2510.13999
Quantization: llama.cpp

License

Same license as the base model. See Qwen3.5-122B-A10B license.

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}

Model tree for 0xSero/Qwen3.5-99B-GGUF

Base model: Qwen/Qwen3.5-122B-A10B
Finetuned: 0xSero/Qwen3.5-99B
Quantized (6): this model

Collections including 0xSero/Qwen3.5-99B-GGUF

Proven REAPs: Benchmarked REAP checkpoints with >=500 all-time downloads. GLM/Qwen/MiniMax/DeepSeek/Kimi/gemma. 20 items.
Qwen — REAP: REAP-pruned & quantized Qwen3.5 / 3.6 / Coder variants. 15 items.

Paper for 0xSero/Qwen3.5-99B-GGUF

REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression (arXiv:2510.13999, published Oct 15, 2025)
