Qwen3.5-99B-GGUF
GGUF quantization of 0xSero/Qwen3.5-99B.
At a glance
| Property | Value |
|---|---|
| Base model | 0xSero/Qwen3.5-99B |
| Format | GGUF |
| Total params | 99B |
| Active / token | 10B |
| Experts / layer | 205 |
| Layers | — |
| Hidden size | — |
| Context | 262,144 tokens |
| On-disk size | 247 GB |
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
| Qwen3.5-264B | BF16 | link |
| Qwen3.5-264B-FP8 | FP8 | link |
| Qwen3.5-264B-W4A16 | W4A16 | link |
| Qwen3.5-28B | BF16 | link |
| Qwen3.5-35B-EXL3-4bpw | EXL3-4bpw | link |
| Qwen3.5-76B | BF16 | link |
| Qwen3.5-76B-GGUF | GGUF | link |
| Qwen3.5-88B | BF16 | link |
| Qwen3.5-99B | BF16 | link |
| Qwen3.5-99B-GGUF (this) | GGUF | link |
GGUF quantizations of 0xSero/Qwen3.5-99B, a 20% expert-pruned Qwen3.5-122B MoE model using REAP.
Available Quantizations
| File | Quant | BPW | Size | Description |
|---|---|---|---|---|
| Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf | Q4_K_M | 4.86 | 57 GB | Best speed-to-quality ratio. Fits in 64 GB GTT. |
| Qwen3.5-122B-A10B-REAP-20-Q6_K.gguf | Q6_K | 6.57 | 76 GB | Higher quality. Needs 80+ GB VRAM/GTT. |
| Qwen3.5-122B-A10B-REAP-20-Q8_0.gguf | Q8_0 | 8.51 | 99 GB | Near-lossless. Needs 100+ GB VRAM/GTT. |
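The file sizes above follow roughly from the bits-per-weight (BPW) figures. The following is a back-of-the-envelope sketch, assuming a round 99 billion parameters (the exact count is not stated in this card) and assuming the listed sizes are close to binary gigabytes. `estimate_size_gib` is a hypothetical helper for illustration, not part of llama.cpp.

```python
def estimate_size_gib(total_params: float, bits_per_weight: float) -> float:
    """Approximate file size in GiB from a parameter count and an average BPW."""
    total_bytes = total_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 2**30


# Assumption: 99e9 parameters, taken from the rounded "99B" figure.
PARAMS = 99e9

for quant, bpw in [("Q4_K_M", 4.86), ("Q6_K", 6.57), ("Q8_0", 8.51)]:
    print(f"{quant}: ~{estimate_size_gib(PARAMS, bpw):.0f} GiB")
```

Under these assumptions the estimates land within roughly 1 GB of the listed sizes; the remaining gap is plausibly metadata and the rounding of the parameter count.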
Model Details
| Property | Value |
|---|---|
| Base Model | Qwen3.5-122B-A10B |
| Pruned Model | 0xSero/Qwen3.5-99B |
| Architecture | Qwen3.5 MoE (GDN + Full Attention hybrid) |
| Total Parameters | 99B (205 experts/layer, down from 256) |
| Active Parameters | ~10B per token (8 experts selected) |
| Context Length | 262,144 tokens |
| Thinking Mode | Yes (reasoning_content in chat completions) |
| Pruning Method | REAP — 20% expert removal with super-expert protection |
| Quantization Tool | llama.cpp (llama-quantize) |
| Converted From | Safetensors (BF16) via llama.cpp convert_hf_to_gguf.py |
Speed Benchmarks
Tested on AMD Ryzen AI MAX+ 395 (Strix Halo), Radeon 8060S (gfx1151), 128 GB LPDDR5X. llama.cpp b8746, Vulkan RADV, Flash Attention ON.
llama-bench (pp512 / tg128)
| Quant | GPU Layers | Prefill (t/s) | Token Gen (t/s) |
|---|---|---|---|
| Q4_K_M | 49/49 (full) | 295.74 | 27.56 |
| Q6_K | 35/49 (partial) | 121.35 | 15.74 |
| Q8_0 | 25/49 (partial) | 44.55 | 9.89 |
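To turn these throughput figures into wall-clock time, a rough model is prompt tokens divided by prefill rate plus output tokens divided by generation rate. The sketch below applies that to the pp512 / tg128 workload; `estimate_latency_s` is a hypothetical helper, and the model ignores startup cost and any slowdown at longer contexts.

```python
def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_tps: float, gen_tps: float) -> float:
    """Rough end-to-end time: prefill phase plus token-generation phase."""
    return prompt_tokens / prefill_tps + output_tokens / gen_tps


# (prefill t/s, token gen t/s) copied from the llama-bench table above.
BENCH = {
    "Q4_K_M": (295.74, 27.56),
    "Q6_K": (121.35, 15.74),
    "Q8_0": (44.55, 9.89),
}

for quant, (prefill_tps, gen_tps) in BENCH.items():
    seconds = estimate_latency_s(512, 128, prefill_tps, gen_tps)
    print(f"{quant}: ~{seconds:.1f} s for 512 prompt + 128 output tokens")
```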
API Speed (llama-server, real chat completions)
| Quant | Prefill (short) | Prefill (long) | Token Gen |
|---|---|---|---|
| Q4_K_M | 141.8 t/s | 62.3 t/s | 28.4 t/s |
| Q6_K | 48.8 t/s | 21.7 t/s | 15.4 t/s |
| Q8_0 | 25.8 t/s | 14.2 t/s | 9.0 t/s |
Q6_K and Q8_0 are partially offloaded to CPU because they exceed the default 64 GB GTT limit. With GTT increased to 120 GB (BIOS GART + modprobe config), they would run at full GPU speed.
Quality Benchmarks
Tested via llama-server API with thinking mode enabled.
Reasoning (5 questions — math, calculus, logic, code comprehension, knowledge)
| Quant | Score |
|---|---|
| Q4_K_M | 5/5 |
| Q6_K | 5/5 |
| Q8_0 | 5/5 |
All quants produce correct answers for arithmetic (127*43=5461), calculus (derivative of x^3+2x^2-5x+7), formal logic, Python reference semantics, and factual recall.
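The two closed-form questions named above can be checked independently of any model. A minimal sketch, where `differentiate` is a hypothetical helper that takes polynomial coefficients ordered from the highest power down:

```python
def differentiate(coeffs: list[float]) -> list[float]:
    """Differentiate a polynomial given as coefficients, highest power first."""
    degree = len(coeffs) - 1
    # The constant term drops out; each remaining term is scaled by its power.
    return [c * (degree - i) for i, c in enumerate(coeffs[:-1])]


product = 127 * 43
derivative = differentiate([1, 2, -5, 7])  # x^3 + 2x^2 - 5x + 7

print(product)
print(derivative)  # coefficients of the derivative, highest power first
```

The derivative coefficients correspond to 3x^2 + 4x - 5.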
Code Generation (HumanEval subset — 5 problems, executed and tested)
| Quant | Passed |
|---|---|
| Q4_K_M | 4/5 |
| Q6_K | 4/5 |
| Q8_0 | 3/5 |
In these runs the model generated correct code for every problem. The failed cases come from extracting the code out of the thinking-format output, not from model quality.
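The extraction step is where such a harness can go wrong. The sketch below shows one way to pull the final code block out of a response, assuming the raw output wraps its reasoning in `<think>...</think>` tags and puts the answer in a fenced block. The card does not describe the extraction used for these scores, so `extract_code` and its regexes are hypothetical.

```python
import re
from typing import Optional

FENCE = "`" * 3  # built at runtime so this example can sit inside a fenced block

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
CODE_RE = re.compile(FENCE + r"(?:python|py)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(response: str) -> Optional[str]:
    """Return the last fenced code block outside the reasoning section, if any."""
    answer = THINK_RE.sub("", response)  # drop draft code inside the reasoning
    blocks = CODE_RE.findall(answer)
    return blocks[-1].strip() if blocks else None


sample = (
    "<think>First draft:\n" + FENCE + "python\nx = 0\n" + FENCE + "</think>"
    "Final answer:\n" + FENCE + "python\ndef add(a, b):\n    return a + b\n" + FENCE
)
print(extract_code(sample))
```

Taking the last block after stripping the reasoning avoids picking up draft code the model wrote while thinking, which is one plausible source of the failed cases.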
Full Benchmarks (safetensors, from base model card)
| Benchmark | Score |
|---|---|
| HumanEval | 81.1% |
| HumanEval+ | 76.8% |
| MBPP | 86.2% |
| MBPP+ | 73.0% |
| ARC Challenge | 63.7% |
| HellaSwag | 84.1% |
| TruthfulQA MC2 | 52.4% |
| Winogrande | 75.5% |
See the full model card for complete benchmark results and methodology.
How to Run
llama-server (recommended)
# Q4_K_M — fits in 64 GB, fastest
llama-server \
-m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \
-ngl 999 --flash-attn on -c 4096 \
--port 8080 --host 0.0.0.0
# With speculative decoding for faster generation
llama-server \
-m Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf \
-ngl 999 --flash-attn on -c 4096 \
--spec-type ngram-mod --spec-ngram-size-n 24 \
--draft-min 48 --draft-max 64 \
--port 8080 --host 0.0.0.0
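Once either server command above is running, its OpenAI-compatible endpoint can be queried from Python using only the standard library. This is a minimal sketch, assuming the server is reachable at `http://localhost:8080` and that thinking output arrives in the `reasoning_content` field named under Model Details. `build_request`, `split_message`, and `ask` are hypothetical helper names.

```python
import json
import urllib.request

# Assumption: llama-server was started with --port 8080 on this machine.
BASE_URL = "http://localhost:8080/v1"


def build_request(prompt: str, max_tokens: int = 512) -> urllib.request.Request:
    """Build a chat-completions POST request for a single user message."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_message(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat-completions response body."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""


def ask(prompt: str) -> tuple[str, str]:
    """Send the prompt to the running server and split the reply."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return split_message(json.load(resp))

# Example (requires a running server):
#   reasoning, answer = ask("What is the capital of France?")
```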
Ollama
# Create a Modelfile
echo 'FROM ./Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf' > Modelfile
ollama create reap20 -f Modelfile
ollama run reap20
Python (llama-cpp-python)
from llama_cpp import Llama
llm = Llama(
model_path="Qwen3.5-122B-A10B-REAP-20-Q4_K_M.gguf",
n_gpu_layers=-1,
n_ctx=4096,
flash_attn=True,
)
output = llm.create_chat_completion(
messages=[{"role": "user", "content": "Hello!"}],
max_tokens=512,
)
print(output["choices"][0]["message"]["content"])
Which Quant Should I Use?
| Your Setup | Recommended |
|---|---|
| 64 GB VRAM/GTT (e.g., Strix Halo default) | Q4_K_M — full GPU offload, 28 t/s |
| 80-96 GB VRAM/GTT | Q6_K — higher quality, full GPU offload |
| 128+ GB VRAM (e.g., 2x Strix Halo cluster, multi-GPU A100) | Q8_0 — near-lossless quality |
| RTX 4090 (24 GB) | Model too large. Use a smaller model. |
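The table reduces to a simple lookup by memory budget. A small sketch, assuming the minimum-memory figures from the Available Quantizations table (64, 80, and 100 GB); `pick_quant` is a hypothetical helper and ignores KV-cache growth at long contexts:

```python
from typing import Optional

# (quant, minimum GB of VRAM/GTT), largest first, from the quantization table.
QUANT_REQUIREMENTS = [("Q8_0", 100), ("Q6_K", 80), ("Q4_K_M", 64)]


def pick_quant(memory_gb: float) -> Optional[str]:
    """Return the highest-quality quant that fits, or None if none does."""
    for quant, needed_gb in QUANT_REQUIREMENTS:
        if memory_gb >= needed_gb:
            return quant
    return None


for budget in (24, 64, 96, 128):
    print(budget, "GB ->", pick_quant(budget) or "model too large")
```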
Hardware Notes
This model was designed for and tested on AMD Strix Halo (Ryzen AI MAX+ 395) with 128 GB unified memory. It also works on any system with sufficient VRAM/RAM:
- Strix Halo (64 GB GTT default): Q4_K_M fits fully, Q6_K/Q8_0 partial offload
- Strix Halo (120 GB GTT increased): All quants fit fully
- 2x Strix Halo cluster (RPC): All quants at full speed
- NVIDIA A100 80GB: Q4_K_M and Q6_K fit fully
- Apple M-series (128 GB): All quants should work via Metal
What is REAP?
REAP (Router-weighted Expert Activation Pruning) removes the lowest-saliency experts from Mixture-of-Experts models, scoring each expert by its router gate values and activation norms, while preserving critical capabilities. This model has 20% of its experts removed (256 -> 205 per layer) and retains 97.9% average capability across standard benchmarks.
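The pruning idea can be illustrated on toy data. This is a simplified sketch only: it assumes a saliency score equal to the mean of gate value times expert output norm over the tokens routed to each expert, and it does not reproduce the paper's exact criterion or the super-expert protection mentioned under Model Details. All function names are hypothetical.

```python
def expert_saliency(gates: list[list[float]], norms: list[list[float]]) -> list[float]:
    """Mean of gate * output-norm per expert, over tokens routed to that expert.

    gates[t][e] is the router gate of expert e on token t (0.0 if not routed);
    norms[t][e] is the L2 norm of expert e's output on token t.
    """
    scores = []
    for e in range(len(gates[0])):
        routed = [g[e] * n[e] for g, n in zip(gates, norms) if g[e] > 0]
        scores.append(sum(routed) / len(routed) if routed else 0.0)
    return scores


def experts_to_keep(scores: list[float], prune_fraction: float) -> list[int]:
    """Indices of the experts that survive after dropping the lowest scores."""
    num_keep = round(len(scores) * (1 - prune_fraction))
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:num_keep])


# Toy example: 3 tokens, 4 experts, top-2 routing.
gates = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.6, 0.4, 0.0], [0.8, 0.0, 0.0, 0.2]]
norms = [[1.0, 2.0, 1.0, 1.0]] * 3
scores = expert_saliency(gates, norms)
print(experts_to_keep(scores, prune_fraction=0.25))

# The same rounding applied to this model's layer size: 256 experts, 20% pruned.
print(len(experts_to_keep([float(i) for i in range(256)], 0.2)))
```

The last line reproduces the 256 -> 205 count, since 80% of 256 is 204.8 and rounds to 205.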
Credits
- Pruning: 0xSero / Sybil Solutions
- Base Model: Qwen Team
- REAP Method: arxiv:2510.13999
- Quantization: llama.cpp
License
Same license as the base model. See Qwen3.5-122B-A10B license.
Citation
@misc{lasby2025reap,
title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
year = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.