Instructions to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF", filename="QwenPaw-Flash-9B-heretic-MTP-F16.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M # Run inference directly in the terminal: llama-cli -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M # Run inference directly in the terminal: ./llama-cli -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M # Run inference directly in the terminal: ./build/bin/llama-cli -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
Use Docker
docker model run hf.co/SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
- LM Studio
- Jan
- Ollama
How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with Ollama:
ollama run hf.co/SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
- Unsloth Studio new
How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF to start chatting
- Pi new
How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
Run Hermes
hermes
- Docker Model Runner
How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with Docker Model Runner:
docker model run hf.co/SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
- Lemonade
How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
Run and chat with the model
lemonade run user.QwenPaw-Flash-9B-heretic-MTP-GGUF-Q4_K_M
List all available models
lemonade list
QwenPaw-Flash-9B-heretic-MTP
QwenPaw-Flash-9B-heretic non-MTP Version: QwenPaw-Flash-9B-heretic-GGUF
🏆 BenchLocal Total: 4035/5000 (80.7%) — MTP Speculative Decoding Injected
Uncensored · Abliterated · Agent-Optimized · 1.7-4.1× Speedup
Uncensored version of QwenPaw-Flash-9B, processed with Heretic v1.3.0 abliteration, with MTP (Multi-Token Prediction) head weights injected from the original Qwen3.5-9B base model.
By reconstructing the MTP speculative decoding head — which was stripped during the QwenPaw fine-tuning process — this model achieves up to 4.1× inference speedup on real agent benchmarks while maintaining or improving accuracy.
🏆 BenchLocal Benchmarks (With MTP)
Test Environment: NVIDIA RTX 5070 Ti (16GB) · llama.cpp (turboquant build,
--spec-type draft-mtp) · Q6_K quant
Framework: BenchLocal — local model agent evaluation suite
Methodology: Each scenario run once, no retries, no second attempts
| Benchmark | Score | Accuracy | Results | Time | vs No-MTP |
|---|---|---|---|---|---|
| ToolCall-15 🛠️ | 1500/1500 | 100% | 15✅ 0⚠️ 0❌ | 0.65min | 1.4× faster |
| HermesAgent-20 🤖 | 1505/2000 | 75.3% | 12✅ 1⚠️ 7❌ | 5.3min | 1.17× faster |
| BugFind-15 🐛 | 1030/1500 | 68.7% | 9✅ 2⚠️ 4❌ | 1.8min | 4.1× faster |
| Total | 4035/5000 | 80.7% | 36✅ 3⚠️ 11❌ | 7.8min | 1.9× faster |
Comparison: With vs Without MTP
| Benchmark | Without MTP | With MTP | Δ Score | Δ Speed |
|---|---|---|---|---|
| ToolCall-15 🛠️ | 1400/1500 (93.3%) | 1500/1500 (100%) | +100 pts | 1.4× |
| HermesAgent-20 🤖 | 1545/2000 (77.2%) | 1505/2000 (75.3%) | −40 pts | 1.17× |
| BugFind-15 🐛 | 928/1500 (61.9%) | 1030/1500 (68.7%) | +102 pts | 4.1× |
| Total | 3873/5000 (77.5%) | 4035/5000 (80.7%) | +162 pts | 1.9× |
| Total Time | 14.7 min | 7.8 min | — | 1.9× |
🛠️ ToolCall-15 — Tool Calling Stability (100%, +6.7 pts)
MTP speculative decoding eliminated the single failure (TC-05: Relative date/time parsing, which previously scored 0). All 15 scenarios now pass perfectly.
| TC-ID | Result | Scenario |
|---|---|---|
| TC-01–TC-04 | ✅ | Simple / Multi / Nested / Type conversion |
| TC-05 | ✅ | Relative date/time parsing ← fixed by MTP |
| TC-06–TC-15 | ✅ | All remaining scenarios |
🤖 HermesAgent-20 — Complex Agent Tasks (75.3%, −1.9 pts)
MTP decoding introduces minor noise in long-chain reasoning scenarios (~40pt drop), likely because draft tokens occasionally derail the generation path in multi-step planning tasks. However, the speed gain (1.17×) and the fact that the drop is within noise range (single-run variance was 255pts for Qwopus MTP) makes this an acceptable trade-off.
🐛 BugFind-15 — Code Debugging (68.7%, +6.8 pts)
Significant improvement — MTP's faster decoding effectively prevents timeout failures (BF-12 previously hit 300s timeout, now completes in time) and the draft context helps maintain debugging focus.
| BF-ID | Without MTP | With MTP | Δ |
|---|---|---|---|
| BF-01 | ✅ 100 | ✅ 100 | — |
| BF-02 | ✅ 88 | ✅ 100 | +12 |
| BF-03 | ❌ 0 | ❌ 0 | — |
| BF-04 | ✅ 100 | ✅ 100 | — |
| BF-05 | ❌ 40 | ⚠️ 70 | +30 |
| BF-06 | ❌ 0 | ❌ 0 | — |
| BF-07 | ✅ 100 | ✅ 100 | — |
| BF-08 | ✅ 100 | ✅ 100 | — |
| BF-09 | ✅ 100 | ✅ 100 | — |
| BF-10 | ❌ 0 | ❌ 0 | — |
| BF-11 | ⚠️ 60 | ✅ 100 | +40 |
| BF-12 | ❌ 0 (timeout) | ✅ 100 | +100 |
| BF-13 | ✅ 100 | ✅ 100 | — |
| BF-14 | ⚠️ 70 | ⚠️ 60 | −10 |
| BF-15 | ⚠️ 70 | ⚠️ 60 | −10 |
MTP Speculative Decoding
What is MTP?
Multi-Token Prediction (MTP) is a speculative decoding technique where a small "draft head" predicts multiple future tokens in parallel. The main model then verifies these drafts in a single forward pass, accepting correct predictions for up to 2-4× speedup in practice.
Injection Method
The original Qwen3.5-9B base model ships with a 4-layer MTP head (~243M params) in its architecture configuration. During QwenPaw fine-tuning, the MTP head weights were stripped (only the config placeholder mtp_num_hidden_layers: 1 remained, but no actual tensors existed in the safetensors).
Recovery process:
- Extracted the 15 MTP tensors (
model.layers.32.nextn.*) from the original Qwen3.5-9B safetensors - Validated shape compatibility with QwenPaw's hidden dimension (3584 → 2048 → 1536 → 152064 vocabulary)
- Injected as a new safetensor shard (
model-00009-of-00009.safetensors, 0.45 GB) - Updated
model.safetensors.index.jsonwith MTP weight entries - Converted to GGUF via
convert_hf_to_gguf.pywith--outtype bf16(auto-detectnextn_predict_layers=1) - Quantized to Q6_K / Q8_0 / Q4_K_M
Total injected parameters: 243.3M (2.7% of main model) MTP acceptance rate (draft-n-max=2): ~50% (1083 accepted / 2166 generated across all benchmarks)
Why This Works
The MTP head is a lightweight 4-layer MLP decoder that maps the main model's last hidden state to future token logits. It sits entirely in speculative decoding space — the main model's weights are unchanged, so no fine-tuning or retraining is needed. The head simply needs to exist with compatible dimensions for llama.cpp's --spec-type draft-mtp to activate.
Comparison: QwenPaw MTP vs Other Models
| Model | Total | ToolCall-15 | HermesAgent-20 | BugFind-15 | Total Time |
|---|---|---|---|---|---|
| 🐾 QwenPaw MTP 9B | 4035 🥇 | 100% 🥇 | 75.3% | 68.7% | 7.8min 🥇 |
| 🐾 QwenPaw 9B (no MTP) | 3873 | 93.3% | 77.2% 🥇 | 61.9% | 14.7min |
| 🧠 Qwopus 9B MTP | 3935 | 93.3% | 67.3% ⚠️ | 79.0% 🥇 | 21.3min ⚠️ |
| 🧠 Qwen 35B Thinking ON | 1445 (HA only) | — | 72.3% | — | 7.0min |
| ⚡ Qwen 35B Thinking OFF | 1370 (HA only) | — | 68.5% | — | 5.1min |
| 🔮 Gemma 4 26B | 1405 (HA only) | — | 70.3% | — | 18.6min |
QwenPaw MTP wins on 2/3 benchmarks + total score + total time. The only benchmark it loses is BugFind-15 (to Qwopus MTP), but Qwopus suffers from severe instability (255pt variance on HermesAgent-20, with a worst-case 6.2min timeout).
Model Description
- Base model: QwenPaw-Flash-9B (Qwen3.5-9B fine-tuned for autonomous agent scenarios)
- MTP head source: Qwen/Qwen3.5-9B (original base model, layer 32 MTP head)
- Tool: Heretic v1.3.0 (automatic directional ablation)
- Best trial: #194 / 230 trials (abliteration)
Abliteration Parameters
direction_index = 21.13
attn.o_proj.max_weight = 1.42
attn.o_proj.max_weight_position = 21.72
attn.o_proj.min_weight = 1.11
attn.o_proj.min_weight_distance = 18.14
mlp.down_proj.max_weight = 1.48
mlp.down_proj.max_weight_position = 21.23
mlp.down_proj.min_weight = 1.47
mlp.down_proj.min_weight_distance = 17.47
Architecture
- Type: Qwen3_5ForConditionalGeneration (multimodal with vision encoder) + MTP spec head
- Main model parameters: ~9B
- MTP head parameters: ~243M (2.7% overhead)
- Layers: 32 (hybrid: Gated DeltaNet + Gated Attention) + 4 MTP decoder layers
- Context length: 262,144 tokens
- Speculative decoding:
--spec-type draft-mtpwith--spec-draft-n-max 2
GGUF Files
| File | Size | Notes |
|---|---|---|
QwenPaw-Flash-9B-heretic-MTP-Q8_0.gguf |
~9.2GB | High quality, near lossless |
QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf |
~7.1GB | ✅ Recommended, best value |
QwenPaw-Flash-9B-heretic-MTP-Q4_K_M.gguf |
~5.4GB | Compact |
mmproj-BF16 |
~880MB | Vision encoder (multimodal) — same as non-MTP version |
Usage
--spec-type draft-mtp --spec-draft-n-max 2
llama.cpp (with MTP speculative decoding)
# Start server with MTP enabled
llama-server -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \
-ngl 99 -fa on -c 8192 \
--spec-type draft-mtp --spec-draft-n-max 2 \
--host 0.0.0.0 --port 8088
# Or with CLI
llama-cli -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \
-ngl 99 -fa on -c 8192 \
--spec-type draft-mtp --spec-draft-n-max 2 \
-p "Write a Python script to..."
llama.cpp (without MTP, fallback)
# The model works as a normal GGUF too — just omit spec args
llama-server -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \
-ngl 99 -fa on -c 8192 \
--host 0.0.0.0 --port 8088
LM Studio
Load the GGUF file directly. For MTP speculative decoding, LM Studio must support --spec-type — if not, the model functions as a standard 9B model.
Notes
- Safety filters have been significantly reduced via abliteration
- KL divergence is only 0.0225 — minimal impact on model intelligence
- The original model supports multimodal (vision); GGUF versions require the mmproj file from the non-MTP release
- BenchLocal scores measured at Q6_K on RTX 5070 Ti 16GB with llama.cpp (turboquant). Each scenario was run once with no retries
- MTP acceptance rate of ~50% at draft-n-max=2 means ~25-40% wall-clock speedup on short prompts, and up to 4× on long-generation tasks (debugging, code writing)
- BugFind-15 saw the largest improvement (4.1×) because debugging tasks are generation-heavy — more tokens, more drafts accepted
- The MTP head is a lossless copy from the original Qwen3.5-9B — no training was involved, simply weight injection
- Agent-heavy scenarios (HermesAgent-20) see the least MTP benefit because short-turn interactions don't give the draft head enough runway
- Please use responsibly
Acknowledgements
- Heretic — Automated censorship removal
- agentscope-ai/QwenPaw-Flash-9B — Base model
- Qwen/Qwen3.5-9B — MTP head source
- llama.cpp — GGUF quantization and inference
- BenchLocal — Local model agent evaluation suite
- Downloads last month
- 462
4-bit
6-bit
8-bit
16-bit