Instructions to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF",
	filename="QwenPaw-Flash-9B-heretic-MTP-F16.gguf",
)

llm.create_chat_completion(
	messages = "No input example has been defined for this model task."
)

Notebooks
Google Colab
Kaggle
Local Apps

llama.cpp

How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M

Use Docker

docker model run hf.co/SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M

LM Studio
Jan
Ollama
How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with Ollama:
```
ollama run hf.co/SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
```

Unsloth Studio new

How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF to start chatting

Pi new

How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent new

How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with Docker Model Runner:
```
docker model run hf.co/SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M
```

Lemonade

How to use SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF:Q4_K_M

Run and chat with the model

lemonade run user.QwenPaw-Flash-9B-heretic-MTP-GGUF-Q4_K_M

List all available models

lemonade list

QwenPaw-Flash-9B-heretic-MTP

QwenPaw-Flash-9B-heretic non-MTP Version: QwenPaw-Flash-9B-heretic-GGUF

🏆 BenchLocal Total: 4035/5000 (80.7%) — MTP Speculative Decoding Injected
Uncensored · Abliterated · Agent-Optimized · 1.7-4.1× Speedup

Uncensored version of QwenPaw-Flash-9B, processed with Heretic v1.3.0 abliteration, with MTP (Multi-Token Prediction) head weights injected from the original Qwen3.5-9B base model.

By reconstructing the MTP speculative decoding head — which was stripped during the QwenPaw fine-tuning process — this model achieves up to 4.1× inference speedup on real agent benchmarks while maintaining or improving accuracy.

🏆 BenchLocal Benchmarks (With MTP)

Test Environment: NVIDIA RTX 5070 Ti (16GB) · llama.cpp (turboquant build, --spec-type draft-mtp) · Q6_K quant
Framework: BenchLocal — local model agent evaluation suite
Methodology: Each scenario run once, no retries, no second attempts

Benchmark	Score	Accuracy	Results	Time	vs No-MTP
ToolCall-15 🛠️	1500/1500	100%	15✅ 0⚠️ 0❌	0.65min	1.4× faster
HermesAgent-20 🤖	1505/2000	75.3%	12✅ 1⚠️ 7❌	5.3min	1.17× faster
BugFind-15 🐛	1030/1500	68.7%	9✅ 2⚠️ 4❌	1.8min	4.1× faster
Total	4035/5000	80.7%	36✅ 3⚠️ 11❌	7.8min	1.9× faster

Comparison: With vs Without MTP

Benchmark	Without MTP	With MTP	Δ Score	Δ Speed
ToolCall-15 🛠️	1400/1500 (93.3%)	1500/1500 (100%)	+100 pts	1.4×
HermesAgent-20 🤖	1545/2000 (77.2%)	1505/2000 (75.3%)	−40 pts	1.17×
BugFind-15 🐛	928/1500 (61.9%)	1030/1500 (68.7%)	+102 pts	4.1×
Total	3873/5000 (77.5%)	4035/5000 (80.7%)	+162 pts	1.9×
Total Time	14.7 min	7.8 min	—	1.9×

🛠️ ToolCall-15 — Tool Calling Stability (100%, +6.7 pts)

MTP speculative decoding eliminated the single failure (TC-05: Relative date/time parsing, which previously scored 0). All 15 scenarios now pass perfectly.

TC-ID	Result	Scenario
TC-01–TC-04	✅	Simple / Multi / Nested / Type conversion
TC-05	✅	Relative date/time parsing ← fixed by MTP
TC-06–TC-15	✅	All remaining scenarios

🤖 HermesAgent-20 — Complex Agent Tasks (75.3%, −1.9 pts)

MTP decoding introduces minor noise in long-chain reasoning scenarios (~40pt drop), likely because draft tokens occasionally derail the generation path in multi-step planning tasks. However, the speed gain (1.17×) and the fact that the drop is within noise range (single-run variance was 255pts for Qwopus MTP) makes this an acceptable trade-off.

🐛 BugFind-15 — Code Debugging (68.7%, +6.8 pts)

Significant improvement — MTP's faster decoding effectively prevents timeout failures (BF-12 previously hit 300s timeout, now completes in time) and the draft context helps maintain debugging focus.

BF-ID	Without MTP	With MTP	Δ
BF-01	✅ 100	✅ 100	—
BF-02	✅ 88	✅ 100	+12
BF-03	❌ 0	❌ 0	—
BF-04	✅ 100	✅ 100	—
BF-05	❌ 40	⚠️ 70	+30
BF-06	❌ 0	❌ 0	—
BF-07	✅ 100	✅ 100	—
BF-08	✅ 100	✅ 100	—
BF-09	✅ 100	✅ 100	—
BF-10	❌ 0	❌ 0	—
BF-11	⚠️ 60	✅ 100	+40
BF-12	❌ 0 (timeout)	✅ 100	+100
BF-13	✅ 100	✅ 100	—
BF-14	⚠️ 70	⚠️ 60	−10
BF-15	⚠️ 70	⚠️ 60	−10

MTP Speculative Decoding

What is MTP?

Multi-Token Prediction (MTP) is a speculative decoding technique where a small "draft head" predicts multiple future tokens in parallel. The main model then verifies these drafts in a single forward pass, accepting correct predictions for up to 2-4× speedup in practice.

Injection Method

The original Qwen3.5-9B base model ships with a 4-layer MTP head (~243M params) in its architecture configuration. During QwenPaw fine-tuning, the MTP head weights were stripped (only the config placeholder mtp_num_hidden_layers: 1 remained, but no actual tensors existed in the safetensors).

Recovery process:

Extracted the 15 MTP tensors (model.layers.32.nextn.*) from the original Qwen3.5-9B safetensors
Validated shape compatibility with QwenPaw's hidden dimension (3584 → 2048 → 1536 → 152064 vocabulary)
Injected as a new safetensor shard (model-00009-of-00009.safetensors, 0.45 GB)
Updated model.safetensors.index.json with MTP weight entries
Converted to GGUF via convert_hf_to_gguf.py with --outtype bf16 (auto-detect nextn_predict_layers=1)
Quantized to Q6_K / Q8_0 / Q4_K_M

Total injected parameters: 243.3M (2.7% of main model) MTP acceptance rate (draft-n-max=2): ~50% (1083 accepted / 2166 generated across all benchmarks)

Why This Works

The MTP head is a lightweight 4-layer MLP decoder that maps the main model's last hidden state to future token logits. It sits entirely in speculative decoding space — the main model's weights are unchanged, so no fine-tuning or retraining is needed. The head simply needs to exist with compatible dimensions for llama.cpp's --spec-type draft-mtp to activate.

Comparison: QwenPaw MTP vs Other Models

Model	Total	ToolCall-15	HermesAgent-20	BugFind-15	Total Time
🐾 QwenPaw MTP 9B	4035 🥇	100% 🥇	75.3%	68.7%	7.8min 🥇
🐾 QwenPaw 9B (no MTP)	3873	93.3%	77.2% 🥇	61.9%	14.7min
🧠 Qwopus 9B MTP	3935	93.3%	67.3% ⚠️	79.0% 🥇	21.3min ⚠️
🧠 Qwen 35B Thinking ON	1445 (HA only)	—	72.3%	—	7.0min
⚡ Qwen 35B Thinking OFF	1370 (HA only)	—	68.5%	—	5.1min
🔮 Gemma 4 26B	1405 (HA only)	—	70.3%	—	18.6min

QwenPaw MTP wins on 2/3 benchmarks + total score + total time. The only benchmark it loses is BugFind-15 (to Qwopus MTP), but Qwopus suffers from severe instability (255pt variance on HermesAgent-20, with a worst-case 6.2min timeout).

Model Description

Base model: QwenPaw-Flash-9B (Qwen3.5-9B fine-tuned for autonomous agent scenarios)
MTP head source: Qwen/Qwen3.5-9B (original base model, layer 32 MTP head)
Tool: Heretic v1.3.0 (automatic directional ablation)
Best trial: #194 / 230 trials (abliteration)

Abliteration Parameters

direction_index = 21.13
attn.o_proj.max_weight = 1.42
attn.o_proj.max_weight_position = 21.72
attn.o_proj.min_weight = 1.11
attn.o_proj.min_weight_distance = 18.14
mlp.down_proj.max_weight = 1.48
mlp.down_proj.max_weight_position = 21.23
mlp.down_proj.min_weight = 1.47
mlp.down_proj.min_weight_distance = 17.47

Architecture

Type: Qwen3_5ForConditionalGeneration (multimodal with vision encoder) + MTP spec head
Main model parameters: ~9B
MTP head parameters: ~243M (2.7% overhead)
Layers: 32 (hybrid: Gated DeltaNet + Gated Attention) + 4 MTP decoder layers
Context length: 262,144 tokens
Speculative decoding: --spec-type draft-mtp with --spec-draft-n-max 2

GGUF Files

File	Size	Notes
`QwenPaw-Flash-9B-heretic-MTP-Q8_0.gguf`	~9.2GB	High quality, near lossless
`QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf`	~7.1GB	✅ Recommended, best value
`QwenPaw-Flash-9B-heretic-MTP-Q4_K_M.gguf`	~5.4GB	Compact
`mmproj-BF16`	~880MB	Vision encoder (multimodal) — same as non-MTP version

Usage

--spec-type draft-mtp --spec-draft-n-max 2

llama.cpp (with MTP speculative decoding)

# Start server with MTP enabled
llama-server -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \
  -ngl 99 -fa on -c 8192 \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  --host 0.0.0.0 --port 8088

# Or with CLI
llama-cli -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \
  -ngl 99 -fa on -c 8192 \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  -p "Write a Python script to..."

llama.cpp (without MTP, fallback)

# The model works as a normal GGUF too — just omit spec args
llama-server -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \
  -ngl 99 -fa on -c 8192 \
  --host 0.0.0.0 --port 8088

LM Studio

Load the GGUF file directly. For MTP speculative decoding, LM Studio must support --spec-type — if not, the model functions as a standard 9B model.

Notes

Safety filters have been significantly reduced via abliteration
KL divergence is only 0.0225 — minimal impact on model intelligence
The original model supports multimodal (vision); GGUF versions require the mmproj file from the non-MTP release
BenchLocal scores measured at Q6_K on RTX 5070 Ti 16GB with llama.cpp (turboquant). Each scenario was run once with no retries
MTP acceptance rate of ~50% at draft-n-max=2 means ~25-40% wall-clock speedup on short prompts, and up to 4× on long-generation tasks (debugging, code writing)
BugFind-15 saw the largest improvement (4.1×) because debugging tasks are generation-heavy — more tokens, more drafts accepted
The MTP head is a lossless copy from the original Qwen3.5-9B — no training was involved, simply weight injection
Agent-heavy scenarios (HermesAgent-20) see the least MTP benefit because short-turn interactions don't give the draft head enough runway
Please use responsibly

Acknowledgements

Heretic — Automated censorship removal
agentscope-ai/QwenPaw-Flash-9B — Base model
Qwen/Qwen3.5-9B — MTP head source
llama.cpp — GGUF quantization and inference
BenchLocal — Local model agent evaluation suite

Downloads last month: 462

GGUF

Model size

9B params

Architecture

qwen35

Hardware compatibility

4-bit

6-bit

8-bit

16-bit

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF

Base model

Qwen/Qwen3.5-9B-Base

Finetuned

Qwen/Qwen3.5-9B

Quantized

(227)

this model