QwenPaw-Flash-9B-heretic non-MTP Version: QwenPaw-Flash-9B-heretic-GGUF

🏆 BenchLocal Total: 4035/5000 (80.7%) — MTP Speculative Decoding Injected

Uncensored · Abliterated · Agent-Optimized · 1.7-4.1× Speedup

Uncensored version of QwenPaw-Flash-9B, processed with Heretic v1.3.0 abliteration, with MTP (Multi-Token Prediction) head weights injected from the original Qwen3.5-9B base model.

By reconstructing the MTP speculative decoding head — which was stripped during the QwenPaw fine-tuning process — this model achieves up to 4.1× inference speedup on real agent benchmarks while maintaining or improving accuracy.

📊 🏆 BenchLocal Benchmarks (With MTP)

Test Environment: NVIDIA RTX 5070 Ti (16GB) · llama.cpp (turboquant build, --spec-type draft-mtp) · Q6_K quant

Framework: BenchLocal — local model agent evaluation suite

Methodology: Each scenario run once, no retries, no second attempts

Benchmark	Score	Accuracy	Results	Time	vs No-MTP
ToolCall-15 🛠️	1500/1500	100%	15✅ 0⚠️ 0❌	0.65min	1.4× faster
HermesAgent-20 🤖	1505/2000	75.3%	12✅ 1⚠️ 7❌	5.3min	1.17× faster
BugFind-15 🐛	1030/1500	68.7%	9✅ 2⚠️ 4❌	1.8min	4.1× faster
Total	4035/5000	80.7%	36✅ 3⚠️ 11❌	7.8min	1.9× faster

Comparison: With vs Without MTP

Benchmark	Without MTP	With MTP	Δ Score	Δ Speed
ToolCall-15 🛠️	1400/1500 (93.3%)	1500/1500 (100%)	+100 pts	1.4×
HermesAgent-20 🤖	1545/2000 (77.2%)	1505/2000 (75.3%)	−40 pts	1.17×
BugFind-15 🐛	928/1500 (61.9%)	1030/1500 (68.7%)	+102 pts	4.1×
Total	3873/5000 (77.5%)	4035/5000 (80.7%)	+162 pts	1.9×
Total Time	14.7 min	7.8 min	—	1.9×

🛠️ ToolCall-15 — Tool Calling Stability (100%, +6.7 pts)

MTP speculative decoding eliminated the single failure (TC-05: Relative date/time parsing, which previously scored 0). All 15 scenarios now pass perfectly.

TC-ID	Result	Scenario
TC-01–TC-04	✅	Simple / Multi / Nested / Type conversion
TC-05	✅	Relative date/time parsing ← fixed by MTP
TC-06–TC-15	✅	All remaining scenarios

🤖 HermesAgent-20 — Complex Agent Tasks (75.3%, −1.9 pts)

MTP decoding introduces minor noise in long-chain reasoning scenarios (~40pt drop), likely because draft tokens occasionally derail the generation path in multi-step planning tasks. However, the speed gain (1.17×) and the fact that the drop is within noise range (single-run variance was 255pts for Qwopus MTP) makes this an acceptable trade-off.

🐛 BugFind-15 — Code Debugging (68.7%, +6.8 pts)

Significant improvement — MTP's faster decoding effectively prevents timeout failures (BF-12 previously hit 300s timeout, now completes in time) and the draft context helps maintain debugging focus.

BF-ID	Without MTP	With MTP	Δ
BF-01	✅ 100	✅ 100	—
BF-02	✅ 88	✅ 100	+12
BF-03	❌ 0	❌ 0	—
BF-04	✅ 100	✅ 100	—
BF-05	❌ 40	⚠️ 70	+30
BF-06	❌ 0	❌ 0	—
BF-07	✅ 100	✅ 100	—
BF-08	✅ 100	✅ 100	—
BF-09	✅ 100	✅ 100	—
BF-10	❌ 0	❌ 0	—
BF-11	⚠️ 60	✅ 100	+40
BF-12	❌ 0 (timeout)	✅ 100	+100
BF-13	✅ 100	✅ 100	—
BF-14	⚠️ 70	⚠️ 60	−10
BF-15	⚠️ 70	⚠️ 60	−10

⚡ MTP Speculative Decoding

What is MTP?

Multi-Token Prediction (MTP) is a speculative decoding technique where a small "draft head" predicts multiple future tokens in parallel. The main model then verifies these drafts in a single forward pass, accepting correct predictions for up to 2-4× speedup in practice.

Injection Method

The original Qwen3.5-9B base model ships with a 4-layer MTP head (~243M params) in its architecture configuration. During QwenPaw fine-tuning, the MTP head weights were stripped (only the config placeholder mtp_num_hidden_layers: 1 remained, but no actual tensors existed in the safetensors).

Recovery process:

Total injected parameters: 243.3M (2.7% of main model)

MTP acceptance rate (draft-n-max=2): ~50% (1083 accepted / 2166 generated across all benchmarks)

Why This Works

The MTP head is a lightweight 4-layer MLP decoder that maps the main model's last hidden state to future token logits. It sits entirely in speculative decoding space — the main model's weights are unchanged, so no fine-tuning or retraining is needed. The head simply needs to exist with compatible dimensions for llama.cpp's --spec-type draft-mtp to activate.

⚡ Comparison: QwenPaw MTP vs Other Models

Model	Total	ToolCall-15	HermesAgent-20	BugFind-15	Total Time
🐾 QwenPaw MTP 9B	4035 🥇	100% 🥇	75.3%	68.7%	7.8min 🥇
🐾 QwenPaw 9B (no MTP)	3873	93.3%	77.2% 🥇	61.9%	14.7min
🧠 Qwopus 9B MTP	3935	93.3%	67.3% ⚠️	79.0% 🥇	21.3min ⚠️
🧠 Qwen 35B Thinking ON	1445 (HA only)	—	72.3%	—	7.0min
⚡ Qwen 35B Thinking OFF	1370 (HA only)	—	68.5%	—	5.1min
🔮 Gemma 4 26B	1405 (HA only)	—	70.3%	—	18.6min

QwenPaw MTP wins on 2/3 benchmarks + total score + total time. The only benchmark it loses is BugFind-15 (to Qwopus MTP), but Qwopus suffers from severe instability (255pt variance on HermesAgent-20, with a worst-case 6.2min timeout).

🧠 Model Description

Base model**: QwenPaw-Flash-9B (Qwen3.5-9B fine-tuned for autonomous agent scenarios)
MTP head source**: Qwen/Qwen3.5-9B (original base model, layer 32 MTP head)
Tool**: Heretic v1.3.0 (automatic directional ablation)
Best trial**: #194 / 230 trials (abliteration)

⚙️ Abliteration Parameters

direction_index = 21.13 attn.o_proj.max_weight = 1.42 attn.o_proj.max_weight_position = 21.72 attn.o_proj.min_weight = 1.11 attn.o_proj.min_weight_distance = 18.14 mlp.down_proj.max_weight = 1.48 mlp.down_proj.max_weight_position = 21.23 mlp.down_proj.min_weight = 1.47 mlp.down_proj.min_weight_distance = 17.47

🏗️ Architecture

Type**: Qwen3_5ForConditionalGeneration (multimodal with vision encoder) + MTP spec head
Main model parameters**: ~9B
MTP head parameters**: ~243M (2.7% overhead)
Layers**: 32 (hybrid: Gated DeltaNet + Gated Attention) + 4 MTP decoder layers
Context length**: 262,144 tokens
Speculative decoding**: --spec-type draft-mtp with --spec-draft-n-max 2

📦 GGUF Files

File	Size	Notes
`QwenPaw-Flash-9B-heretic-MTP-Q8_0.gguf`	~9.2GB	High quality, near lossless
`QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf`	~7.1GB	✅ Recommended, best value
`QwenPaw-Flash-9B-heretic-MTP-Q4_K_M.gguf`	~5.4GB	Compact
`mmproj-BF16`	~880MB	Vision encoder (multimodal) — same as non-MTP version

🚀 Usage

--spec-type draft-mtp

--spec-draft-n-max 2

llama.cpp (with MTP speculative decoding)

# Start server with MTP enabled llama-server -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \ -ngl 99 -fa on -c 8192 \ --spec-type draft-mtp --spec-draft-n-max 2 \ --host 0.0.0.0 --port 8088 # Or with CLI llama-cli -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \ -ngl 99 -fa on -c 8192 \ --spec-type draft-mtp --spec-draft-n-max 2 \ -p "Write a Python script to..."

llama.cpp (without MTP, fallback)

# The model works as a normal GGUF too — just omit spec args llama-server -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \ -ngl 99 -fa on -c 8192 \ --host 0.0.0.0 --port 8088

LM Studio

Load the GGUF file directly. For MTP speculative decoding, LM Studio must support --spec-type — if not, the model functions as a standard 9B model.

📝 Notes

Safety filters have been significantly reduced via abliteration
KL divergence is only 0.0225 — minimal impact on model intelligence
The original model supports multimodal (vision); GGUF versions require the mmproj file from the non-MTP release
BenchLocal scores measured at Q6_K on RTX 5070 Ti 16GB with llama.cpp (turboquant). Each scenario was run once with no retries
MTP acceptance rate of ~50% at draft-n-max=2 means ~25-40% wall-clock speedup on short prompts, and up to 4× on long-generation tasks (debugging, code writing)
BugFind-15 saw the largest improvement (4.1×) because debugging tasks are generation-heavy — more tokens, more drafts accepted
The MTP head is a lossless copy from the original Qwen3.5-9B — no training was involved, simply weight injection
Agent-heavy scenarios (HermesAgent-20) see the least MTP benefit because short-turn interactions don't give the draft head enough runway
Please use responsibly

🙏 Acknowledgements

Heretic — Automated censorship removal
agentscope-ai/QwenPaw-Flash-9B — Base model
Qwen/Qwen3.5-9B — MTP head source
llama.cpp — GGUF quantization and inference
BenchLocal — Local model agent evaluation suite