QwenPaw-Flash-9B-heretic-MTP

QwenPaw-Flash-9B-heretic non-MTP Version: QwenPaw-Flash-9B-heretic-GGUF

🏆 BenchLocal Total: 4035/5000 (80.7%) — MTP Speculative Decoding Injected
Uncensored · Abliterated · Agent-Optimized · 1.7-4.1× Speedup

Uncensored version of QwenPaw-Flash-9B, processed with Heretic v1.3.0 abliteration, with MTP (Multi-Token Prediction) head weights injected from the original Qwen3.5-9B base model.

By reconstructing the MTP speculative decoding head — which was stripped during the QwenPaw fine-tuning process — this model achieves up to 4.1× inference speedup on real agent benchmarks while maintaining or improving accuracy.


🏆 BenchLocal Benchmarks (With MTP)

Test Environment: NVIDIA RTX 5070 Ti (16GB) · llama.cpp (turboquant build, --spec-type draft-mtp) · Q6_K quant
Framework: BenchLocal — local model agent evaluation suite
Methodology: Each scenario run once, no retries, no second attempts

Benchmark Score Accuracy Results Time vs No-MTP
ToolCall-15 🛠️ 1500/1500 100% 15✅ 0⚠️ 0❌ 0.65min 1.4× faster
HermesAgent-20 🤖 1505/2000 75.3% 12✅ 1⚠️ 7❌ 5.3min 1.17× faster
BugFind-15 🐛 1030/1500 68.7% 9✅ 2⚠️ 4❌ 1.8min 4.1× faster
Total 4035/5000 80.7% 36✅ 3⚠️ 11❌ 7.8min 1.9× faster

Comparison: With vs Without MTP

Benchmark Without MTP With MTP Δ Score Δ Speed
ToolCall-15 🛠️ 1400/1500 (93.3%) 1500/1500 (100%) +100 pts 1.4×
HermesAgent-20 🤖 1545/2000 (77.2%) 1505/2000 (75.3%) −40 pts 1.17×
BugFind-15 🐛 928/1500 (61.9%) 1030/1500 (68.7%) +102 pts 4.1×
Total 3873/5000 (77.5%) 4035/5000 (80.7%) +162 pts 1.9×
Total Time 14.7 min 7.8 min 1.9×

🛠️ ToolCall-15 — Tool Calling Stability (100%, +6.7 pts)

MTP speculative decoding eliminated the single failure (TC-05: Relative date/time parsing, which previously scored 0). All 15 scenarios now pass perfectly.

TC-ID Result Scenario
TC-01–TC-04 Simple / Multi / Nested / Type conversion
TC-05 Relative date/time parsing ← fixed by MTP
TC-06–TC-15 All remaining scenarios

🤖 HermesAgent-20 — Complex Agent Tasks (75.3%, −1.9 pts)

MTP decoding introduces minor noise in long-chain reasoning scenarios (~40pt drop), likely because draft tokens occasionally derail the generation path in multi-step planning tasks. However, the speed gain (1.17×) and the fact that the drop is within noise range (single-run variance was 255pts for Qwopus MTP) makes this an acceptable trade-off.

🐛 BugFind-15 — Code Debugging (68.7%, +6.8 pts)

Significant improvement — MTP's faster decoding effectively prevents timeout failures (BF-12 previously hit 300s timeout, now completes in time) and the draft context helps maintain debugging focus.

BF-ID Without MTP With MTP Δ
BF-01 ✅ 100 ✅ 100
BF-02 ✅ 88 ✅ 100 +12
BF-03 ❌ 0 ❌ 0
BF-04 ✅ 100 ✅ 100
BF-05 ❌ 40 ⚠️ 70 +30
BF-06 ❌ 0 ❌ 0
BF-07 ✅ 100 ✅ 100
BF-08 ✅ 100 ✅ 100
BF-09 ✅ 100 ✅ 100
BF-10 ❌ 0 ❌ 0
BF-11 ⚠️ 60 ✅ 100 +40
BF-12 ❌ 0 (timeout) ✅ 100 +100
BF-13 ✅ 100 ✅ 100
BF-14 ⚠️ 70 ⚠️ 60 −10
BF-15 ⚠️ 70 ⚠️ 60 −10

MTP Speculative Decoding

What is MTP?

Multi-Token Prediction (MTP) is a speculative decoding technique where a small "draft head" predicts multiple future tokens in parallel. The main model then verifies these drafts in a single forward pass, accepting correct predictions for up to 2-4× speedup in practice.

Injection Method

The original Qwen3.5-9B base model ships with a 4-layer MTP head (~243M params) in its architecture configuration. During QwenPaw fine-tuning, the MTP head weights were stripped (only the config placeholder mtp_num_hidden_layers: 1 remained, but no actual tensors existed in the safetensors).

Recovery process:

  1. Extracted the 15 MTP tensors (model.layers.32.nextn.*) from the original Qwen3.5-9B safetensors
  2. Validated shape compatibility with QwenPaw's hidden dimension (3584 → 2048 → 1536 → 152064 vocabulary)
  3. Injected as a new safetensor shard (model-00009-of-00009.safetensors, 0.45 GB)
  4. Updated model.safetensors.index.json with MTP weight entries
  5. Converted to GGUF via convert_hf_to_gguf.py with --outtype bf16 (auto-detect nextn_predict_layers=1)
  6. Quantized to Q6_K / Q8_0 / Q4_K_M

Total injected parameters: 243.3M (2.7% of main model) MTP acceptance rate (draft-n-max=2): ~50% (1083 accepted / 2166 generated across all benchmarks)

Why This Works

The MTP head is a lightweight 4-layer MLP decoder that maps the main model's last hidden state to future token logits. It sits entirely in speculative decoding space — the main model's weights are unchanged, so no fine-tuning or retraining is needed. The head simply needs to exist with compatible dimensions for llama.cpp's --spec-type draft-mtp to activate.


Comparison: QwenPaw MTP vs Other Models

Model Total ToolCall-15 HermesAgent-20 BugFind-15 Total Time
🐾 QwenPaw MTP 9B 4035 🥇 100% 🥇 75.3% 68.7% 7.8min 🥇
🐾 QwenPaw 9B (no MTP) 3873 93.3% 77.2% 🥇 61.9% 14.7min
🧠 Qwopus 9B MTP 3935 93.3% 67.3% ⚠️ 79.0% 🥇 21.3min ⚠️
🧠 Qwen 35B Thinking ON 1445 (HA only) 72.3% 7.0min
⚡ Qwen 35B Thinking OFF 1370 (HA only) 68.5% 5.1min
🔮 Gemma 4 26B 1405 (HA only) 70.3% 18.6min

QwenPaw MTP wins on 2/3 benchmarks + total score + total time. The only benchmark it loses is BugFind-15 (to Qwopus MTP), but Qwopus suffers from severe instability (255pt variance on HermesAgent-20, with a worst-case 6.2min timeout).


Model Description

  • Base model: QwenPaw-Flash-9B (Qwen3.5-9B fine-tuned for autonomous agent scenarios)
  • MTP head source: Qwen/Qwen3.5-9B (original base model, layer 32 MTP head)
  • Tool: Heretic v1.3.0 (automatic directional ablation)
  • Best trial: #194 / 230 trials (abliteration)

Abliteration Parameters

direction_index = 21.13
attn.o_proj.max_weight = 1.42
attn.o_proj.max_weight_position = 21.72
attn.o_proj.min_weight = 1.11
attn.o_proj.min_weight_distance = 18.14
mlp.down_proj.max_weight = 1.48
mlp.down_proj.max_weight_position = 21.23
mlp.down_proj.min_weight = 1.47
mlp.down_proj.min_weight_distance = 17.47

Architecture

  • Type: Qwen3_5ForConditionalGeneration (multimodal with vision encoder) + MTP spec head
  • Main model parameters: ~9B
  • MTP head parameters: ~243M (2.7% overhead)
  • Layers: 32 (hybrid: Gated DeltaNet + Gated Attention) + 4 MTP decoder layers
  • Context length: 262,144 tokens
  • Speculative decoding: --spec-type draft-mtp with --spec-draft-n-max 2

GGUF Files

File Size Notes
QwenPaw-Flash-9B-heretic-MTP-Q8_0.gguf ~9.2GB High quality, near lossless
QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf ~7.1GB Recommended, best value
QwenPaw-Flash-9B-heretic-MTP-Q4_K_M.gguf ~5.4GB Compact
mmproj-BF16 ~880MB Vision encoder (multimodal) — same as non-MTP version

Usage

--spec-type draft-mtp --spec-draft-n-max 2

llama.cpp (with MTP speculative decoding)

# Start server with MTP enabled
llama-server -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \
  -ngl 99 -fa on -c 8192 \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  --host 0.0.0.0 --port 8088

# Or with CLI
llama-cli -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \
  -ngl 99 -fa on -c 8192 \
  --spec-type draft-mtp --spec-draft-n-max 2 \
  -p "Write a Python script to..."

llama.cpp (without MTP, fallback)

# The model works as a normal GGUF too — just omit spec args
llama-server -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \
  -ngl 99 -fa on -c 8192 \
  --host 0.0.0.0 --port 8088

LM Studio

Load the GGUF file directly. For MTP speculative decoding, LM Studio must support --spec-type — if not, the model functions as a standard 9B model.

Notes

  1. Safety filters have been significantly reduced via abliteration
  2. KL divergence is only 0.0225 — minimal impact on model intelligence
  3. The original model supports multimodal (vision); GGUF versions require the mmproj file from the non-MTP release
  4. BenchLocal scores measured at Q6_K on RTX 5070 Ti 16GB with llama.cpp (turboquant). Each scenario was run once with no retries
  5. MTP acceptance rate of ~50% at draft-n-max=2 means ~25-40% wall-clock speedup on short prompts, and up to 4× on long-generation tasks (debugging, code writing)
  6. BugFind-15 saw the largest improvement (4.1×) because debugging tasks are generation-heavy — more tokens, more drafts accepted
  7. The MTP head is a lossless copy from the original Qwen3.5-9B — no training was involved, simply weight injection
  8. Agent-heavy scenarios (HermesAgent-20) see the least MTP benefit because short-turn interactions don't give the draft head enough runway
  9. Please use responsibly

Acknowledgements

Downloads last month
462
GGUF
Model size
9B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for SC117/QwenPaw-Flash-9B-heretic-MTP-GGUF

Finetuned
Qwen/Qwen3.5-9B
Quantized
(227)
this model