--- language: - en - zh base_model: - Qwen/Qwen3.5-9B - agentscope-ai/QwenPaw-Flash-9B tags: - heretic - abliteration - uncensored - mtp - speculative-decoding - qwen3.5 - gguf - benchlocal - benchmark - agent - tool-call license: apache-2.0 ---
QwenPaw-Flash-9B-heretic non-MTP Version: QwenPaw-Flash-9B-heretic-GGUF
๐ BenchLocal Total: 4035/5000 (80.7%) โ MTP Speculative Decoding Injected
Uncensored ยท Abliterated ยท Agent-Optimized ยท 1.7-4.1ร Speedup
Uncensored version of QwenPaw-Flash-9B, processed with Heretic v1.3.0 abliteration, with MTP (Multi-Token Prediction) head weights injected from the original Qwen3.5-9B base model.
By reconstructing the MTP speculative decoding head โ which was stripped during the QwenPaw fine-tuning process โ this model achieves up to 4.1ร inference speedup on real agent benchmarks while maintaining or improving accuracy.
Test Environment: NVIDIA RTX 5070 Ti (16GB) ยท llama.cpp (turboquant build, --spec-type draft-mtp) ยท Q6_K quant
Framework: BenchLocal โ local model agent evaluation suite
Methodology: Each scenario run once, no retries, no second attempts
| Benchmark | Score | Accuracy | Results | Time | vs No-MTP |
|---|---|---|---|---|---|
| ToolCall-15 ๐ ๏ธ | 1500/1500 | 100% | 15โ 0โ ๏ธ 0โ | 0.65min | 1.4ร faster |
| HermesAgent-20 ๐ค | 1505/2000 | 75.3% | 12โ 1โ ๏ธ 7โ | 5.3min | 1.17ร faster |
| BugFind-15 ๐ | 1030/1500 | 68.7% | 9โ 2โ ๏ธ 4โ | 1.8min | 4.1ร faster |
| Total | 4035/5000 | 80.7% | 36โ 3โ ๏ธ 11โ | 7.8min | 1.9ร faster |
| Benchmark | Without MTP | With MTP | ฮ Score | ฮ Speed |
|---|---|---|---|---|
| ToolCall-15 ๐ ๏ธ | 1400/1500 (93.3%) | 1500/1500 (100%) | +100 pts | 1.4ร |
| HermesAgent-20 ๐ค | 1545/2000 (77.2%) | 1505/2000 (75.3%) | โ40 pts | 1.17ร |
| BugFind-15 ๐ | 928/1500 (61.9%) | 1030/1500 (68.7%) | +102 pts | 4.1ร |
| Total | 3873/5000 (77.5%) | 4035/5000 (80.7%) | +162 pts | 1.9ร |
| Total Time | 14.7 min | 7.8 min | โ | 1.9ร |
MTP speculative decoding eliminated the single failure (TC-05: Relative date/time parsing, which previously scored 0). All 15 scenarios now pass perfectly.
| TC-ID | Result | Scenario |
|---|---|---|
| TC-01โTC-04 | โ | Simple / Multi / Nested / Type conversion |
| TC-05 | โ | Relative date/time parsing โ fixed by MTP |
| TC-06โTC-15 | โ | All remaining scenarios |
MTP decoding introduces minor noise in long-chain reasoning scenarios (~40pt drop), likely because draft tokens occasionally derail the generation path in multi-step planning tasks. However, the speed gain (1.17ร) and the fact that the drop is within noise range (single-run variance was 255pts for Qwopus MTP) makes this an acceptable trade-off.
Significant improvement โ MTP's faster decoding effectively prevents timeout failures (BF-12 previously hit 300s timeout, now completes in time) and the draft context helps maintain debugging focus.
| BF-ID | Without MTP | With MTP | ฮ |
|---|---|---|---|
| BF-01 | โ 100 | โ 100 | โ |
| BF-02 | โ 88 | โ 100 | +12 |
| BF-03 | โ 0 | โ 0 | โ |
| BF-04 | โ 100 | โ 100 | โ |
| BF-05 | โ 40 | โ ๏ธ 70 | +30 |
| BF-06 | โ 0 | โ 0 | โ |
| BF-07 | โ 100 | โ 100 | โ |
| BF-08 | โ 100 | โ 100 | โ |
| BF-09 | โ 100 | โ 100 | โ |
| BF-10 | โ 0 | โ 0 | โ |
| BF-11 | โ ๏ธ 60 | โ 100 | +40 |
| BF-12 | โ 0 (timeout) | โ 100 | +100 |
| BF-13 | โ 100 | โ 100 | โ |
| BF-14 | โ ๏ธ 70 | โ ๏ธ 60 | โ10 |
| BF-15 | โ ๏ธ 70 | โ ๏ธ 60 | โ10 |
Multi-Token Prediction (MTP) is a speculative decoding technique where a small "draft head" predicts multiple future tokens in parallel. The main model then verifies these drafts in a single forward pass, accepting correct predictions for up to 2-4ร speedup in practice.
The original Qwen3.5-9B base model ships with a 4-layer MTP head (~243M params) in its architecture configuration. During QwenPaw fine-tuning, the MTP head weights were stripped (only the config placeholder mtp_num_hidden_layers: 1 remained, but no actual tensors existed in the safetensors).
Recovery process:
Total injected parameters: 243.3M (2.7% of main model)
MTP acceptance rate (draft-n-max=2): ~50% (1083 accepted / 2166 generated across all benchmarks)
The MTP head is a lightweight 4-layer MLP decoder that maps the main model's last hidden state to future token logits. It sits entirely in speculative decoding space โ the main model's weights are unchanged, so no fine-tuning or retraining is needed. The head simply needs to exist with compatible dimensions for llama.cpp's --spec-type draft-mtp to activate.
| Model | Total | ToolCall-15 | HermesAgent-20 | BugFind-15 | Total Time |
|---|---|---|---|---|---|
| ๐พ QwenPaw MTP 9B | 4035 ๐ฅ | 100% ๐ฅ | 75.3% | 68.7% | 7.8min ๐ฅ |
| ๐พ QwenPaw 9B (no MTP) | 3873 | 93.3% | 77.2% ๐ฅ | 61.9% | 14.7min |
| ๐ง Qwopus 9B MTP | 3935 | 93.3% | 67.3% โ ๏ธ | 79.0% ๐ฅ | 21.3min โ ๏ธ |
| ๐ง Qwen 35B Thinking ON | 1445 (HA only) | โ | 72.3% | โ | 7.0min |
| โก Qwen 35B Thinking OFF | 1370 (HA only) | โ | 68.5% | โ | 5.1min |
| ๐ฎ Gemma 4 26B | 1405 (HA only) | โ | 70.3% | โ | 18.6min |
QwenPaw MTP wins on 2/3 benchmarks + total score + total time. The only benchmark it loses is BugFind-15 (to Qwopus MTP), but Qwopus suffers from severe instability (255pt variance on HermesAgent-20, with a worst-case 6.2min timeout).
direction_index = 21.13 attn.o_proj.max_weight = 1.42 attn.o_proj.max_weight_position = 21.72 attn.o_proj.min_weight = 1.11 attn.o_proj.min_weight_distance = 18.14 mlp.down_proj.max_weight = 1.48 mlp.down_proj.max_weight_position = 21.23 mlp.down_proj.min_weight = 1.47 mlp.down_proj.min_weight_distance = 17.47
--spec-type draft-mtp with --spec-draft-n-max 2| File | Size | Notes |
|---|---|---|
QwenPaw-Flash-9B-heretic-MTP-Q8_0.gguf |
~9.2GB | High quality, near lossless |
QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf |
~7.1GB | โ Recommended, best value |
QwenPaw-Flash-9B-heretic-MTP-Q4_K_M.gguf |
~5.4GB | Compact |
mmproj-BF16 |
~880MB | Vision encoder (multimodal) โ same as non-MTP version |
--spec-type draft-mtp
--spec-draft-n-max 2
# Start server with MTP enabled llama-server -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \ -ngl 99 -fa on -c 8192 \ --spec-type draft-mtp --spec-draft-n-max 2 \ --host 0.0.0.0 --port 8088 # Or with CLI llama-cli -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \ -ngl 99 -fa on -c 8192 \ --spec-type draft-mtp --spec-draft-n-max 2 \ -p "Write a Python script to..."
# The model works as a normal GGUF too โ just omit spec args llama-server -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \ -ngl 99 -fa on -c 8192 \ --host 0.0.0.0 --port 8088
Load the GGUF file directly. For MTP speculative decoding, LM Studio must support --spec-type โ if not, the model functions as a standard 9B model.