--- language: - en - zh base_model: - Qwen/Qwen3.5-9B - agentscope-ai/QwenPaw-Flash-9B tags: - heretic - abliteration - uncensored - mtp - speculative-decoding - qwen3.5 - gguf - benchlocal - benchmark - agent - tool-call license: apache-2.0 ---
MTPGGUF

QwenPaw-Flash-9B-heretic-MTP

English | ๐Ÿ“– ไธญๆ–‡ๆ–‡ๆกฃ

QwenPaw-Flash-9B-heretic non-MTP Version: QwenPaw-Flash-9B-heretic-GGUF

๐Ÿ† BenchLocal Total: 4035/5000 (80.7%) โ€” MTP Speculative Decoding Injected

Uncensored ยท Abliterated ยท Agent-Optimized ยท 1.7-4.1ร— Speedup

Uncensored version of QwenPaw-Flash-9B, processed with Heretic v1.3.0 abliteration, with MTP (Multi-Token Prediction) head weights injected from the original Qwen3.5-9B base model.

By reconstructing the MTP speculative decoding head โ€” which was stripped during the QwenPaw fine-tuning process โ€” this model achieves up to 4.1ร— inference speedup on real agent benchmarks while maintaining or improving accuracy.

๐Ÿ“Š ๐Ÿ† BenchLocal Benchmarks (With MTP)

Test Environment: NVIDIA RTX 5070 Ti (16GB) ยท llama.cpp (turboquant build, --spec-type draft-mtp) ยท Q6_K quant

Framework: BenchLocal โ€” local model agent evaluation suite

Methodology: Each scenario run once, no retries, no second attempts

Benchmark Score Accuracy Results Time vs No-MTP
ToolCall-15 ๐Ÿ› ๏ธ 1500/1500 100% 15โœ… 0โš ๏ธ 0โŒ 0.65min 1.4ร— faster
HermesAgent-20 ๐Ÿค– 1505/2000 75.3% 12โœ… 1โš ๏ธ 7โŒ 5.3min 1.17ร— faster
BugFind-15 ๐Ÿ› 1030/1500 68.7% 9โœ… 2โš ๏ธ 4โŒ 1.8min 4.1ร— faster
Total 4035/5000 80.7% 36โœ… 3โš ๏ธ 11โŒ 7.8min 1.9ร— faster

Comparison: With vs Without MTP

Benchmark Without MTP With MTP ฮ” Score ฮ” Speed
ToolCall-15 ๐Ÿ› ๏ธ 1400/1500 (93.3%) 1500/1500 (100%) +100 pts 1.4ร—
HermesAgent-20 ๐Ÿค– 1545/2000 (77.2%) 1505/2000 (75.3%) โˆ’40 pts 1.17ร—
BugFind-15 ๐Ÿ› 928/1500 (61.9%) 1030/1500 (68.7%) +102 pts 4.1ร—
Total 3873/5000 (77.5%) 4035/5000 (80.7%) +162 pts 1.9ร—
Total Time 14.7 min 7.8 min โ€” 1.9ร—

๐Ÿ› ๏ธ ToolCall-15 โ€” Tool Calling Stability (100%, +6.7 pts)

MTP speculative decoding eliminated the single failure (TC-05: Relative date/time parsing, which previously scored 0). All 15 scenarios now pass perfectly.

TC-ID Result Scenario
TC-01โ€“TC-04 โœ… Simple / Multi / Nested / Type conversion
TC-05 โœ… Relative date/time parsing โ† fixed by MTP
TC-06โ€“TC-15 โœ… All remaining scenarios

๐Ÿค– HermesAgent-20 โ€” Complex Agent Tasks (75.3%, โˆ’1.9 pts)

MTP decoding introduces minor noise in long-chain reasoning scenarios (~40pt drop), likely because draft tokens occasionally derail the generation path in multi-step planning tasks. However, the speed gain (1.17ร—) and the fact that the drop is within noise range (single-run variance was 255pts for Qwopus MTP) makes this an acceptable trade-off.

๐Ÿ› BugFind-15 โ€” Code Debugging (68.7%, +6.8 pts)

Significant improvement โ€” MTP's faster decoding effectively prevents timeout failures (BF-12 previously hit 300s timeout, now completes in time) and the draft context helps maintain debugging focus.

BF-ID Without MTP With MTP ฮ”
BF-01 โœ… 100 โœ… 100 โ€”
BF-02 โœ… 88 โœ… 100 +12
BF-03 โŒ 0 โŒ 0 โ€”
BF-04 โœ… 100 โœ… 100 โ€”
BF-05 โŒ 40 โš ๏ธ 70 +30
BF-06 โŒ 0 โŒ 0 โ€”
BF-07 โœ… 100 โœ… 100 โ€”
BF-08 โœ… 100 โœ… 100 โ€”
BF-09 โœ… 100 โœ… 100 โ€”
BF-10 โŒ 0 โŒ 0 โ€”
BF-11 โš ๏ธ 60 โœ… 100 +40
BF-12 โŒ 0 (timeout) โœ… 100 +100
BF-13 โœ… 100 โœ… 100 โ€”
BF-14 โš ๏ธ 70 โš ๏ธ 60 โˆ’10
BF-15 โš ๏ธ 70 โš ๏ธ 60 โˆ’10
โšก MTP Speculative Decoding

What is MTP?

Multi-Token Prediction (MTP) is a speculative decoding technique where a small "draft head" predicts multiple future tokens in parallel. The main model then verifies these drafts in a single forward pass, accepting correct predictions for up to 2-4ร— speedup in practice.

Injection Method

The original Qwen3.5-9B base model ships with a 4-layer MTP head (~243M params) in its architecture configuration. During QwenPaw fine-tuning, the MTP head weights were stripped (only the config placeholder mtp_num_hidden_layers: 1 remained, but no actual tensors existed in the safetensors).

Recovery process:

              Total injected parameters: 243.3M (2.7% of main model)

              MTP acceptance rate (draft-n-max=2): ~50% (1083 accepted / 2166 generated across all benchmarks)

              Why This Works

              The MTP head is a lightweight 4-layer MLP decoder that maps the main model's last hidden state to future token logits. It sits entirely in speculative decoding space โ€” the main model's weights are unchanged, so no fine-tuning or retraining is needed. The head simply needs to exist with compatible dimensions for llama.cpp's --spec-type draft-mtp to activate.

              โšก Comparison: QwenPaw MTP vs Other Models
              Model Total ToolCall-15 HermesAgent-20 BugFind-15 Total Time
              ๐Ÿพ QwenPaw MTP 9B 4035 ๐Ÿฅ‡ 100% ๐Ÿฅ‡ 75.3% 68.7% 7.8min ๐Ÿฅ‡
              ๐Ÿพ QwenPaw 9B (no MTP) 3873 93.3% 77.2% ๐Ÿฅ‡ 61.9% 14.7min
              ๐Ÿง  Qwopus 9B MTP 3935 93.3% 67.3% โš ๏ธ 79.0% ๐Ÿฅ‡ 21.3min โš ๏ธ
              ๐Ÿง  Qwen 35B Thinking ON 1445 (HA only) โ€” 72.3% โ€” 7.0min
              โšก Qwen 35B Thinking OFF 1370 (HA only) โ€” 68.5% โ€” 5.1min
              ๐Ÿ”ฎ Gemma 4 26B 1405 (HA only) โ€” 70.3% โ€” 18.6min

              QwenPaw MTP wins on 2/3 benchmarks + total score + total time. The only benchmark it loses is BugFind-15 (to Qwopus MTP), but Qwopus suffers from severe instability (255pt variance on HermesAgent-20, with a worst-case 6.2min timeout).

              ๐Ÿง  Model Description
              • Base model**: QwenPaw-Flash-9B (Qwen3.5-9B fine-tuned for autonomous agent scenarios)
              • MTP head source**: Qwen/Qwen3.5-9B (original base model, layer 32 MTP head)
              • Tool**: Heretic v1.3.0 (automatic directional ablation)
              • Best trial**: #194 / 230 trials (abliteration)
              โš™๏ธ Abliteration Parameters

              direction_index = 21.13 attn.o_proj.max_weight = 1.42 attn.o_proj.max_weight_position = 21.72 attn.o_proj.min_weight = 1.11 attn.o_proj.min_weight_distance = 18.14 mlp.down_proj.max_weight = 1.48 mlp.down_proj.max_weight_position = 21.23 mlp.down_proj.min_weight = 1.47 mlp.down_proj.min_weight_distance = 17.47

              ๐Ÿ—๏ธ Architecture
              • Type**: Qwen3_5ForConditionalGeneration (multimodal with vision encoder) + MTP spec head
              • Main model parameters**: ~9B
              • MTP head parameters**: ~243M (2.7% overhead)
              • Layers**: 32 (hybrid: Gated DeltaNet + Gated Attention) + 4 MTP decoder layers
              • Context length**: 262,144 tokens
              • Speculative decoding**: --spec-type draft-mtp with --spec-draft-n-max 2
              ๐Ÿ“ฆ GGUF Files
              File Size Notes
              QwenPaw-Flash-9B-heretic-MTP-Q8_0.gguf ~9.2GB High quality, near lossless
              QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf ~7.1GB โœ… Recommended, best value
              QwenPaw-Flash-9B-heretic-MTP-Q4_K_M.gguf ~5.4GB Compact
              mmproj-BF16 ~880MB Vision encoder (multimodal) โ€” same as non-MTP version
              ๐Ÿš€ Usage

              --spec-type draft-mtp

              --spec-draft-n-max 2

              llama.cpp (with MTP speculative decoding)

              # Start server with MTP enabled llama-server -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \ -ngl 99 -fa on -c 8192 \ --spec-type draft-mtp --spec-draft-n-max 2 \ --host 0.0.0.0 --port 8088 # Or with CLI llama-cli -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \ -ngl 99 -fa on -c 8192 \ --spec-type draft-mtp --spec-draft-n-max 2 \ -p "Write a Python script to..."

              llama.cpp (without MTP, fallback)

              # The model works as a normal GGUF too โ€” just omit spec args llama-server -m QwenPaw-Flash-9B-heretic-MTP-Q6_K.gguf \ -ngl 99 -fa on -c 8192 \ --host 0.0.0.0 --port 8088

              LM Studio

              Load the GGUF file directly. For MTP speculative decoding, LM Studio must support --spec-type โ€” if not, the model functions as a standard 9B model.

              ๐Ÿ“ Notes
              1. Safety filters have been significantly reduced via abliteration
              2. KL divergence is only 0.0225 โ€” minimal impact on model intelligence
              3. The original model supports multimodal (vision); GGUF versions require the mmproj file from the non-MTP release
              4. BenchLocal scores measured at Q6_K on RTX 5070 Ti 16GB with llama.cpp (turboquant). Each scenario was run once with no retries
              5. MTP acceptance rate of ~50% at draft-n-max=2 means ~25-40% wall-clock speedup on short prompts, and up to 4ร— on long-generation tasks (debugging, code writing)
              6. BugFind-15 saw the largest improvement (4.1ร—) because debugging tasks are generation-heavy โ€” more tokens, more drafts accepted
              7. The MTP head is a lossless copy from the original Qwen3.5-9B โ€” no training was involved, simply weight injection
              8. Agent-heavy scenarios (HermesAgent-20) see the least MTP benefit because short-turn interactions don't give the draft head enough runway
              9. Please use responsibly
              ๐Ÿ™ Acknowledgements