Qwen3.6-40B-Deckard-MTP GGUF

The first and only GGUFs of DavidAU's Qwen3.6-40B Opus-Deckard with working Multi-Token Prediction (MTP) speculative decoding — and, with an external mmproj, working vision.

This repo takes the Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF quants and injects an MTP head transplanted from the base Qwen3.6-27B architecture. No other published GGUF of this model includes MTP support.

Available Quants

File Body quant MTP head ~Size Best for
Qwen3.6-40B-Deckard-MTP-Q6_K.gguf Q6_K (~97% BF16) BF16 ~31 GB Highest fidelity; most VRAM; longest-tested
Qwen3.6-40B-Deckard-MTP-Q5_K_M.gguf Q5_K_M Q8_0 ~28 GB Balanced
Qwen3.6-40B-Deckard-MTP-Q4_K_M.gguf Q4_K_M Q4_K ~24 GB Lowest VRAM; smallest head

On the differing head precisions: the MTP head is grafted at whatever precision its donor carried (raw byte-copy, no requantization). I tested head precision against draft acceptance directly — across multiple seeds at draft depth n=3, a higher-precision head (Q8) and a body-matched head (Q4) landed within measurement noise of each other on a Q4 body. Acceptance appears dominated by the body quant and by context/task, not by head precision. So each quant carries a head sized to keep the file small rather than chasing a precision that didn't measurably move acceptance. The Q6_K is the original release and the variant tested over the longest duration (the multi-hour coding-session data below is all Q6_K); it retains its BF16 head for continuity. The Q4 and Q5 are newer and validated on shorter runs.

What's Different

  • MTP speculative decoding works out of the box — no separate draft model needed
  • Vision works via an external mmproj — the model accepts image input when paired with a Qwen3.6 vision projector, because the expanded 40B preserves the 27B's 5120 hidden dimension (see Vision / Multimodal below)
  • MTP and vision run simultaneously — confirmed on llama.cpp b9240+; image processing and MTP speculative decoding co-fire in the same request
  • MTP head grafted from base 27B, not fine-tuned — head precision per quant chosen for footprint, not acceptance (tested within noise across seeds; see Available Quants)
  • High sustained acceptance — 85-100% in established conversation context on coding tasks (temp 0.6, thinking mode); lower on fresh/short context, on image turns, and on less predictable content like creative writing (see What affects acceptance)
  • ~40% generation speedup — 50-58 t/s vs ~40 t/s baseline on an RTX PRO 6000 Blackwell

How This Was Made

DavidAU's 40B Deckard model was expanded from the base Qwen3.6-27B (64 layers → 96 layers, same hidden dimension of 5120). The expansion preserved the model width but did not include the MTP head from the base architecture.

The MTP head is architecturally a single transformer block (attention + SwiGLU FFN) plus projection layers (eh_proj, enorm, hnorm, shared_head_norm) that takes the main model's hidden state and predicts the next token. Since the hidden dimension (5120) is identical between the 27B and the expanded 40B, the MTP head tensors are dimensionally compatible.

The injection process:

  1. Extracted all 15 MTP tensors from blk.64 of a Qwen3.6-27B donor GGUF (at the donor's native precision)
  2. Remapped them to blk.96 (the MTP layer index for the 97-block 40B model)
  3. Binary-patched the target GGUF: inserted nextn_predict_layers = 1 metadata, updated block_count from 96 to 97, appended MTP tensor info and data
  4. Original model tensor data is byte-for-byte identical to the source quant — zero re-serialization of existing weights

The MTP head was not fine-tuned on the 40B's hidden states. Acceptance comes purely from the dimensional compatibility between the base 27B and the expanded 40B (shared 5120 hidden dim). Measured per-position acceptance on a coding task (temp 0.6, thinking on): ~0.91 / 0.82 / 0.74 at draft depth n=1 / 2 / 3 on fresh context, rising to 85-100% in sustained conversation. This is comparable to — and at times better than — a natively-trained 27B MTP head at the same draft depths, which is notable given this head received zero training; your mileage will vary by task and context. Self-distillation on the 40B's actual output distribution would likely lift the fresh-context and image-turn rates further.

What affects acceptance

MTP acceptance is not a single fixed number — it depends heavily on how predictable the next tokens are given the model's internal hidden state. This matters when choosing a draft depth and when interpreting the numbers below.

  • Highly predictable content (code, structured output, established conversation context): the next token is strongly determined by the hidden state, so the MTP head's drafts match the verifier often. This is where MTP shines — high acceptance, big speedups.
  • Less predictable content (open-ended creative writing, fresh context): each token does not as strongly imply the next in a single deterministic direction, so the head's chained drafts diverge from the verifier more often. Expect lower acceptance and a smaller speedup on creative work.

A large part of this is simply that the head was not trained alongside the 40B. A trained head learns the target's hidden-state-to-next-token mapping; a grafted head relies on the borrowed 27B mapping being close enough, which holds best where the next token is "obvious" and degrades where it isn't.

One honest unknown: I have not verified whether the sampler temperature is applied to the MTP head's own draft distribution in the same way it's applied to the main model. Empirically, lower temperature (peakier distributions, fewer plausible next tokens) tracks with higher acceptance, and higher-temperature creative settings track with lower acceptance — but whether that's the temperature acting on the head directly or just the underlying content being less predictable, I can't yet say for certain. Training the head is on the list of things to try; no promises on timeline.

Vision / Multimodal

The 40B Deckard can do image understanding when paired with an external Qwen3.6 vision projector (mmproj). The mmproj is not bundled in this repo — you supply it at launch with --mmproj.

Why a 27B mmproj works on the 40B

The exact same architectural fact that made the MTP graft work makes the vision projector work: the expanded 40B preserves the base 27B's hidden dimension of 5120. An mmproj projects encoded image features into the model's embedding space at n_embd, so a projector built against a 5120-wide Qwen3.6 model is dimensionally compatible with this 40B regardless of its greater depth. The projector fits the socket; the extra layers are downstream of where image embeddings inject. This is the same "interface is the embedding width, not the layer count" principle behind the MTP head.

The projector used and validated here is mmproj-Qwen3.6-27B-f16.gguf from froggeric's repo (~1.16 GiB worst-case VRAM).

Confirmed behavior

Validated on an RTX PRO 6000 Blackwell with llama.cpp b9352:

  • Accurate fine text reading — correctly reads small UI labels, clock times, and dropdown values from screenshots
  • Layout and UX reasoning — identifies structural redundancy, infers interaction models (e.g. single-click vs double-click navigation) from static frames, not just object labeling
  • Multi-turn visual memory — holds and ranks multiple images across a conversation, self-corrects when a duplicate image is sent
  • Vision + MTP together — image processing (~2.4 s for the first image, ~0.8 s for subsequent) and MTP speculative decoding co-fire in the same request; decode held ~48-54 t/s

MTP acceptance on vision turns

MTP continues to draft and accept during image turns, but at a lower rate than pure text:

Turn type MTP acceptance
Pure text (in-conversation) 85-100%
Image turns ~49%

This is expected and benign. The MTP head was grafted against the model's text distribution, so image-token sequences are out-of-distribution for the draft head — its predictions around the image are less accurate. Acceptance does not collapse to zero, so MTP remains worth running on vision turns (you still draft and land roughly half), and it returns to the full 85-100% range on the text turns of the same conversation. A vision-aware MTP head (self-distilled on multimodal hidden states) would lift the image-turn rate, but that is a research project, not a fix.

Note on find_slot: non-consecutive token position warnings: When an image is injected mid-sequence on this hybrid GDN + MTP + checkpoint stack, llama.cpp emits a burst of non-consecutive token position warnings during image processing. In testing these were noisy but benign — they did not corrupt description accuracy or break MTP drafting. If you also run context checkpoints, this is the same subsystem tracked in llama.cpp #23371; start at modest context if you hit VRAM pressure.

Launch with vision

./llama-server \
    -m Qwen3.6-40B-Deckard-MTP-Q6_K.gguf \
    --mmproj mmproj-Qwen3.6-27B-f16.gguf \
    --host 0.0.0.0 --port 8080 \
    -ngl 999 --flash-attn on --jinja \
    --image-min-tokens 1024 \
    --spec-type draft-mtp --spec-draft-n-max 2 \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0

Build requirement for vision + MTP: vision combined with MTP requires llama.cpp b9240 or newer. Earlier builds (the original PR #22673) crashed when combining vision with MTP; this was fixed in mainline. --image-min-tokens 1024 is recommended for Qwen-VL grounding accuracy on dense images.

Client note (OpenCode and other OpenAI-compatible clients): some clients strip image attachments unless the custom model is declared vision-capable. For OpenCode (@ai-sdk/openai-compatible), add a modalities block to the model config so images reach the server:

"qwen36-40b-deckard": {
  "name": "Qwen3.6 40B Deckard",
  "modalities": { "input": ["text", "image"], "output": ["text"] }
}

To confirm the server side independent of any client, send an image directly to /v1/chat/completions with an image_url content part and check for accurate description.

Model Specifications (Q6_K)

The table below describes the Q6_K variant. The Q5_K_M and Q4_K_M differ in body quant, MTP head precision, total tensor types, and file size — see Available Quants for the per-file summary.

Parameter Value
Base Model Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF
Architecture qwen35 (dense, not MoE)
Parameters 40B (expanded from 27B)
Layers 96 main + 1 MTP head = 97 total
Hidden Dimension 5120
Quantization NEO-CODE Di-IMatrix Q6_K (main model, ~97% of BF16) + BF16 (MTP head)
Total Tensors 1290 (1275 original + 15 MTP)
File Size ~30.97 GB
Context Length 262,144 tokens
MTP Donor Qwen3.6-27B (BF16 safetensors)
Vision Supported via external mmproj (5120-compatible Qwen3.6 projector); not bundled
Vision MTP Acceptance ~49% on image turns (text-grafted head, out-of-distribution on image tokens)

MTP Head Tensors

The following 15 tensors were injected at blk.96 (tensor types shown for the Q6_K variant, where the head is BF16; the Q5 head is Q8_0 and the Q4 head is Q4_K):

Tensor Shape Type (Q6_K build)
blk.96.nextn.eh_proj.weight [10240, 5120] BF16
blk.96.ffn_down.weight [17408, 5120] BF16
blk.96.ffn_gate.weight [5120, 17408] BF16
blk.96.ffn_up.weight [5120, 17408] BF16
blk.96.attn_k.weight [5120, 1024] BF16
blk.96.attn_q.weight [5120, 12288] BF16
blk.96.attn_v.weight [5120, 1024] BF16
blk.96.attn_output.weight [6144, 5120] BF16
blk.96.attn_norm.weight [5120] F32
blk.96.post_attention_norm.weight [5120] F32
blk.96.attn_k_norm.weight [256] F32
blk.96.attn_q_norm.weight [256] F32
blk.96.nextn.shared_head_norm.weight [5120] F32
blk.96.nextn.enorm.weight [5120] F32
blk.96.nextn.hnorm.weight [5120] F32

Recommended Settings

llama.cpp / llama-server

./llama-server \
    -m Qwen3.6-40B-Deckard-MTP-Q6_K.gguf \
    --host 0.0.0.0 --port 8080 \
    -ngl 999 --flash-attn on --jinja \
    --spec-type draft-mtp --spec-draft-n-max 2 \
    --temp 0.6 --top-k 20 --top-p 0.95

Swap -m for the Q5 or Q4 file as needed. For a vision-enabled launch, see Vision / Multimodal → Launch with vision.

On draft depth (--spec-draft-n-max): n=2 is a good default and tends to be the throughput sweet spot for predictable content like code — you draft more tokens per pass without acceptance falling far enough to hurt. n=1 is the conservative floor (highest per-token acceptance, smallest speedup). n=3 can win on very structured output but degrades faster on creative/open-ended text. Higher draft depths reward predictable content and penalize unpredictable content — tune to your workload.

Build requirement: MTP support requires llama.cpp with PR #22673 merged (mainline as of late May 2026). MTP + vision together requires b9240 or newer.

Sampling Parameters

Based on Qwen's official recommendations for the base architecture:

Use Case Temperature Top-P Top-K Presence Penalty
Coding (thinking mode) 0.6 0.95 20 0.0
General (thinking mode) 1.0 0.95 20 1.5
General (instruct/no-think) 0.7 0.8 20 1.5

DavidAU's additional guidance: rep_pen 1.05–1.1 for creative work with lower quants. Min context window 8K–16K.

VRAM Notes

Qwen3.6 uses a hybrid GDN (Gated DeltaNet) + full attention architecture at a 3:1 ratio. In the 40B (96 layers), 72 layers are GDN with fixed-size recurrent state (~225 MiB, constant regardless of context length) and 24 layers use full attention with traditional KV cache.

For reference, the base 27B (16 attention layers) uses ~150 MiB recurrent state and ~64 KB/token for KV cache at FP16. The 40B has 1.5x the attention layers (24 vs 16), so expect roughly 1.5x the KV cache cost per token. With KV cache quantization (--cache-type-k q8_0 --cache-type-v q8_0 or TurboQuant 3-bit), this drops substantially.

Component Size (Q6_K)
Model weights (Q6_K) + MTP head (BF16) ~31 GB
Recurrent state (fixed) ~225 MiB
mmproj vision encoder (when loaded, f16) ~1.16 GiB
KV cache per token (FP16, 24 attn layers) ~96 KB
KV cache at 32K context (FP16) ~3 GB
KV cache at 128K context (FP16) ~12 GB
KV cache at 262K context (FP16) ~25 GB

The Q5_K_M (28 GB) and Q4_K_M (24 GB) reduce the weights line accordingly; KV cache, recurrent state, and mmproj figures are unchanged since they depend on architecture and context, not body quant. These are estimates extrapolated from measured 27B numbers scaled by the 1.5x attention layer ratio. Actual usage depends on your --cache-type-k/v settings, batch size, and framework overhead. With q8_0 cache quantization, halve the KV cache numbers. With TurboQuant 3-bit, divide by ~4.6x.

Context scaling note: in llama-server, -c is the total KV budget across all slots and is divided by --parallel. To give each of N parallel slots a target context, set -c = (per-slot target × N), capped per-slot at the 262K native limit (YaRN required beyond, with a short-context quality tax). Concurrency comes from parallel slots on one model load — you do not need separate model instances for more concurrent agents.

Benchmarks

Measured on NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM, 1,792 GB/s bandwidth). The MTP head is untrained — these results are achieved purely from dimensional compatibility between the 27B donor and the 40B expanded model. The sustained-session data below is from the Q6_K variant, which has been tested over the longest duration. Q4 and Q5 were validated on shorter runs and land in a similar band; acceptance is dominated by body quant, context, and content predictability rather than head precision.

Per-position acceptance (fresh context, coding task, temp 0.6, thinking on)

Draft depth Acceptance
n=1 ~0.91
n=2 ~0.82
n=3 ~0.74

For loose reference, a natively-trained 27B MTP head sits in a broadly similar range at these depths (roughly high-0.8s falling toward low-0.6s by n=3 in third-party reports). This grafted head is comparable, and sometimes better — encouraging for a head that received no training — but cross-setup MTP numbers are measured under differing runtimes, quants, and conditions, so treat any head-to-head as approximate.

Sustained acceptance (Q6_K, in-conversation, by context depth)

Context Depth Acceptance Rate Notes
Fresh context (~5K tokens) ~72% Cold start, no prior conversation
Mid conversation (~55-65K) 95-100% Seven consecutive 100% runs observed
Deep context (~65-80K) 85-98% Sustained high acceptance
Very deep context (~80-87K) 86-98% No degradation at depth
Image turns (vision) ~49% Text-grafted head is OOD on image tokens; does not collapse

Acceptance rate improves as conversation context builds — the model's output distribution narrows within an established context, making MTP predictions more accurate.

Throughput (Q6_K)

Metric With MTP Without MTP (baseline)
Generation (fresh context) 56-58 t/s ~40 t/s
Generation (50K+ context) 50-55 t/s ~35 t/s
Generation (80K+ context) 50-51 t/s ~30-35 t/s
Generation (vision turns) ~48-54 t/s
Prompt processing ~1,200-1,800 t/s ~1,200-1,800 t/s
Image processing (per image) ~0.8-2.4 s
Effective speedup ~40%

Raw Data (Q6_K)

Acceptance rates from a continuous coding session (~85 request/response cycles, 54K-87K context):

94.7%, 81.3%, 95.2%, 100%, 92.9%, 97.6%, 94.1%, 90.9%, 100%, 100%,
95.5%, 100%, 100%, 100%, 96.3%, 94.4%, 98.0%, 96.3%, 94.4%, 100%,
100%, 98.2%, 97.7%, 92.0%, 100%, 98.3%, 95.3%, 98.2%, 92.0%, 97.7%,
84.5%, 94.2%, 87.1%, 94.7%, 91.7%, 89.6%, 91.1%, 90.4%, 98.2%,
86.5%, 98.6%, 85.7%

Note on temperature and acceptance rate: All benchmarks were measured at temperature 0.6 (Qwen's recommended setting for thinking-mode coding tasks). Lower temperature produces peakier distributions with fewer plausible next tokens, which tracks with higher MTP acceptance; higher-temperature creative settings track with lower acceptance. Whether temperature acts on the MTP head's own draft distribution directly, or whether this is just a byproduct of less-predictable content, is not something I've confirmed (see What affects acceptance).

Injection Script

The MTP head was injected using a custom Python script that performs binary-level GGUF patching. The script:

  1. Reads the donor GGUF with the gguf Python library to extract MTP tensors
  2. Copies the target GGUF's header and KV metadata as raw bytes (no re-serialization)
  3. Appends the nextn_predict_layers = 1 metadata entry
  4. Copies original tensor info verbatim, appends MTP tensor info entries
  5. Copies all original tensor data byte-for-byte, appends MTP tensor data
  6. Patches block_count from 96 to 97

This approach preserves every byte of the original model's tensor data — no re-quantization, no shape re-serialization. Because it's a raw byte-copy, the head is carried at whatever precision the donor GGUF used, which is why the three quants here ship heads of different precision. The script is available in this repository as inject_mtp_40b.py.

Lineage

Qwen3.6-27B (base, 64 layers)
├── DavidAU: Heretic abliteration
├── DavidAU: Deckard fine-tune (5 datasets)
├── DavidAU: Layer expansion to 40B (96 layers)
├── DavidAU: Claude 4.6 Opus reasoning distillation
├── DavidAU: NEO-CODE Di-IMatrix quantization (dual imatrix; Q6_K ~97% BF16)
└── williampieh: MTP head injection from base 27B (blk.96, donor-native precision per quant)
                 + vision via external 27B mmproj (5120-compatible)

Credits

  • DavidAU — Original Qwen3.6-40B Deckard model creation, expansion, fine-tuning, and GGUF quantization
  • Qwen Team (Alibaba) — Qwen3.6-27B base model and MTP architecture
  • am17an — llama.cpp MTP support PR
  • froggeric — Qwen3.6-27B mmproj used for vision, and documentation that vision + MTP works on b9240+

License

Apache 2.0 (inherited from base model)

About

Created by William Pieh / PiehSoft LLC. MTP injection tooling and methodology developed in collaboration with Claude (Anthropic).

Downloads last month
1,215
GGUF
Model size
39B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

5-bit

6-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for PiehSoft/Qwen3.6-40B-Deckard-MTP