Qwen3.6-40B-Deckard-MTP GGUF
The first and only GGUFs of DavidAU's Qwen3.6-40B Opus-Deckard with working Multi-Token Prediction (MTP) speculative decoding — and, with an external mmproj, working vision.
This repo takes the Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF quants and injects an MTP head transplanted from the base Qwen3.6-27B architecture. No other published GGUF of this model includes MTP support.
Available Quants
| File | Body quant | MTP head | ~Size | Best for |
|---|---|---|---|---|
| Qwen3.6-40B-Deckard-MTP-Q6_K.gguf | Q6_K (~97% BF16) | BF16 | ~31 GB | Highest fidelity; most VRAM; longest-tested |
| Qwen3.6-40B-Deckard-MTP-Q5_K_M.gguf | Q5_K_M | Q8_0 | ~28 GB | Balanced |
| Qwen3.6-40B-Deckard-MTP-Q4_K_M.gguf | Q4_K_M | Q4_K | ~24 GB | Lowest VRAM; smallest head |
On the differing head precisions: the MTP head is grafted at whatever precision its donor carried (raw byte-copy, no requantization). I tested head precision against draft acceptance directly — across multiple seeds at draft depth n=3, a higher-precision head (Q8) and a body-matched head (Q4) landed within measurement noise of each other on a Q4 body. Acceptance appears dominated by the body quant and by context/task, not by head precision. So each quant carries a head sized to keep the file small rather than chasing a precision that didn't measurably move acceptance. The Q6_K is the original release and the variant tested over the longest duration (the multi-hour coding-session data below is all Q6_K); it retains its BF16 head for continuity. The Q4 and Q5 are newer and validated on shorter runs.
What's Different
- MTP speculative decoding works out of the box — no separate draft model needed
- Vision works via an external mmproj — the model accepts image input when paired with a Qwen3.6 vision projector, because the expanded 40B preserves the 27B's 5120 hidden dimension (see Vision / Multimodal below)
- MTP and vision run simultaneously — confirmed on llama.cpp b9240+; image processing and MTP speculative decoding co-fire in the same request
- MTP head grafted from base 27B, not fine-tuned — head precision per quant chosen for footprint, not acceptance (tested within noise across seeds; see Available Quants)
- High sustained acceptance — 85-100% in established conversation context on coding tasks (temp 0.6, thinking mode); lower on fresh/short context, on image turns, and on less predictable content like creative writing (see What affects acceptance)
- ~40% generation speedup — 50-58 t/s vs ~40 t/s baseline on an RTX PRO 6000 Blackwell
How This Was Made
DavidAU's 40B Deckard model was expanded from the base Qwen3.6-27B (64 layers → 96 layers, same hidden dimension of 5120). The expansion preserved the model width but did not include the MTP head from the base architecture.
The MTP head is architecturally a single transformer block (attention + SwiGLU FFN) plus projection layers (eh_proj, enorm, hnorm, shared_head_norm) that takes the main model's hidden state and predicts the next token. Since the hidden dimension (5120) is identical between the 27B and the expanded 40B, the MTP head tensors are dimensionally compatible.
The injection process:
- Extracted all 15 MTP tensors from blk.64 of a Qwen3.6-27B donor GGUF (at the donor's native precision)
- Remapped them to blk.96 (the MTP layer index for the 97-block 40B model)
- Binary-patched the target GGUF: inserted nextn_predict_layers = 1 metadata, updated block_count from 96 to 97, appended MTP tensor info and data
- Original model tensor data is byte-for-byte identical to the source quant, with zero re-serialization of existing weights
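The remapping step in the list above can be illustrated with a small helper. This is a minimal sketch, not the actual inject_mtp_40b.py: the function names are hypothetical, and it only covers the renaming of tensor names from the donor's MTP block (blk.64) to the target's (blk.96), not the binary patching.

```python
import re

MTP_SRC_BLOCK = 64  # MTP block index in the 27B donor GGUF (from the text)
MTP_DST_BLOCK = 96  # MTP block index in the 97-block 40B target (from the text)


def remap_mtp_name(name: str, src: int = MTP_SRC_BLOCK, dst: int = MTP_DST_BLOCK) -> str:
    """Rewrite 'blk.<src>.<rest>' to 'blk.<dst>.<rest>'.

    Hypothetical helper for illustration; raises if the name is not
    a tensor of the source block.
    """
    match = re.fullmatch(rf"blk\.{src}\.(.+)", name)
    if match is None:
        raise ValueError(f"not a tensor of block {src}: {name!r}")
    return f"blk.{dst}.{match.group(1)}"


def select_and_remap(names, src: int = MTP_SRC_BLOCK, dst: int = MTP_DST_BLOCK) -> dict:
    """Pick the donor tensors that belong to the MTP block and map old name -> new name."""
    prefix = f"blk.{src}."
    return {n: remap_mtp_name(n, src, dst) for n in names if n.startswith(prefix)}


if __name__ == "__main__":
    donor_names = ["blk.63.attn_q.weight", "blk.64.nextn.eh_proj.weight"]
    print(select_and_remap(donor_names))
```

Matching on the full `blk.64.` prefix (including the trailing dot) keeps ordinary body layers, and any block whose index merely starts with 64, out of the selection.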
The MTP head was not fine-tuned on the 40B's hidden states. Acceptance comes purely from the dimensional compatibility between the base 27B and the expanded 40B (shared 5120 hidden dim). Measured per-position acceptance on a coding task (temp 0.6, thinking on): ~0.91 / 0.82 / 0.74 at draft depth n=1 / 2 / 3 on fresh context, rising to 85-100% in sustained conversation. This is comparable to — and at times better than — a natively-trained 27B MTP head at the same draft depths, which is notable given this head received zero training; your mileage will vary by task and context. Self-distillation on the 40B's actual output distribution would likely lift the fresh-context and image-turn rates further.
What affects acceptance
MTP acceptance is not a single fixed number — it depends heavily on how predictable the next tokens are given the model's internal hidden state. This matters when choosing a draft depth and when interpreting the numbers below.
- Highly predictable content (code, structured output, established conversation context): the next token is strongly determined by the hidden state, so the MTP head's drafts match the verifier often. This is where MTP shines — high acceptance, big speedups.
- Less predictable content (open-ended creative writing, fresh context): each token does not as strongly imply the next in a single deterministic direction, so the head's chained drafts diverge from the verifier more often. Expect lower acceptance and a smaller speedup on creative work.
A large part of this is simply that the head was not trained alongside the 40B. A trained head learns the target's hidden-state-to-next-token mapping; a grafted head relies on the borrowed 27B mapping being close enough, which holds best where the next token is "obvious" and degrades where it isn't.
One honest unknown: I have not verified whether the sampler temperature is applied to the MTP head's own draft distribution in the same way it's applied to the main model. Empirically, lower temperature (peakier distributions, fewer plausible next tokens) tracks with higher acceptance, and higher-temperature creative settings track with lower acceptance — but whether that's the temperature acting on the head directly or just the underlying content being less predictable, I can't yet say for certain. Training the head is on the list of things to try; no promises on timeline.
Vision / Multimodal
The 40B Deckard can do image understanding when paired with an external Qwen3.6 vision projector (mmproj). The mmproj is not bundled in this repo — you supply it at launch with --mmproj.
Why a 27B mmproj works on the 40B
The exact same architectural fact that made the MTP graft work makes the vision projector work: the expanded 40B preserves the base 27B's hidden dimension of 5120. An mmproj projects encoded image features into the model's embedding space at n_embd, so a projector built against a 5120-wide Qwen3.6 model is dimensionally compatible with this 40B regardless of its greater depth. The projector fits the socket; the extra layers are downstream of where image embeddings inject. This is the same "interface is the embedding width, not the layer count" principle behind the MTP head.
The projector used and validated here is mmproj-Qwen3.6-27B-f16.gguf from froggeric's repo (~1.16 GiB worst-case VRAM).
Confirmed behavior
Validated on an RTX PRO 6000 Blackwell with llama.cpp b9352:
- Accurate fine text reading — correctly reads small UI labels, clock times, and dropdown values from screenshots
- Layout and UX reasoning — identifies structural redundancy, infers interaction models (e.g. single-click vs double-click navigation) from static frames, not just object labeling
- Multi-turn visual memory — holds and ranks multiple images across a conversation, self-corrects when a duplicate image is sent
- Vision + MTP together — image processing (~2.4 s for the first image, ~0.8 s for subsequent) and MTP speculative decoding co-fire in the same request; decode held ~48-54 t/s
MTP acceptance on vision turns
MTP continues to draft and accept during image turns, but at a lower rate than pure text:
| Turn type | MTP acceptance |
|---|---|
| Pure text (in-conversation) | 85-100% |
| Image turns | ~49% |
This is expected and benign. The MTP head was grafted against the model's text distribution, so image-token sequences are out-of-distribution for the draft head — its predictions around the image are less accurate. Acceptance does not collapse to zero, so MTP remains worth running on vision turns (you still draft and land roughly half), and it returns to the full 85-100% range on the text turns of the same conversation. A vision-aware MTP head (self-distilled on multimodal hidden states) would lift the image-turn rate, but that is a research project, not a fix.
Note on find_slot: non-consecutive token position warnings: When an image is injected mid-sequence on this hybrid GDN + MTP + checkpoint stack, llama.cpp emits a burst of non-consecutive token position warnings during image processing. In testing these were noisy but benign: they did not corrupt description accuracy or break MTP drafting. If you also run context checkpoints, this is the same subsystem tracked in llama.cpp #23371; start at modest context if you hit VRAM pressure.
Launch with vision
./llama-server \
-m Qwen3.6-40B-Deckard-MTP-Q6_K.gguf \
--mmproj mmproj-Qwen3.6-27B-f16.gguf \
--host 0.0.0.0 --port 8080 \
-ngl 999 --flash-attn on --jinja \
--image-min-tokens 1024 \
--spec-type draft-mtp --spec-draft-n-max 2 \
--cache-type-k q8_0 --cache-type-v q8_0 \
--temp 0.6 --top-k 20 --top-p 0.95 --min-p 0
Build requirement for vision + MTP: vision combined with MTP requires llama.cpp b9240 or newer. Earlier builds (the original PR #22673) crashed when combining vision with MTP; this was fixed in mainline. --image-min-tokens 1024 is recommended for Qwen-VL grounding accuracy on dense images.
Client note (OpenCode and other OpenAI-compatible clients): some clients strip image attachments unless the custom model is declared vision-capable. For OpenCode (@ai-sdk/openai-compatible), add a modalities block to the model config so images reach the server:
"qwen36-40b-deckard": {
"name": "Qwen3.6 40B Deckard",
"modalities": { "input": ["text", "image"], "output": ["text"] }
}
To confirm the server side independent of any client, send an image directly to /v1/chat/completions with an image_url content part and check for accurate description.
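A direct server-side check might look like the following. This is a sketch under assumptions: the server is the llama-server instance from the launch command (port 8080), the image is passed as a base64 data URL in an OpenAI-style image_url content part, and the helper names and default prompt are hypothetical.

```python
import base64
import json
import urllib.request


def build_vision_request(image_bytes: bytes, prompt: str, mime: str = "image/png") -> dict:
    """Build an OpenAI-style chat payload with one text part and one image_url part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": data_url}},
                ],
            }
        ],
        "temperature": 0.6,
    }


def describe_image(path: str, prompt: str = "Describe this image.",
                   base_url: str = "http://localhost:8080") -> str:
    """POST the image to /v1/chat/completions and return the reply text.

    base_url is an assumption matching --port 8080 in the launch command.
    """
    with open(path, "rb") as handle:
        payload = build_vision_request(handle.read(), prompt)
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

If the description is accurate here but images go missing through a client, the client's model config is the likelier culprit than the server.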
Model Specifications (Q6_K)
The table below describes the Q6_K variant. The Q5_K_M and Q4_K_M differ in body quant, MTP head precision, total tensor types, and file size — see Available Quants for the per-file summary.
| Parameter | Value |
|---|---|
| Base Model | Qwen3.6-40B-Claude-4.6-Opus-Deckard-Heretic-Uncensored-Thinking-NEO-CODE-Di-IMatrix-MAX-GGUF |
| Architecture | qwen35 (dense, not MoE) |
| Parameters | 40B (expanded from 27B) |
| Layers | 96 main + 1 MTP head = 97 total |
| Hidden Dimension | 5120 |
| Quantization | NEO-CODE Di-IMatrix Q6_K (main model, ~97% of BF16) + BF16 (MTP head) |
| Total Tensors | 1290 (1275 original + 15 MTP) |
| File Size | ~30.97 GB |
| Context Length | 262,144 tokens |
| MTP Donor | Qwen3.6-27B (BF16 safetensors) |
| Vision | Supported via external mmproj (5120-compatible Qwen3.6 projector); not bundled |
| Vision MTP Acceptance | ~49% on image turns (text-grafted head, out-of-distribution on image tokens) |
MTP Head Tensors
The following 15 tensors were injected at blk.96 (tensor types shown for the Q6_K variant, where the head is BF16; the Q5 head is Q8_0 and the Q4 head is Q4_K):
| Tensor | Shape | Type (Q6_K build) |
|---|---|---|
| blk.96.nextn.eh_proj.weight | [10240, 5120] | BF16 |
| blk.96.ffn_down.weight | [17408, 5120] | BF16 |
| blk.96.ffn_gate.weight | [5120, 17408] | BF16 |
| blk.96.ffn_up.weight | [5120, 17408] | BF16 |
| blk.96.attn_k.weight | [5120, 1024] | BF16 |
| blk.96.attn_q.weight | [5120, 12288] | BF16 |
| blk.96.attn_v.weight | [5120, 1024] | BF16 |
| blk.96.attn_output.weight | [6144, 5120] | BF16 |
| blk.96.attn_norm.weight | [5120] | F32 |
| blk.96.post_attention_norm.weight | [5120] | F32 |
| blk.96.attn_k_norm.weight | [256] | F32 |
| blk.96.attn_q_norm.weight | [256] | F32 |
| blk.96.nextn.shared_head_norm.weight | [5120] | F32 |
| blk.96.nextn.enorm.weight | [5120] | F32 |
| blk.96.nextn.hnorm.weight | [5120] | F32 |
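The footprint of the grafted head follows directly from the shapes in the table. The sketch below tallies parameter count and byte size for the Q6_K build, assuming 2 bytes per BF16 element and 4 bytes per F32 element and ignoring GGUF alignment padding; the function name is hypothetical.

```python
from math import prod

# Shapes and types copied from the table above (Q6_K build, BF16 head).
MTP_TENSORS = {
    "blk.96.nextn.eh_proj.weight": ((10240, 5120), "BF16"),
    "blk.96.ffn_down.weight": ((17408, 5120), "BF16"),
    "blk.96.ffn_gate.weight": ((5120, 17408), "BF16"),
    "blk.96.ffn_up.weight": ((5120, 17408), "BF16"),
    "blk.96.attn_k.weight": ((5120, 1024), "BF16"),
    "blk.96.attn_q.weight": ((5120, 12288), "BF16"),
    "blk.96.attn_v.weight": ((5120, 1024), "BF16"),
    "blk.96.attn_output.weight": ((6144, 5120), "BF16"),
    "blk.96.attn_norm.weight": ((5120,), "F32"),
    "blk.96.post_attention_norm.weight": ((5120,), "F32"),
    "blk.96.attn_k_norm.weight": ((256,), "F32"),
    "blk.96.attn_q_norm.weight": ((256,), "F32"),
    "blk.96.nextn.shared_head_norm.weight": ((5120,), "F32"),
    "blk.96.nextn.enorm.weight": ((5120,), "F32"),
    "blk.96.nextn.hnorm.weight": ((5120,), "F32"),
}

BYTES_PER_ELEMENT = {"BF16": 2, "F32": 4}  # unquantized types only


def head_footprint(tensors=MTP_TENSORS):
    """Return (parameter_count, size_in_bytes) for the listed tensors."""
    params = sum(prod(shape) for shape, _ in tensors.values())
    size = sum(prod(shape) * BYTES_PER_ELEMENT[dtype] for shape, dtype in tensors.values())
    return params, size


if __name__ == "__main__":
    params, size = head_footprint()
    print(f"{params:,} parameters, {size / 2**20:.1f} MiB")
```

Under these assumptions the BF16 head comes to about 425 million parameters and a little over 800 MiB, which is why the lower quants carry a smaller head.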
Recommended Settings
llama.cpp / llama-server
./llama-server \
-m Qwen3.6-40B-Deckard-MTP-Q6_K.gguf \
--host 0.0.0.0 --port 8080 \
-ngl 999 --flash-attn on --jinja \
--spec-type draft-mtp --spec-draft-n-max 2 \
--temp 0.6 --top-k 20 --top-p 0.95
Swap -m for the Q5 or Q4 file as needed. For a vision-enabled launch, see Vision / Multimodal → Launch with vision.
On draft depth (--spec-draft-n-max): n=2 is a good default and tends to be the throughput sweet spot for predictable content like code — you draft more tokens per pass without acceptance falling far enough to hurt. n=1 is the conservative floor (highest per-token acceptance, smallest speedup). n=3 can win on very structured output but degrades faster on creative/open-ended text. Higher draft depths reward predictable content and penalize unpredictable content — tune to your workload.
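To make the depth trade-off concrete, the sketch below estimates tokens produced per verification pass. It assumes the per-position acceptance figures reported in this card (~0.91 / 0.82 / 0.74) are conditional rates, meaning position k is only tried if position k-1 was accepted, and it ignores the cost of running the draft head. Both are simplifying assumptions, so treat the result as an upper-bound intuition, not a throughput prediction.

```python
def expected_tokens_per_pass(per_position_acceptance):
    """Expected tokens emitted per verifier pass.

    The verifier always contributes one token; draft position k adds a
    token only if every draft position up to and including k was accepted.
    """
    expected = 1.0
    survive = 1.0  # probability that all drafts so far were accepted
    for p in per_position_acceptance:
        survive *= p
        expected += survive
    return expected


if __name__ == "__main__":
    fresh_context = [0.91, 0.82, 0.74]  # figures from this card, fresh context
    for depth in (1, 2, 3):
        tokens = expected_tokens_per_pass(fresh_context[:depth])
        print(f"n={depth}: ~{tokens:.2f} tokens per pass")
```

Each extra draft position adds less than the one before it, because its contribution is multiplied by the chance that all earlier drafts survived. When acceptance drops, as on creative text, that marginal gain shrinks faster, which matches the advice to lower the depth for unpredictable content.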
Build requirement: MTP support requires llama.cpp with PR #22673 merged (mainline as of late May 2026). MTP + vision together requires b9240 or newer.
Sampling Parameters
Based on Qwen's official recommendations for the base architecture:
| Use Case | Temperature | Top-P | Top-K | Presence Penalty |
|---|---|---|---|---|
| Coding (thinking mode) | 0.6 | 0.95 | 20 | 0.0 |
| General (thinking mode) | 1.0 | 0.95 | 20 | 1.5 |
| General (instruct/no-think) | 0.7 | 0.8 | 20 | 1.5 |
DavidAU's additional guidance: repetition penalty (rep_pen) of 1.05–1.1 for creative work with lower quants, and a minimum context window of 8K–16K.
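For clients that set sampling per request, the table can be kept as presets. This is a sketch: the preset keys and helper name are hypothetical, and it assumes the server accepts top_k as an extra field in the chat-completions body, which is an extension beyond the standard OpenAI parameters.

```python
# Values copied from the sampling table above; the keys are hypothetical labels.
SAMPLING_PRESETS = {
    "coding_thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "general_thinking": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 1.5},
    "general_instruct": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 1.5},
}


def chat_request(prompt: str, use_case: str = "coding_thinking") -> dict:
    """Build a chat-completions body with the preset for the given use case."""
    if use_case not in SAMPLING_PRESETS:
        raise KeyError(f"unknown use case: {use_case!r}")
    return {
        "messages": [{"role": "user", "content": prompt}],
        **SAMPLING_PRESETS[use_case],
    }


if __name__ == "__main__":
    print(chat_request("Write a binary search in Go."))
```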
VRAM Notes
Qwen3.6 uses a hybrid GDN (Gated DeltaNet) + full attention architecture at a 3:1 ratio. In the 40B (96 layers), 72 layers are GDN with fixed-size recurrent state (~225 MiB, constant regardless of context length) and 24 layers use full attention with traditional KV cache.
For reference, the base 27B (16 attention layers) uses ~150 MiB recurrent state and ~64 KB/token for KV cache at FP16. The 40B has 1.5x the attention layers (24 vs 16), so expect roughly 1.5x the KV cache cost per token. With KV cache quantization (--cache-type-k q8_0 --cache-type-v q8_0 or TurboQuant 3-bit), this drops substantially.
| Component | Size (Q6_K) |
|---|---|
| Model weights (Q6_K) + MTP head (BF16) | ~31 GB |
| Recurrent state (fixed) | ~225 MiB |
| mmproj vision encoder (when loaded, f16) | ~1.16 GiB |
| KV cache per token (FP16, 24 attn layers) | ~96 KB |
| KV cache at 32K context (FP16) | ~3 GB |
| KV cache at 128K context (FP16) | ~12 GB |
| KV cache at 262K context (FP16) | ~25 GB |
The Q5_K_M (~28 GB) and Q4_K_M (~24 GB) reduce the weights line accordingly; KV cache, recurrent state, and mmproj figures are unchanged since they depend on architecture and context, not body quant. These are estimates extrapolated from measured 27B numbers scaled by the 1.5x attention layer ratio. Actual usage depends on your --cache-type-k/v settings, batch size, and framework overhead. With q8_0 cache quantization, roughly halve the KV cache numbers. With TurboQuant 3-bit, divide them by ~4.6.
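The KV cache rows in the table are simple multiples of the per-token estimate. The sketch below reproduces them, assuming ~96 KB means 96 × 1024 bytes per token at FP16 and applying the rule of thumb that q8_0 halves the figure; like the table, it is an extrapolation, not a measurement.

```python
KV_BYTES_PER_TOKEN_FP16 = 96 * 1024  # ~96 KB/token estimate for 24 attention layers

# Rule-of-thumb scale factors relative to FP16 (q8_0 "halves" per the text).
CACHE_SCALE = {"f16": 1.0, "q8_0": 0.5}


def kv_cache_gib(context_tokens: int, cache_type: str = "f16") -> float:
    """Estimated KV cache size in GiB for a given context length."""
    if cache_type not in CACHE_SCALE:
        raise ValueError(f"no scale factor for cache type {cache_type!r}")
    total_bytes = context_tokens * KV_BYTES_PER_TOKEN_FP16 * CACHE_SCALE[cache_type]
    return total_bytes / 2**30


if __name__ == "__main__":
    for tokens in (32_768, 131_072, 262_144):
        print(f"{tokens:>7} tokens: {kv_cache_gib(tokens):.1f} GiB (f16), "
              f"{kv_cache_gib(tokens, 'q8_0'):.1f} GiB (q8_0)")
```

The recurrent state (~225 MiB) and the mmproj (~1.16 GiB) are fixed costs and would be added on top of whatever this returns.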
Context scaling note: in llama-server, -c is the total KV budget across all slots and is divided by --parallel. To give each of N parallel slots a target context, set -c = (per-slot target × N), with each slot capped at the 262K native limit (going beyond that requires YaRN, which costs some short-context quality). Concurrency comes from parallel slots on one model load, so you do not need separate model instances for more concurrent agents.
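The -c arithmetic is easy to get backwards, so here is a small sketch of it. The helper name is hypothetical; it applies the per-slot cap at the 262,144-token native limit and returns the total to pass to -c alongside --parallel.

```python
NATIVE_CONTEXT = 262_144  # native context limit from the specifications table


def total_context(per_slot_target: int, parallel_slots: int) -> int:
    """Total -c value so that each of `parallel_slots` slots gets the target context.

    The per-slot target is capped at the native limit (no YaRN assumed).
    """
    if per_slot_target < 1 or parallel_slots < 1:
        raise ValueError("per_slot_target and parallel_slots must be positive")
    per_slot = min(per_slot_target, NATIVE_CONTEXT)
    return per_slot * parallel_slots


if __name__ == "__main__":
    slots = 4
    ctx = total_context(65_536, slots)
    print(f"-c {ctx} --parallel {slots}")
```

Four agents at 64K each need the same total KV budget as one session at the full native context, so the KV cache figures above apply to the total, not per slot.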
Benchmarks
Measured on NVIDIA RTX PRO 6000 Blackwell (96 GB VRAM, 1,792 GB/s bandwidth). The MTP head is untrained — these results are achieved purely from dimensional compatibility between the 27B donor and the 40B expanded model. The sustained-session data below is from the Q6_K variant, which has been tested over the longest duration. Q4 and Q5 were validated on shorter runs and land in a similar band; acceptance is dominated by body quant, context, and content predictability rather than head precision.
Per-position acceptance (fresh context, coding task, temp 0.6, thinking on)
| Draft depth | Acceptance |
|---|---|
| n=1 | ~0.91 |
| n=2 | ~0.82 |
| n=3 | ~0.74 |
For loose reference, a natively-trained 27B MTP head sits in a broadly similar range at these depths (roughly high-0.8s falling toward low-0.6s by n=3 in third-party reports). This grafted head is comparable, and sometimes better — encouraging for a head that received no training — but cross-setup MTP numbers are measured under differing runtimes, quants, and conditions, so treat any head-to-head as approximate.
Sustained acceptance (Q6_K, in-conversation, by context depth)
| Context Depth | Acceptance Rate | Notes |
|---|---|---|
| Fresh context (~5K tokens) | ~72% | Cold start, no prior conversation |
| Mid conversation (~55-65K) | 95-100% | Seven consecutive 100% runs observed |
| Deep context (~65-80K) | 85-98% | Sustained high acceptance |
| Very deep context (~80-87K) | 86-98% | No degradation at depth |
| Image turns (vision) | ~49% | Text-grafted head is OOD on image tokens; does not collapse |
Acceptance rate improves as conversation context builds — the model's output distribution narrows within an established context, making MTP predictions more accurate.
Throughput (Q6_K)
| Metric | With MTP | Without MTP (baseline) |
|---|---|---|
| Generation (fresh context) | 56-58 t/s | ~40 t/s |
| Generation (50K+ context) | 50-55 t/s | ~35 t/s |
| Generation (80K+ context) | 50-51 t/s | ~30-35 t/s |
| Generation (vision turns) | ~48-54 t/s | — |
| Prompt processing | ~1,200-1,800 t/s | ~1,200-1,800 t/s |
| Image processing (per image) | ~0.8-2.4 s | — |
| Effective speedup | ~40% | — |
Raw Data (Q6_K)
Acceptance rates from a continuous coding session (~85 request/response cycles, 54K-87K context):
94.7%, 81.3%, 95.2%, 100%, 92.9%, 97.6%, 94.1%, 90.9%, 100%, 100%,
95.5%, 100%, 100%, 100%, 96.3%, 94.4%, 98.0%, 96.3%, 94.4%, 100%,
100%, 98.2%, 97.7%, 92.0%, 100%, 98.3%, 95.3%, 98.2%, 92.0%, 97.7%,
84.5%, 94.2%, 87.1%, 94.7%, 91.7%, 89.6%, 91.1%, 90.4%, 98.2%,
86.5%, 98.6%, 85.7%
Note on temperature and acceptance rate: All benchmarks were measured at temperature 0.6 (Qwen's recommended setting for thinking-mode coding tasks). Lower temperature produces peakier distributions with fewer plausible next tokens, which tracks with higher MTP acceptance; higher-temperature creative settings track with lower acceptance. Whether temperature acts on the MTP head's own draft distribution directly, or whether this is just a byproduct of less-predictable content, is not something I've confirmed (see What affects acceptance).
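For anyone who wants summary statistics rather than the raw list, the sketch below computes them from the values printed above. Only the figures listed in this card are included; the helper name is hypothetical.

```python
from statistics import mean, median

# Acceptance rates (%) copied from the Raw Data list above.
ACCEPTANCE_PCT = [
    94.7, 81.3, 95.2, 100, 92.9, 97.6, 94.1, 90.9, 100, 100,
    95.5, 100, 100, 100, 96.3, 94.4, 98.0, 96.3, 94.4, 100,
    100, 98.2, 97.7, 92.0, 100, 98.3, 95.3, 98.2, 92.0, 97.7,
    84.5, 94.2, 87.1, 94.7, 91.7, 89.6, 91.1, 90.4, 98.2,
    86.5, 98.6, 85.7,
]


def summarize(rates):
    """Return count, mean, median, min, and max of the acceptance rates."""
    return {
        "count": len(rates),
        "mean": mean(rates),
        "median": median(rates),
        "min": min(rates),
        "max": max(rates),
    }


if __name__ == "__main__":
    for key, value in summarize(ACCEPTANCE_PCT).items():
        print(f"{key}: {value:.1f}" if isinstance(value, float) else f"{key}: {value}")
```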
Injection Script
The MTP head was injected using a custom Python script that performs binary-level GGUF patching. The script:
- Reads the donor GGUF with the gguf Python library to extract MTP tensors
- Copies the target GGUF's header and KV metadata as raw bytes (no re-serialization)
- Appends the nextn_predict_layers = 1 metadata entry
- Copies original tensor info verbatim, appends MTP tensor info entries
- Copies all original tensor data byte-for-byte, appends MTP tensor data
- Patches block_count from 96 to 97
This approach preserves every byte of the original model's tensor data — no re-quantization, no shape re-serialization. Because it's a raw byte-copy, the head is carried at whatever precision the donor GGUF used, which is why the three quants here ship heads of different precision. The script is available in this repository as inject_mtp_40b.py.
Lineage
Qwen3.6-27B (base, 64 layers)
├── DavidAU: Heretic abliteration
├── DavidAU: Deckard fine-tune (5 datasets)
├── DavidAU: Layer expansion to 40B (96 layers)
├── DavidAU: Claude 4.6 Opus reasoning distillation
├── DavidAU: NEO-CODE Di-IMatrix quantization (dual imatrix; Q6_K ~97% BF16)
└── williampieh: MTP head injection from base 27B (blk.96, donor-native precision per quant)
+ vision via external 27B mmproj (5120-compatible)
Credits
- DavidAU — Original Qwen3.6-40B Deckard model creation, expansion, fine-tuning, and GGUF quantization
- Qwen Team (Alibaba) — Qwen3.6-27B base model and MTP architecture
- am17an — llama.cpp MTP support PR
- froggeric — Qwen3.6-27B mmproj used for vision, and documentation that vision + MTP works on b9240+
License
Apache 2.0 (inherited from base model)
About
Created by William Pieh / PiehSoft LLC. MTP injection tooling and methodology developed in collaboration with Claude (Anthropic).