Qwen3.6-27B-MTP-IQ4_XS-GGUF

Qwen3.6-27B with NextN/MTP (Multi-Token Prediction) speculative-decoding head, quantized to IQ4_XS for single-GPU 24GB inference.

What this is

The published Qwen3.6 family ships with native NextN MTP heads embedded in the safetensors. Most public GGUFs strip these. This conversion preserves them, producing a single GGUF that:

  • Loads as qwen35moe_mtp arch in a patched llama.cpp
  • Serves at ~2ร— decode speed vs. the same trunk without MTP
  • Fits in ~15 GB VRAM at IQ4_XS + q4/q4 KV
  • Native 262K context

Build pipeline

Source: Qwen/Qwen3.6-27B (HF safetensors, with NextN tensors).

  1. Clone llama.cpp at the crucible-mtp branch on llama.cpp (patched) (Aman Gupta's MTP fork) โ€” adds LLM_ARCH_QWEN35MOE_MTP + the NextN draft path.
  2. Run convert_hf_to_gguf.py against the HF repo. Produces a BF16 GGUF with arch qwen35moe_mtp and the NextN tensors fused in.
  3. Quantize to IQ4_XS via llama-quantize.
python convert_hf_to_gguf.py /path/to/Qwen3.6-27B \
    --outfile Qwen3.6-27B-MTP-bf16.gguf
llama-quantize Qwen3.6-27B-MTP-bf16.gguf \
    Qwen3.6-27B-MTP-IQ4_XS.gguf IQ4_XS

Optimal serving config (RTX 3090 Ti, 24 GB)

Cherry-pick PRs #20819 + #20822 for cross-process KV-slot save/restore (we use these for sub-second resume on long contexts).

llama-server \
  -m Qwen3.6-27B-MTP-IQ4_XS.gguf \
  -ngl 999 -fa on \
  --spec-type mtp --spec-draft-n-max 4 \
  --no-mmap \
  --ctx-size 262144 \
  --batch-size 1024 --ubatch-size 512 \
  -ctk q4_0 -ctv q4_0 \
  --parallel 1 --kv-unified \
  --ctx-checkpoints 8 --checkpoint-every-n-tokens 2048 \
  --cache-ram -1 --cache-idle-slots \
  --metrics --jinja

Why these flags:

  • --spec-type mtp: enables NextN-head draft path (this is the whole point of the MTP variant).
  • --spec-draft-n-max 4: empirically the sweet spot โ€” beyond that, accept rate drops faster than draft count grows.
  • --no-mmap: required for KV-slot persistence + measured ~no perf hit on this rig.
  • -ctk q4_0 -ctv q4_0: dense KV cache fits 262K context inside 24 GB without spilling.
  • --parallel 1: MTP path currently only supports n_parallel=1 upstream.

What NOT to set:

  • -ot (expert offload) โ€” defeats the GPU-resident speedup.
  • -ctk q8_0 at full 262K ctx โ€” overflows VRAM during warmup.

Performance (RTX 3090 Ti, 350 W power limit)

Measured 2026-05-06 at short-context inference, persistence + MTP on:

Metric Value
Decode tok/s (short ctx, no thinking) 100.3 (live measured 2026-05-06, n=4 spec)
Decode tok/s (longer ramp 4Kโ€“256K ctx, mean) 70โ€“73
Draft accept rate (n=4) 86.6%
Speedup vs same trunk without MTP 2.92ร— (33 โ†’ 97 t/s on identical workload)
KV slot restore (typical 50 Kโ€“200 K ctx) 0.16โ€“0.36 s
Cold load (model โ†’ ready) ~5โ€“6 s

Memory footprint at 262 K ctx: ~17 GB (model) + ~5 GB (KV q4/q4) + scratch = ~22.5 GB used, ~1.5 GB headroom on a 24 GB card.

Tokenizer

Inherits Qwen3.6 tokenizer (248,320 vocab, qwen35 pre-tokenizer). Same chat template as upstream Qwen3.6-Instruct. If your runtime errors on Jinja Exception: System message must be at the beginning, use the loosened template at: https://huggingface.co/localweights/qwen36-loose-jinja (single line edit removing the strict-position assertion).

License

Apache 2.0 (inherited from Qwen3.6).

Provenance

Built on Crucible: 9950X / 96 GB DDR5 / RTX 3090 Ti. Same pipeline used for the sibling Qwen3.6-35B-A3B-MTP-IQ4_XS-GGUF repo.

Downloads last month
734
GGUF
Model size
27B params
Architecture
qwen35
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for localweights/Qwen3.6-27B-MTP-IQ4_XS-GGUF

Base model

Qwen/Qwen3.6-27B
Quantized
(455)
this model