# Gemma 4 12B QAT MTP drafter

Multi-Token Prediction (MTP) drafter for `unsloth/gemma-4-12B-it-qat-GGUF`. It runs as a speculative draft model that shares the target's KV cache and speeds up text generation. The drafter is the Gemma 4 12B QAT assistant and pairs with the QAT model unchanged.

Verified on a single B200 against the `gemma-4-12B-it-qat-UD-Q4_K_XL.gguf` target with `-hf` auto-discovery: draft acceptance 0.51.

MTP was merged into llama.cpp on 2026-06-07 (PR ggml-org/llama.cpp#23398). You need a llama.cpp build from after that date. Older builds cannot load these files because they do not recognize the `gemma4-assistant` architecture.

## Files

The recommended drafter is a **smart Q4_0**: the native 4-bit QAT drafter (about 97% of its weights are byte-exact on the int4 grid). It is near-lossless compared with the higher-precision drafters at roughly half the size. It sits at the repo root as `mtp-gemma-4-12B-it.gguf` so `-hf` finds it automatically. The same file and the higher-precision drafters are in `MTP/`:

- `mtp-gemma-4-12B-it.gguf` (repo root, smart Q4_0, recommended; used by `-hf`)
- `MTP/gemma-4-12B-it-Q4_0-MTP.gguf` (same smart Q4_0)
- `MTP/gemma-4-12B-it-Q8_0-MTP.gguf`
- `MTP/gemma-4-12B-it-BF16-MTP.gguf`
- `MTP/gemma-4-12B-it-F16-MTP.gguf`

## Build llama.cpp

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
# CUDA build. Set the arch for your GPU: 89 (RTX 4090), 90 (H100), 100 (B200).
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build build --config Release -j --target llama-server
```

## Run, the easy way

A recent llama.cpp finds the drafter automatically from the root `mtp-` file, so `-hf` is all you need. No `--model-draft`.

```bash
./build/bin/llama-server \
  -hf unsloth/gemma-4-12B-it-qat-GGUF:UD-Q4_K_XL \
  --spec-type draft-mtp --spec-draft-n-max 4 \
  -ngl 999 -fa on
```

If your build is too old to auto-discover the sibling drafter file, use the explicit form below.

## Run with an explicit drafter

Use this to choose a precision or point at a local file.

```bash
hf download unsloth/gemma-4-12B-it-qat-GGUF gemma-4-12B-it-qat-UD-Q4_K_XL.gguf --local-dir .
hf download unsloth/gemma-4-12B-it-qat-GGUF MTP/gemma-4-12B-it-Q8_0-MTP.gguf --local-dir .

./build/bin/llama-server \
  -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
  --model-draft MTP/gemma-4-12B-it-Q8_0-MTP.gguf \
  --spec-type draft-mtp --spec-draft-n-max 4 \
  -ngl 999 -fa on
```

Multi GPU: add `--spec-draft-device CUDA0 -sm layer`.

The drafter pairs with any quant of the 12B QAT model. Quantized KV cache works.
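
If you switch between drafter precisions often, the explicit form can be wrapped in a small shell helper. The sketch below is only an illustration: `drafter_path` is a hypothetical function name, not part of llama.cpp or this repo. It maps a precision label to the matching file under `MTP/`, assuming the file names listed under "Files", and rejects anything else.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with llama.cpp or this repo).
# Maps a drafter precision to its path under MTP/, assuming the
# file names listed in the "Files" section.
drafter_path() {
  case "$1" in
    Q4_0|Q8_0|BF16|F16)
      printf 'MTP/gemma-4-12B-it-%s-MTP.gguf\n' "$1"
      ;;
    *)
      echo "unknown drafter precision: $1" >&2
      return 1
      ;;
  esac
}

# Print the path for the requested precision (defaults to Q8_0).
# Example use with the explicit form:
#   ./build/bin/llama-server -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
#     --model-draft "$(drafter_path BF16)" \
#     --spec-type draft-mtp --spec-draft-n-max 4 -ngl 999 -fa on
drafter_path "${1:-Q8_0}"
```

The helper only builds the path. You still need to download the chosen file with `hf download` before launching the server.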