# Gemma 4 12B QAT MTP drafter

Multi-Token Prediction (MTP) drafter for `unsloth/gemma-4-12B-it-qat-GGUF`. It runs as a speculative draft model that shares the target's KV cache and speeds up text generation. The drafter is the Gemma 4 12B QAT assistant and pairs with the QAT model unchanged.

Verified on a single B200 against the `gemma-4-12B-it-qat-UD-Q4_K_XL.gguf` target with `-hf` auto-discovery: draft acceptance 0.51.

MTP was merged into llama.cpp on 2026-06-07 (PR ggml-org/llama.cpp#23398). You need a llama.cpp build from after that date. Older builds cannot load these (arch `gemma4-assistant`).

## Files

The recommended drafter is a **smart Q4_0**: the native 4-bit QAT drafter (about 97% of its weights are byte-exact on the int4 grid), near-lossless versus higher precision while roughly half the size. It sits at the repo root as `mtp-gemma-4-12B-it.gguf` so `-hf` finds it automatically, and the same file plus higher-precision drafters are in `MTP/`:

- `mtp-gemma-4-12B-it.gguf` (repo root, smart Q4_0, recommended; used by `-hf`)
- `MTP/gemma-4-12B-it-Q4_0-MTP.gguf` (same smart Q4_0)
- `MTP/gemma-4-12B-it-Q8_0-MTP.gguf`
- `MTP/gemma-4-12B-it-BF16-MTP.gguf`
- `MTP/gemma-4-12B-it-F16-MTP.gguf`

## Build llama.cpp

```bash
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# CUDA build. Set the arch for your GPU: 89 (RTX 4090), 90 (H100), 100 (B200).
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=90
cmake --build build --config Release -j --target llama-server
```

## Run, the easy way

A recent llama.cpp finds the drafter automatically from the root `mtp-` file, so `-hf` is all you need. No `--model-draft`.

```bash
./build/bin/llama-server \
  -hf unsloth/gemma-4-12B-it-qat-GGUF:UD-Q4_K_XL \
  --spec-type draft-mtp --spec-draft-n-max 4 \
  -ngl 999 -fa on
```

If your build is too old to auto-discover the sibling, use the explicit form below.

## Run with an explicit drafter

Use this to choose a precision or point at a local file.

```bash
hf download unsloth/gemma-4-12B-it-qat-GGUF gemma-4-12B-it-qat-UD-Q4_K_XL.gguf --local-dir .
hf download unsloth/gemma-4-12B-it-qat-GGUF MTP/gemma-4-12B-it-Q8_0-MTP.gguf --local-dir .

./build/bin/llama-server \
  -m gemma-4-12B-it-qat-UD-Q4_K_XL.gguf \
  --model-draft MTP/gemma-4-12B-it-Q8_0-MTP.gguf \
  --spec-type draft-mtp --spec-draft-n-max 4 \
  -ngl 999 -fa on
```

Multi GPU: add `--spec-draft-device CUDA0 -sm layer`. The drafter pairs with any quant of the 12B QAT model. Quantized KV cache works.