---
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
library_name: llama.cpp
pipeline_tag: text-generation
base_model: google/gemma-4-E4B-it-assistant
base_model_relation: quantized
tags:
  - gguf
  - llama.cpp
  - mtp
  - multi-token-prediction
  - speculative-decoding
  - gemma
  - gemma-4
  - atomic-chat
  - turboquant
---

# Gemma 4 E4B Assistant — GGUF (Atomic Chat)

GGUF builds of [`google/gemma-4-E4B-it-assistant`](https://huggingface.co/google/gemma-4-E4B-it-assistant) — the official Gemma 4
**Multi-Token Prediction (MTP)** drafter for
[`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it). Use it as a speculative-decoding
draft model alongside the matching Gemma 4 target to get a meaningful decoding
speedup at zero quality loss.

Approximate size: **78.8M (assistant) / 8B target**.

> [!IMPORTANT]
> These GGUFs use the custom `gemma4_assistant` architecture and **will not
> load in stock `llama.cpp`**. They require the
> [`atomic-llama-cpp-turboquant`](https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant) fork, which adds:
> - the `gemma4_assistant` MTP drafter arch (incl. the centroid LM head for E2B/E4B),
> - **TurboQuant** KV-cache quantization (`-ctk turbo3 -ctv turbo3`),
> - the `--mtp-head` / `--spec-type mtp` runtime flags.
>
> Loading these files in upstream `ggml-org/llama.cpp` will fail with an
> unknown architecture error.

## Files

| File | Quant | Size | Notes |
|---|---|---:|---|
| `gemma-4-E4B-it-assistant.F16.gguf` | F16 | 165.8 MB | reference (smallest quality loss vs source) |
| `gemma-4-E4B-it-assistant.Q8_0.gguf` | Q8_0 | 95.6 MB | near-lossless 8-bit |
| `gemma-4-E4B-it-assistant.Q5_K_M.gguf` | Q5_K_M | 76.2 MB | balanced k-quant |
| `gemma-4-E4B-it-assistant.Q4_K_M.gguf` | Q4_K_M | 74.9 MB | recommended default for speculative-decoding draft |
| `gemma-4-E4B-it-assistant.Q4_K_S.gguf` | Q4_K_S | 74.7 MB | smallest k-quant |

For `E2B`/`E4B`, the assistant uses an **ordered-embedding centroid head** (`mtp.centroids.weight` + `mtp.token_ordering.weight`) that compresses the LM head over the 262K-vocab into 2048 centroids; this structure is preserved across every quantization level in this repo.

## Quick start

Build the fork:

```bash
git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant
cd atomic-llama-cpp-turboquant
# Pick one of the platform-specific configurations:
cmake -B build -DGGML_METAL=ON          # Apple Silicon
# cmake -B build -DGGML_CUDA=ON         # NVIDIA
# cmake -B build                        # CPU-only
cmake --build build --target llama-server llama-cli llama-quantize -j
```

Download the assistant drafter (this repo) and the matching Gemma 4 target:

```bash
hf download AtomicChat/gemma-4-E4B-it-assistant-GGUF \
    --include "*Q4_K_M.gguf" --local-dir ./models
# Any GGUF build of the matching target model works; e.g. unsloth's:
hf download unsloth/gemma-4-E4B-it-GGUF \
    --include "*Q4_K_M*.gguf" --local-dir ./models
```

Run `llama-server` with MTP speculative decoding + TurboQuant KV cache:

```bash
./build/bin/llama-server \
    -m         ./models/gemma-4-E4B-it-Q4_K_M.gguf \
    --mtp-head ./models/gemma-4-E4B-it-assistant.Q4_K_M.gguf \
    --spec-type mtp \
    --draft-block-size 3 --draft-max 8 --draft-min 0 \
    -ngl 99 -ngld 99 \
    -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \
    -fa on -c 16384 --host 127.0.0.1 --port 8080
```

A ready-made launcher lives at `scripts/run-gemma4-e4b-mtp-server.sh`
in the fork (`MTP_PRESET=throughput|lift|balanced|quality`).

## How MTP works here

Gemma 4 ships with a small "assistant" head that predicts several future tokens
from the target model's last hidden state. In `atomic-llama-cpp-turboquant` it
is loaded as a separate GGUF via `--mtp-head` and drives a custom speculative
decoder (block_size 2-3, draft_max 6-8 typical). The verifier runs the target model in
parallel, guaranteeing the same output distribution as plain greedy/sampled
decoding.

## TurboQuant KV cache

`turbo3` is the KV-cache quantization scheme used in this fork; it significantly
reduces KV memory and bandwidth at long contexts with no measurable quality
regression on Gemma 4. Apply it to both target and drafter via
`-ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3`.

## License & attribution

Released under the **Gemma Terms of Use**.

- Original model card: [`google/gemma-4-E4B-it-assistant`](https://huggingface.co/google/gemma-4-E4B-it-assistant)
- Target model: [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it)
- License text: <https://ai.google.dev/gemma/docs/gemma_4_license>

## Acknowledgements

- [Google DeepMind](https://deepmind.google/models/gemma/) — Gemma 4 family and the MTP drafters.
- [`ggml-org/llama.cpp`](https://github.com/ggml-org/llama.cpp) — upstream inference engine.
- TurboQuant primitives — KV-cache quantization scheme integrated in the fork.

— [Atomic Chat](https://atomic.chat)