--- license: apache-2.0 license_link: https://ai.google.dev/gemma/docs/gemma_4_license library_name: llama.cpp pipeline_tag: text-generation base_model: google/gemma-4-E4B-it-assistant base_model_relation: quantized tags: - gguf - llama.cpp - mtp - multi-token-prediction - speculative-decoding - gemma - gemma-4 - atomic-chat - turboquant --- # Gemma 4 E4B Assistant — GGUF (Atomic Chat) GGUF builds of [`google/gemma-4-E4B-it-assistant`](https://huggingface.co/google/gemma-4-E4B-it-assistant) — the official Gemma 4 **Multi-Token Prediction (MTP)** drafter for [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it). Use it as a speculative-decoding draft model alongside the matching Gemma 4 target to get a meaningful decoding speedup at zero quality loss. Approximate size: **78.8M (assistant) / 8B target**. > [!IMPORTANT] > These GGUFs use the custom `gemma4_assistant` architecture and **will not > load in stock `llama.cpp`**. They require the > [`atomic-llama-cpp-turboquant`](https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant) fork, which adds: > - the `gemma4_assistant` MTP drafter arch (incl. the centroid LM head for E2B/E4B), > - **TurboQuant** KV-cache quantization (`-ctk turbo3 -ctv turbo3`), > - the `--mtp-head` / `--spec-type mtp` runtime flags. > > Loading these files in upstream `ggml-org/llama.cpp` will fail with an > unknown architecture error. ## Files | File | Quant | Size | Notes | |---|---|---:|---| | `gemma-4-E4B-it-assistant.F16.gguf` | F16 | 165.8 MB | reference (smallest quality loss vs source) | | `gemma-4-E4B-it-assistant.Q8_0.gguf` | Q8_0 | 95.6 MB | near-lossless 8-bit | | `gemma-4-E4B-it-assistant.Q5_K_M.gguf` | Q5_K_M | 76.2 MB | balanced k-quant | | `gemma-4-E4B-it-assistant.Q4_K_M.gguf` | Q4_K_M | 74.9 MB | recommended default for speculative-decoding draft | | `gemma-4-E4B-it-assistant.Q4_K_S.gguf` | Q4_K_S | 74.7 MB | smallest k-quant | For `E2B`/`E4B`, the assistant uses an **ordered-embedding centroid head** (`mtp.centroids.weight` + `mtp.token_ordering.weight`) that compresses the LM head over the 262K-vocab into 2048 centroids; this structure is preserved across every quantization level in this repo. ## Quick start Build the fork: ```bash git clone https://github.com/AtomicBot-ai/atomic-llama-cpp-turboquant cd atomic-llama-cpp-turboquant # Pick one of the platform-specific configurations: cmake -B build -DGGML_METAL=ON # Apple Silicon # cmake -B build -DGGML_CUDA=ON # NVIDIA # cmake -B build # CPU-only cmake --build build --target llama-server llama-cli llama-quantize -j ``` Download the assistant drafter (this repo) and the matching Gemma 4 target: ```bash hf download AtomicChat/gemma-4-E4B-it-assistant-GGUF \ --include "*Q4_K_M.gguf" --local-dir ./models # Any GGUF build of the matching target model works; e.g. unsloth's: hf download unsloth/gemma-4-E4B-it-GGUF \ --include "*Q4_K_M*.gguf" --local-dir ./models ``` Run `llama-server` with MTP speculative decoding + TurboQuant KV cache: ```bash ./build/bin/llama-server \ -m ./models/gemma-4-E4B-it-Q4_K_M.gguf \ --mtp-head ./models/gemma-4-E4B-it-assistant.Q4_K_M.gguf \ --spec-type mtp \ --draft-block-size 3 --draft-max 8 --draft-min 0 \ -ngl 99 -ngld 99 \ -ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3 \ -fa on -c 16384 --host 127.0.0.1 --port 8080 ``` A ready-made launcher lives at `scripts/run-gemma4-e4b-mtp-server.sh` in the fork (`MTP_PRESET=throughput|lift|balanced|quality`). ## How MTP works here Gemma 4 ships with a small "assistant" head that predicts several future tokens from the target model's last hidden state. In `atomic-llama-cpp-turboquant` it is loaded as a separate GGUF via `--mtp-head` and drives a custom speculative decoder (block_size 2-3, draft_max 6-8 typical). The verifier runs the target model in parallel, guaranteeing the same output distribution as plain greedy/sampled decoding. ## TurboQuant KV cache `turbo3` is the KV-cache quantization scheme used in this fork; it significantly reduces KV memory and bandwidth at long contexts with no measurable quality regression on Gemma 4. Apply it to both target and drafter via `-ctk turbo3 -ctv turbo3 -ctkd turbo3 -ctvd turbo3`. ## License & attribution Released under the **Gemma Terms of Use**. - Original model card: [`google/gemma-4-E4B-it-assistant`](https://huggingface.co/google/gemma-4-E4B-it-assistant) - Target model: [`google/gemma-4-E4B-it`](https://huggingface.co/google/gemma-4-E4B-it) - License text: ## Acknowledgements - [Google DeepMind](https://deepmind.google/models/gemma/) — Gemma 4 family and the MTP drafters. - [`ggml-org/llama.cpp`](https://github.com/ggml-org/llama.cpp) — upstream inference engine. - TurboQuant primitives — KV-cache quantization scheme integrated in the fork. — [Atomic Chat](https://atomic.chat)