---
license: gemma
base_model: google/gemma-4-E4B-it-qat-q4_0-unquantized
tags:
  - coreai
  - aimodel
  - apple-silicon
  - on-device
  - gemma-4
  - qat
  - gpu-pipelined
pipeline_tag: text-generation
---

# Gemma 4 E4B (text) — Apple Core AI (`.aimodel`)

**Gemma 4 E4B's text decoder converted to Apple's Core AI** (the Core ML successor announced
at WWDC26), running on iOS 27 / macOS 27 via Apple's `coreai-pipelined` GPU engine — **zero
custom kernels, greedy oracle 8/8 exact vs the fp32 Hugging Face reference on the Mac GPU and
the iPhone GPU (iPhone is 24/24 token-identical to the Mac on the determinism probe)**.

Converted **directly from Google's official QAT release**
[google/gemma-4-E4B-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-unquantized):
bf16 weights **trained for q4_0 rounding**, and q4_0 *is* this bundle's quantization class
(per-block-32 absmax linear int4) — Google publishes these checkpoints as "preserving similar
quality to bfloat16", so this int4 conversion carries that guarantee **by design**, not by
post-hoc gating.

> Requires the iOS 27 / macOS 27 beta. Conversion code, knowledge base, engine patch stack:
> **[coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo)** —
> model card: [`zoo/gemma4-e4b.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/gemma4-e4b.md).

## Measured (greedy; M4 Max / iPhone 17 Pro, settled device)

| config | files | size | M4 Max decode / prefill | iPhone decode / prefill |
|---|---|---|---|---|
| ★ **provider** (runs BOTH platforms) | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin/` + `ios-frontend/gemma4_e4b_qat_gather_raw/` | 3.7 + 3.4 GB | 53.2 / 62.6 | **15.1 / 21.3** |
| ★ **provider, iPhone-ready AOT** | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin_aotc_h18p/` (precompiled `.aimodelc`, **h18p = iPhone 17 Pro class only**) + the same tables | 3.7 + 3.4 GB | — | same as above — skip the AOT step |
| **tbl** (Mac-fastest) | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin_tbl/` + the two `embed_per_layer.*` table files | 3.7 + 2.7 GB | **55.8** / 61.0 | not viable (3.7 GB graph + 2.7 GB owned tables > the ~6.4 GB entitled limit) |

On iPhone the working set stays tiny — measured peak footprint **2.2 GB** (4.2 GB headroom):
the PLE table rides as a clean mmap and the AOT executable pages are evictable. Both phases
land exactly on the bandwidth model (~2.1 GB int4/token).

## What E4B is (config + checkpoint verified)

Clean **dense** model — no MoE. 42 layers (full attention every 6th), hidden 2560,
intermediate 10240 uniform, 8 query heads / **2 KV heads**, dual head_dim 256/512, 18
KV-shared layers (the engine bundle stacks the 24 non-shared layers into ONE unified padded
KV pair), per-layer embeddings (the [262144, 10752] int8 table ships in
`ios-frontend/gemma4_e4b_qat_gather_raw/`), final-logit softcap 30. The QAT checkpoint prunes
the never-used KV projections on the shared layers — the zoo's loader handles both layouts.

## Run contract (each item is load-bearing)

Full story + traps:
[pipelined-engine page](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).

1. Swift stack = `apple/coreai-models` + the zoo's patch stack
   ([`apps/*.patch`](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps), in
   order). The ★ provider bundle needs `EngineOptions.perTokenInputProvider`
   (`coreai-pipelined-per-token-inputs.patch`); the tbl bundle needs
   `EngineOptions.staticInputBuffers` (`coreai-pipelined-static-inputs.patch`).
2. Provider mode: per token, fill `ple_tokens [1,1,42,256]` fp16 from the table dump —
   `row = i8[id] * scale[id] * sqrt(256)`, mmap-gathered (~0.1 ms). tbl mode: bind
   `ple_table` ← `embed_per_layer.i8` and `ple_scale` ← `embed_per_layer.scale.f32` as
   **OWNED `storageModeShared` MTLBuffers** (buffer-backing traps in the knowledge page).
3. `COREAI_CHUNK_THRESHOLD=1` **before** engine creation; **never call `engine.warmup()`**
   (S=1 graph; a 1-token generate after load is the warmup).
4. iPhone: **AOT is mandatory** (the 3.7 GB-constants graph crashes the on-device
   specializer) — use the precompiled `_aotc_h18p/` bundle, or
   `xcrun coreai-build compile <bundle>.aimodel --platform iOS --preferred-compute gpu
   --architecture h18p --expect-frequent-reshapes` and point `metadata.json`'s
   `assets.main` at the `.aimodelc`. Ship the
   `com.apple.developer.kernel.increased-memory-limit` entitlement as headroom insurance,
   and bench a **settled** device (a just-unlocked iPhone under-reads ~35%).

Reproduce from scratch (oracle + tables are checkpoint-derived — regenerate for any new
weights): [`conversion/export_gemma4_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_gemma4_decode_pipelined.py)
with `--hf-id google/gemma-4-E4B-it-qat-q4_0-unquantized`.

## License

Gemma is provided under and subject to the **Gemma Terms of Use**
(https://ai.google.dev/gemma/terms). These `.aimodel` bundles are Model Derivatives of
[google/gemma-4-E4B-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-unquantized);
by downloading or using them you agree to those terms, including the
[Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).

Sibling repo (E2B, incl. its own official-QAT bundles):
[gemma-4-E2B-CoreAI](https://huggingface.co/mlboydaisuke/gemma-4-E2B-CoreAI).