---
license: gemma
base_model: google/gemma-4-E4B-it-qat-q4_0-unquantized
tags:
- coreai
- aimodel
- apple-silicon
- on-device
- gemma-4
- qat
- gpu-pipelined
pipeline_tag: text-generation
---

# Gemma 4 E4B (text): Apple Core AI (`.aimodel`)

**Gemma 4 E4B's text decoder converted to Apple's Core AI** (the Core ML successor announced at WWDC26), running on iOS 27 / macOS 27 via Apple's `coreai-pipelined` GPU engine: **zero custom kernels, greedy oracle 8/8 exact vs the fp32 Hugging Face reference on the Mac GPU and the iPhone GPU (iPhone is 24/24 token-identical to the Mac on the determinism probe)**.

Converted **directly from Google's official QAT release** [google/gemma-4-E4B-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-unquantized): bf16 weights **trained for q4_0 rounding**, and q4_0 *is* this bundle's quantization class (per-block-32 absmax linear int4). Google publishes these checkpoints as "preserving similar quality to bfloat16", so this int4 conversion carries that guarantee **by design**, not by post-hoc gating.

> Requires the iOS 27 / macOS 27 beta. Conversion code, knowledge base, engine patch stack:
> **[coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo)**,
> model card: [`zoo/gemma4-e4b.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/gemma4-e4b.md).

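
To make "per-block-32 absmax linear int4" concrete, here is a minimal sketch of that quantization class in plain Python. It is an illustration, not the converter's code: the symmetric `[-7, 7]` grid and the function names are assumptions, and the exact rounding and nibble packing of q4_0 in this bundle are not spelled out in this card.

```python
BLOCK = 32  # block size from "per-block-32" in the text above


def quantize_block(values):
    """Quantize one 32-value block to int4 codes plus a single float scale."""
    if len(values) != BLOCK:
        raise ValueError(f"expected {BLOCK} values, got {len(values)}")
    absmax = max(abs(v) for v in values)
    # Assumption: symmetric grid [-7, 7]; the real q4_0 packing may differ.
    scale = absmax / 7.0 if absmax > 0.0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_block(codes, scale):
    """Linear reconstruction: value = code * scale."""
    return [c * scale for c in codes]
```

In this sketch, weights that already sit on the block's grid survive the round trip unchanged, which is the intuition behind training the checkpoint for q4_0 rounding before converting it.
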
## Measured (greedy; M4 Max / iPhone 17 Pro, settled device)

| config | files | size | M4 Max decode / prefill | iPhone decode / prefill |
|---|---|---|---|---|
| ★ **provider** (runs BOTH platforms) | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin/` + `ios-frontend/gemma4_e4b_qat_gather_raw/` | 3.7 + 3.4 GB | 53.2 / 62.6 | **15.1 / 21.3** |
| ★ **provider, iPhone-ready AOT** | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin_aotc_h18p/` (precompiled `.aimodelc`, **h18p = iPhone 17 Pro class only**) + the same tables | 3.7 + 3.4 GB | n/a | same as above; skip the AOT step |
| **tbl** (Mac-fastest) | `gpu-pipelined/gemma4_e4b_qat_decode_int4lin_tbl/` + the two `embed_per_layer.*` table files | 3.7 + 2.7 GB | **55.8** / 61.0 | not viable (3.7 GB graph + 2.7 GB owned tables > the ~6.4 GB entitled limit) |

On iPhone the working set stays tiny, with a measured peak footprint of **2.2 GB** (4.2 GB headroom): the PLE table rides as a clean mmap and the AOT executable pages are evictable. Both phases land exactly on the bandwidth model (~2.1 GB int4/token).

## What E4B is (config + checkpoint verified)

Clean **dense** model, no MoE. 42 layers (full attention every 6th), hidden 2560, intermediate 10240 uniform, 8 query heads / **2 KV heads**, dual head_dim 256/512, 18 KV-shared layers (the engine bundle stacks the 24 non-shared layers into ONE unified padded KV pair), per-layer embeddings (the [262144, 10752] int8 table ships in `ios-frontend/gemma4_e4b_qat_gather_raw/`), final-logit softcap 30. The QAT checkpoint prunes the never-used KV projections on the shared layers; the zoo's loader handles both layouts.

## Run contract (each item is load-bearing)

Full story + traps: [pipelined-engine page](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).

1. Swift stack = `apple/coreai-models` + the zoo's patch stack ([`apps/*.patch`](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps), in order).
   The ★ provider bundle needs `EngineOptions.perTokenInputProvider` (`coreai-pipelined-per-token-inputs.patch`); the tbl bundle needs `EngineOptions.staticInputBuffers` (`coreai-pipelined-static-inputs.patch`).
2. Provider mode: per token, fill `ple_tokens [1,1,42,256]` fp16 from the table dump, `row = i8[id] * scale[id] * sqrt(256)`, mmap-gathered (~0.1 ms). tbl mode: bind `ple_table` ← `embed_per_layer.i8` and `ple_scale` ← `embed_per_layer.scale.f32` as **OWNED `storageModeShared` MTLBuffers** (buffer-backing traps in the knowledge page).
3. Set `COREAI_CHUNK_THRESHOLD=1` **before** engine creation; **never call `engine.warmup()`** (S=1 graph; a 1-token generate after load is the warmup).
4. iPhone: **AOT is mandatory** (the 3.7 GB-constants graph crashes the on-device specializer). Use the precompiled `_aotc_h18p/` bundle, or `xcrun coreai-build compile .aimodel --platform iOS --preferred-compute gpu --architecture h18p --expect-frequent-reshapes` and point `metadata.json`'s `assets.main` at the `.aimodelc`. Ship the `com.apple.developer.kernel.increased-memory-limit` entitlement as headroom insurance, and bench a **settled** device (a just-unlocked iPhone under-reads ~35%).

Reproduce from scratch (oracle + tables are checkpoint-derived; regenerate them for any new weights): [`conversion/export_gemma4_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_gemma4_decode_pipelined.py) with `--hf-id google/gemma-4-E4B-it-qat-q4_0-unquantized`.

## License

Gemma is provided under and subject to the **Gemma Terms of Use** (https://ai.google.dev/gemma/terms). These `.aimodel` bundles are Model Derivatives of [google/gemma-4-E4B-it-qat-q4_0-unquantized](https://huggingface.co/google/gemma-4-E4B-it-qat-q4_0-unquantized); by downloading or using them you agree to those terms, including the [Gemma Prohibited Use Policy](https://ai.google.dev/gemma/prohibited_use_policy).

Sibling repo (E2B, incl. its own official-QAT bundles): [gemma-4-E2B-CoreAI](https://huggingface.co/mlboydaisuke/gemma-4-E2B-CoreAI).
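
As a footnote to the provider-mode step of the run contract, the per-token gather `row = i8[id] * scale[id] * sqrt(256)` can be sketched in plain Python as below. This is only an illustration of the arithmetic and not the Swift provider itself: the function names, the flat row-major int8 layout, and the one-float32-scale-per-token layout are assumptions, and the real table dump format may differ.

```python
import math
import mmap
import struct

NUM_LAYERS = 42              # from the ple_tokens shape [1, 1, 42, 256]
PLE_DIM = 256
PLE_SCALE = math.sqrt(256)   # the sqrt(256) factor in the card's formula


def open_readonly_mmap(path):
    """Map a table file read-only so only the touched rows get paged in."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def gather_ple_row(table, scales, token_id, layers=NUM_LAYERS, dim=PLE_DIM):
    """Return one token's per-layer embedding row as little-endian fp16 bytes.

    `table` and `scales` may be mmap objects or plain bytes.
    Assumed layout: int8 rows of layers * dim values, one float32 scale per row.
    """
    row_len = layers * dim
    raw = table[token_id * row_len:(token_id + 1) * row_len]
    if len(raw) != row_len:
        raise IndexError(f"token id {token_id} is outside the table")
    codes = struct.unpack(f"{row_len}b", raw)  # signed int8 codes
    (scale,) = struct.unpack("<f", scales[token_id * 4:token_id * 4 + 4])
    factor = scale * PLE_SCALE
    # Pack as half precision, matching the fp16 input the card describes.
    return struct.pack(f"<{row_len}e", *(c * factor for c in codes))
```

With the default sizes, each call reads 42 × 256 = 10752 bytes of int8 data plus one 4-byte scale, so a row gather only touches a small slice of the mapped table.
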