| --- |
| license: gemma |
| base_model: google/gemma-4-31B-it-qat-q4_0-unquantized |
| tags: |
| - core-ai |
| - coreai |
| - apple |
| - gemma |
| - gemma-4 |
| - on-device |
| - metal |
| pipeline_tag: text-generation |
| library_name: coreai |
| --- |
| |
| # Gemma 4 31B (dense) β Core AI |
|
|
| Apple **Core AI** (`.aimodel`) conversion of Google's **Gemma 4 31B** dense text decoder, ported |
| directly from the QAT release |
| [`google/gemma-4-31B-it-qat-q4_0-unquantized`](https://huggingface.co/google/gemma-4-31B-it-qat-q4_0-unquantized). |
| Decode-only, runs on the **stock pipelined engine** on Apple Silicon (Mac-class, ~16 GB). |
|
|
| > **Frontier dense, unblocked by a custom Metal kernel.** Gemma 4 31B's full (global) attention |
| > layers have a 32-head Γ 512 Q tensor that overflows MPSGraph's GPU decode scratch heap β the stock |
| > SDPA crashes at the first token ([apple/coreai-models#27](https://github.com/apple/coreai-models/issues/27), |
| > the same bug as the 12B). This bundle ships a **custom flash-decode SDPA kernel** on the full |
| > layers (block-GQA over the 31B's 4 global KV heads) that removes the offending op, so the model |
| > runs. |
|
|
| ## Bundle (`gpu-pipelined/`) |
|
|
| | bundle | quant | size | decode (M4 Max) | |
| |---|---|---|---| |
| | `gemma4_31b_qat_decode_int4linsym_msdpa_g8` | int4 (q4_0-aligned absmax) | 19 GB | **17.2 tok/s** (prefill 22.1) | |
| |
| int4 from Google's QAT checkpoint (q4_0 grid). A frontier 31B at int4 is bandwidth-bound, so decode |
| is in the MLX-parity range β the value is "Core AI runs a frontier dense model the stock engine |
| cannot." Mac-only (exceeds the iPhone memory budget). The `_g8` suffix is the higher-occupancy |
| flash-decode kernel (8 SIMD-groups per head split the global layers' KV scan; same numerics). |
|
|
| ## Architecture |
|
|
| Clean dense `gemma4` text decoder β **no** PLE / AltUp / Laurel / MoE / KV-sharing. 60 layers, |
| hidden 5376, 32 heads, vocab 262144, softcap 30, tied embeddings. 5:1 sliding:full; dual head_dim |
| (sliding 256 / full `global_head_dim` 512); full layers use `num_global_key_value_heads` 4 with |
| `attention_k_eq_v` (value = raw k_proj). Both attention shapes ride one growing KV pair, so the |
| bundle loads on the stock `CoreAIPipelinedEngine` (2 states, no engine patch); the full layers' SDPA |
| runs as a custom Metal flash-decode kernel. |
| |
| ## Usage |
| |
| ```bash |
| huggingface-cli download mlboydaisuke/Gemma-4-31B-CoreAI \ |
| --include "gpu-pipelined/gemma4_31b_qat_decode_int4linsym_msdpa_g8/*" \ |
| --local-dir ./gemma4-31b-coreai |
| |
| COREAI_CHUNK_THRESHOLD=1 llm-runner \ |
| --model ./gemma4-31b-coreai/gpu-pipelined/gemma4_31b_qat_decode_int4linsym_msdpa_g8 \ |
| --prompt "What is the capital of France?" --max-tokens 64 --chunk-size 1 |
| ``` |
| |
| ## Conversion |
| |
| Community zoo: |
| [github.com/john-rocky/coreai-model-zoo β `zoo/gemma4-31b.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/gemma4-31b.md). |
| |
| ## License |
| |
| Gemma β governed by the [Gemma Terms of Use](https://ai.google.dev/gemma/terms). |
| |