| --- |
| license: apache-2.0 |
| base_model: |
| - ibm-granite/granite-4.0-h-1b |
| - ibm-granite/granite-4.0-h-350m |
| tags: |
| - apple |
| - coreai |
| - aimodel |
| - on-device |
| - granite |
| - mamba2 |
| --- |
| |
| # Granite 4.0-H 1B / 350M β Apple Core AI (`.aimodel`) |
|
|
| IBM Granite 4.0-H (Mamba2 + attention hybrid; 1b: 36 Mamba2 mixers + 4 NoPE GQA attention |
| layers) converted to Apple **Core AI** for iOS 27 / macOS 27 (beta), riding Apple's |
| **`coreai-pipelined` GPU engine** via the decode-only loop-free export β async encode, |
| on-GPU argmax sampling, on-device KV growth, zero custom kernels. |
|
|
| **The first SSM-scan architecture on this path**: at S=1 the Mamba2 selective scan is a |
| single recurrence step (no `while_loop` in the graph), and the conv/SSM states ride as two |
| fixed-shape extra states β the same shape-class as Qwen3.5's GDN, so no engine changes |
| beyond the existing patch stack. |
|
|
| | surface | bundle | prefill (S=1) | decode | |
| |---|---|---:|---:| |
| | **1b int8hu (int8 head), iPhone 17 Pro** (one-shot runner) β device ship | 1.79 GB | 35.1β37.0 | **35.4β37.1 tok/s** | |
| | 1b int8hu (int8 head), M4 Max | 1.79 GB | 134.9 | 134.2 tok/s | |
| | **1b int8lin, M4 Max** (release `llm-benchmark`, p=128 g=256) β Mac ship | 1.63 GB | 136.7 | **136.5 tok/s** | |
| | 1b int8lin, iPhone 17 Pro (one-shot runner) | 1.63 GB | 30.1β32.2 | **30.2β31.3 tok/s** | |
| | 350m fp16, M4 Max | 0.66 GB | 193.2 | **191.1 tok/s** | |
|
|
| Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded |
| decode step** (the [zoo](https://github.com/john-rocky/coreai-model-zoo) ship gate, on an |
| oracle whose top-2 margin is β₯ 0.1 at every position), and the iPhone greedy sequences are |
| **24/24 token-identical to the Mac GPU** on both fixed prompts, both runs. |
|
|
| ## Bundles |
|
|
| - `gpu-pipelined/granite_4_0_h_1b_decode_int8hu_block32_sym/` β int8lin + the tied lm_head |
| untied and quantized **absmax per-block-32 int8** (`symmetric`, no clipping β clipping |
| corrupts big-vocab heads), 1.79 GB. **The device ship: +17β21% decode on iPhone** (the head |
| was ~10% of the per-token read on the bandwidth-saturated surface; on the Mac it is ~flat, |
| the engine pipeline hides the head there). Oracle gate 16/16 + decode step; device numerics |
| 24/24 β‘ Mac-GPU on all 3 runs. (Do not re-quantize heads per-channel: per-channel axis-0 |
| int8 is broken on the current beta GPU delegate β garbage logits.) |
| - `gpu-pipelined/granite_4_0_h_1b_decode_int8lin/` β full LanguageBundle (`metadata.json` + |
| `tokenizer/` + `.aimodel`), **int8 linear per-block-32** (scale-multiply dequant, no LUT), |
| fp16 embed + tied lm_head in-graph, 1.63 GB. The Mac ship configuration (136.5 vs 134.2) |
| and the lighter iPhone alternative. |
| - `gpu-pipelined/granite_4_0_h_350m_decode_fp16/` β the 350m as fp16, 0.66 GB. At this size |
| the model is overhead-bound, not bandwidth-bound: int8 measured *slower* than fp16 (185.8 |
| vs 191.1) **and** fails the oracle gate (`shared_mlp.output_linear` is block-32-sensitive), |
| so the 350m ships fp16. |
|
|
| `input_ids` is STATIC `[1,1]` (the selective scan at S=1 β‘ one loop-free recurrence step); |
| position_ids + KV seq stay dynamic, so `EngineFactory` classifies the bundle dynamic β |
| pipelined engine. States: growing KV (4 attention layers) + conv columns `[36,1,conv_dim,3]` |
| + SSM state `[36,1,48,64,128]` (fixed shape, carried by the extra-states patch). |
|
|
| ## Run |
|
|
| Needs the engine patch stack from the |
| [zoo](https://github.com/john-rocky/coreai-model-zoo) (`apps/coreai-shared-product.patch` β |
| `apps/coreai-pipelined-extra-states.patch`; Apple's repo is issues-only, so capabilities ship |
| as patches), then: |
|
|
| ```bash |
| COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model granite_4_0_h_1b_decode_int8lin -p 128 -g 256 -n 3 |
| ``` |
|
|
| - `COREAI_CHUNK_THRESHOLD=1` **before engine creation** β prefill runs as pipelined S=1 steps |
| (prompt tok/s β decode tok/s). |
| - **Never call `engine.warmup()`** β it warms query length 256 and the static `[1,1]` graph |
| rejects it. A 1-token generate after load is the warmup (`llm-runner` needs |
| `--warmup exact --warmup-length 1`). |
| - Benchmark **Release** builds only (a Debug engine measures ~3Γ slow). |
|
|
| ## iPhone |
|
|
| The 1b int8lin runs at **~31 tok/s β ~84% of the naive bandwidth ceiling** (~60 GB/s Γ· |
| 1.6 GB/token β 37) on an iPhone 17 Pro β effectively memory-bandwidth saturated; the SSM |
| scan costs nothing extra at S=1. Cold GPU specialization **5.7 s**, warm loads **1.9 s** |
| (content-keyed cache β no AOT compile needed, and at 1.6 GB no increased-memory entitlement |
| is required, unlike 2B-class bundles). |
|
|
| ## Reproduce |
|
|
| Conversion script (self-contained) + method page in the zoo: |
| [`conversion/export_granite4h_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_granite4h_decode_pipelined.py) |
| (`int8lin`, or `fp16 --hf-id ibm-granite/granite-4.0-h-350m`) Β· |
| [`zoo/granite-4.0-h.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/granite-4.0-h.md) |
| (includes the 350m int8 post-mortem and the oracle-margin rule) Β· |
| [`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md) |
|
|
| ## License |
|
|
| Model weights: **Apache-2.0** (IBM Granite; `LICENSE` included). Conversion code: BSD-3-Clause |
| (see the zoo). |
|
|