Upload README.md with huggingface_hub
README.md
ADDED
@@ -0,0 +1,91 @@
---
license: apache-2.0
base_model:
- ibm-granite/granite-4.0-h-1b
- ibm-granite/granite-4.0-h-350m
tags:
- apple
- coreai
- aimodel
- on-device
- granite
- mamba2
---

# Granite 4.0-H 1B / 350M: Apple Core AI (`.aimodel`)

IBM Granite 4.0-H (a Mamba2 + attention hybrid; the 1b has 36 Mamba2 mixers + 4 NoPE GQA
attention layers), converted to Apple **Core AI** for iOS 27 / macOS 27 (beta). It runs on
Apple's **`coreai-pipelined` GPU engine** via the decode-only, loop-free export: async encode,
on-GPU argmax sampling, on-device KV growth, zero custom kernels.

**The first SSM-scan architecture on this path**: at S=1 (one token per step) the Mamba2
selective scan is a single recurrence step (no `while_loop` in the graph), and the conv/SSM
states ride as two fixed-shape extra states, the same shape class as Qwen3.5's GDN, so no
engine changes are needed beyond the existing patch stack.

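To make the S=1 claim concrete, here is a minimal Python sketch of one selective-scan step for a single head, assuming the usual Mamba2 form with one scalar decay per head. The name `ssm_step` and the argument layout are illustrative assumptions, not the exporter's code; the point is only that one token needs one state update and no loop over time.

```python
import math

def ssm_step(h, x, dt, a, b, c, d):
    """One recurrence step for a single head (illustrative, not the exporter's code).

    h: state, shape [head_dim][state_dim]
    x: input for this token, shape [head_dim]
    dt: step size, a: scalar decay parameter (negative)
    b, c: input/output projections for this token, shape [state_dim]
    d: skip-connection weight
    """
    decay = math.exp(dt * a)  # a < 0, so 0 < decay < 1
    # State update: decay the old state, then add the outer product of x and b.
    new_h = [[decay * h[p][n] + dt * b[n] * x[p] for n in range(len(b))]
             for p in range(len(x))]
    # Readout: project the new state with c and add the skip term.
    y = [sum(c[n] * new_h[p][n] for n in range(len(c))) + d * x[p]
         for p in range(len(x))]
    return new_h, y
```

Calling it once per decoded token and feeding the returned state back in is the whole scan at S=1, which is why the state can travel as a fixed-shape tensor.
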
| surface | bundle size | prefill (tok/s, S=1) | decode (tok/s) |
|---|---|---:|---:|
| **1b int8lin, M4 Max** (release `llm-benchmark`, p=128 g=256) | 1.63 GB | 136.7 | **136.5** |
| **1b int8lin, iPhone 17 Pro** (one-shot runner) | 1.63 GB | 30.1–32.2 | **30.2–31.3** |
| 350m fp16, M4 Max | 0.66 GB | 193.2 | **191.1** |

Numerics: **16/16 teacher-forced single-step top-1 agreement with the fp32 HF oracle +
HF-cache-seeded decode step** (the [zoo](https://github.com/john-rocky/coreai-model-zoo) ship
gate, on an oracle whose top-2 margin is ≥ 0.1 at every position). The iPhone greedy sequences
are **24/24 token-identical to the Mac GPU** on both fixed prompts, in both runs.

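The gate described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the zoo's checker: the function names are hypothetical, and it assumes the margin is measured on raw per-position scores (logits). A position only counts as a fair test when the oracle's top two candidates are at least `min_margin` apart, so a rounding-level difference cannot flip the winner.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])

def top2_margin(values):
    """Gap between the largest and second-largest value."""
    top, second = sorted(values, reverse=True)[:2]
    return top - second

def top1_agreement(oracle_steps, candidate_steps, min_margin=0.1):
    """Count positions where the candidate's top-1 token matches the oracle's.

    Raises if any oracle position is too close to call, because such a
    position would test noise rather than the conversion.
    """
    if len(oracle_steps) != len(candidate_steps):
        raise ValueError("oracle and candidate must cover the same positions")
    matches = 0
    for oracle, candidate in zip(oracle_steps, candidate_steps):
        if top2_margin(oracle) < min_margin:
            raise ValueError("oracle top-2 margin below threshold")
        if argmax(oracle) == argmax(candidate):
            matches += 1
    return matches, len(oracle_steps)
```

A result of `(16, 16)` from a check of this kind would correspond to the 16/16 figure quoted above.
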
## Bundles

- `gpu-pipelined/granite_4_0_h_1b_decode_int8lin/`: full LanguageBundle (`metadata.json` +
  `tokenizer/` + `.aimodel`), **int8 linear per-block-32** (scale-multiply dequant, no LUT),
  fp16 embed + tied lm_head in-graph, 1.63 GB. The ship configuration for both Mac and iPhone.
- `gpu-pipelined/granite_4_0_h_350m_decode_fp16/`: the 350m as fp16, 0.66 GB. At this size
  the model is overhead-bound, not bandwidth-bound: int8 measured *slower* than fp16 (185.8
  vs 191.1 tok/s) **and** fails the oracle gate (`shared_mlp.output_linear` is
  block-32-sensitive), so the 350m ships fp16.

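As a rough illustration of what "int8 linear per-block-32 (scale-multiply dequant, no LUT)" means, the sketch below quantizes a flat weight list in blocks of 32 with one scale per block and restores it with a single multiply. It assumes symmetric quantization with no zero point; the function names are hypothetical and the real exporter may differ in rounding and layout.

```python
BLOCK = 32  # weights per scale, matching "per-block-32"

def quantize_blocks(weights, block=BLOCK):
    """Symmetric int8 quantization: one float scale per block of weights."""
    q, scales = [], []
    for start in range(0, len(weights), block):
        chunk = weights[start:start + block]
        max_abs = max(abs(w) for w in chunk)
        # An all-zero block gets a dummy scale so the division stays defined.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q.extend(max(-127, min(127, round(w / scale))) for w in chunk)
    return q, scales

def dequantize_blocks(q, scales, block=BLOCK):
    """Dequantize by multiplying each int8 value by its block's scale."""
    return [q[i] * scales[i // block] for i in range(len(q))]
```

Under these assumptions the reconstruction error of each weight is at most half of its block's scale, so the error budget of a block is set by its largest weight.
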
`input_ids` is STATIC `[1,1]` (the selective scan at S=1 ≡ one loop-free recurrence step);
`position_ids` and the KV sequence length stay dynamic, so `EngineFactory` classifies the
bundle as dynamic → pipelined engine. States: growing KV (4 attention layers) + conv columns
`[36,1,conv_dim,3]` + SSM state `[36,1,48,64,128]` (fixed shape, carried by the extra-states
patch).

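For a sense of scale, the fixed SSM state can be sized directly from the shape quoted above. The storage dtype is not stated in this card, so the 2 bytes per element below is an assumption (fp16); the conv state is left out because `conv_dim` is not given.

```python
from math import prod

SSM_STATE_SHAPE = (36, 1, 48, 64, 128)  # shape quoted in the bundle description

def state_bytes(shape, bytes_per_element=2):
    """Size of a fixed-shape state; 2 bytes/element assumes fp16 storage."""
    return prod(shape) * bytes_per_element

ssm_elements = prod(SSM_STATE_SHAPE)
ssm_mib = state_bytes(SSM_STATE_SHAPE) / 2**20
print(f"{ssm_elements:,} elements, {ssm_mib:.1f} MiB at fp16")
# prints: 14,155,776 elements, 27.0 MiB at fp16
```

Unlike the KV cache, this figure does not change with context length, which is what lets the state ride as a fixed-shape extra state.
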
## Run

Needs the engine patch stack from the
[zoo](https://github.com/john-rocky/coreai-model-zoo) (`apps/coreai-shared-product.patch` →
`apps/coreai-pipelined-extra-states.patch`; Apple's repo is issues-only, so capabilities ship
as patches), then:

```bash
COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model granite_4_0_h_1b_decode_int8lin -p 128 -g 256 -n 3
```

- Set `COREAI_CHUNK_THRESHOLD=1` **before engine creation**, so prefill runs as pipelined S=1
  steps (prompt tok/s ≈ decode tok/s).
- **Never call `engine.warmup()`**: it warms query length 256 and the static `[1,1]` graph
  rejects it. A 1-token generate after load is the warmup (`llm-runner` needs
  `--warmup exact --warmup-length 1`).
- Benchmark **Release** builds only (a Debug engine measures ~3× slower).

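One way to guarantee the environment variable is set before the engine exists is to set it in the parent process that launches the benchmark. The Python wrapper below is a sketch under that assumption: the helper names are hypothetical, it uses only the flags shown in the command above, and it reads `-p`/`-g` as prompt and generation length (matching "p=128 g=256" in the results table) and `-n` as a repeat count, which is a guess.

```python
import os
import subprocess

def benchmark_command(model, prompt_tokens=128, gen_tokens=256, runs=3):
    """Build the llm-benchmark argv using only the flags shown in this README."""
    return ["llm-benchmark", "--model", model,
            "-p", str(prompt_tokens), "-g", str(gen_tokens), "-n", str(runs)]

def benchmark_env(base_env=None):
    """Copy the environment and force S=1 chunking before the process starts."""
    env = dict(os.environ if base_env is None else base_env)
    env["COREAI_CHUNK_THRESHOLD"] = "1"
    return env

def run_benchmark(model):
    """Launch the benchmark; requires llm-benchmark on PATH (a Release build)."""
    return subprocess.run(benchmark_command(model), env=benchmark_env(), check=True)
```

Calling `run_benchmark("granite_4_0_h_1b_decode_int8lin")` would be equivalent to the shell one-liner above.
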
## iPhone

The 1b int8lin runs at **~31 tok/s, ~84% of the naive bandwidth ceiling** (~60 GB/s ÷
1.6 GB/token ≈ 37 tok/s) on an iPhone 17 Pro, so it is effectively memory-bandwidth
saturated; the SSM scan costs nothing extra at S=1. Cold GPU specialization takes **5.7 s**,
warm loads **1.9 s** (content-keyed cache: no AOT compile needed, and at 1.6 GB no
increased-memory entitlement is required, unlike 2B-class bundles).

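The ceiling estimate is simple enough to check by hand. The sketch below assumes every decoded token streams all weights from memory exactly once, and it uses the 1.63 GB bundle size from the results table rather than the rounded 1.6 GB; the function name is illustrative.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, weights_gb):
    """Upper bound on decode speed if each token reads all weights once."""
    return bandwidth_gb_s / weights_gb

ceiling = decode_ceiling_tok_s(60.0, 1.63)  # ~60 GB/s, 1.63 GB bundle
utilization = 31.0 / ceiling                # measured ~31 tok/s
print(f"ceiling ≈ {ceiling:.1f} tok/s, utilization ≈ {utilization:.0%}")
# prints: ceiling ≈ 36.8 tok/s, utilization ≈ 84%
```

The bound is naive in that it ignores state reads and writes and any cache reuse, so it is best read as an order-of-magnitude check rather than a hard limit.
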
## Reproduce

Conversion script (self-contained) + method page in the zoo:
[`conversion/export_granite4h_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_granite4h_decode_pipelined.py)
(`int8lin`, or `fp16 --hf-id ibm-granite/granite-4.0-h-350m`) ·
[`zoo/granite-4.0-h.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/granite-4.0-h.md)
(includes the 350m int8 post-mortem and the oracle-margin rule) ·
[`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)

## License

Model weights: **Apache-2.0** (IBM Granite; `LICENSE` included). Conversion code: BSD-3-Clause
(see the zoo).