mlboydaisuke committed on
Commit eef67e8 · verified · 1 Parent(s): 7587e2a

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +91 -0
README.md ADDED
@@ -0,0 +1,91 @@
---
license: apache-2.0
base_model:
- ibm-granite/granite-4.0-h-1b
- ibm-granite/granite-4.0-h-350m
tags:
- apple
- coreai
- aimodel
- on-device
- granite
- mamba2
---

# Granite 4.0-H 1B / 350M: Apple Core AI (`.aimodel`)

IBM Granite 4.0-H (a Mamba2 + attention hybrid; the 1b has 36 Mamba2 mixers + 4 NoPE GQA attention
layers) converted to Apple **Core AI** for iOS 27 / macOS 27 (beta). It runs on Apple's
**`coreai-pipelined` GPU engine** via the decode-only, loop-free export: async encode,
on-GPU argmax sampling, on-device KV growth, zero custom kernels.

**The first SSM-scan architecture on this path**: at S=1 the Mamba2 selective scan is a
single recurrence step (no `while_loop` in the graph), and the conv/SSM states are carried as two
fixed-shape extra states. That is the same shape class as Qwen3.5's GDN, so no engine changes
are needed beyond the existing patch stack.
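
To make the "single recurrence step" concrete, here is a minimal pure-Python sketch of one Mamba2-style state update for a single head. It is an illustration under stated assumptions, not the exported graph: the function name `ssm_step`, the toy dimensions, and the omission of the conv, gating, and normalization stages are all choices made for this example.

```python
import math

def ssm_step(h, x, B, C, dt, A, D):
    """One recurrence step for a single head (illustrative sketch).

    h  : [head_dim][d_state] running state, updated in place
    x  : [head_dim] input for this token
    B,C: [d_state] input/output projections for this token
    dt : positive step size; A: negative scalar decay rate; D: skip weight
    """
    decay = math.exp(dt * A)  # scalar decay, between 0 and 1 when A < 0
    y = []
    for p in range(len(x)):
        row = h[p]
        acc = 0.0
        for n in range(len(B)):
            # h <- decay * h + dt * outer(x, B)
            row[n] = decay * row[n] + dt * x[p] * B[n]
            acc += row[n] * C[n]
        y.append(acc + D * x[p])  # y = h @ C + D * x
    return y

# Toy sizes (hypothetical): head_dim=2, d_state=3.
h = [[0.0] * 3 for _ in range(2)]
y1 = ssm_step(h, [1.0, 2.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0], dt=0.5, A=-1.0, D=0.0)
y2 = ssm_step(h, [0.0, 0.0], [1.0, 0.0, 1.0], [1.0, 1.0, 1.0], dt=0.5, A=-1.0, D=0.0)
print(y1, y2)
```

Decoding one token at a time calls a step like this once per token and carries `h` between calls, which is why no loop construct is needed inside the graph itself.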

| surface | bundle size | prefill at S=1 (tok/s) | decode (tok/s) |
|---|---|---:|---:|
| **1b int8lin, M4 Max** (release `llm-benchmark`, p=128 g=256) | 1.63 GB | 136.7 | **136.5** |
| **1b int8lin, iPhone 17 Pro** (one-shot runner) | 1.63 GB | 30.1–32.2 | **30.2–31.3** |
| 350m fp16, M4 Max | 0.66 GB | 193.2 | **191.1** |

Numerics: **16/16 teacher-forced single-step top-1 matches against the fp32 HF oracle, plus an
HF-cache-seeded decode step** (the [zoo](https://github.com/john-rocky/coreai-model-zoo) ship gate, evaluated on an
oracle whose top-2 margin is ≥ 0.1 at every position). The iPhone greedy sequences are
**24/24 token-identical to the Mac GPU** on both fixed prompts, in both runs.
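
The gate described above can be sketched in a few lines of Python. This is a hedged illustration only: the helper names (`argmax`, `top2_margin`, `gate`) are hypothetical, the scores are toy values, and whether the zoo measures the 0.1 margin on logits or on probabilities is not stated here, so the sketch simply uses whatever scores it is given.

```python
def argmax(scores):
    """Index of the largest score."""
    return max(range(len(scores)), key=scores.__getitem__)

def top2_margin(scores):
    """Gap between the best and second-best score."""
    best, second = sorted(scores, reverse=True)[:2]
    return best - second

def gate(oracle, candidate, min_margin=0.1):
    """Compare per-position top-1 picks and check the oracle's margins.

    oracle, candidate: one score vector per teacher-forced position.
    Returns (matches, total, oracle_ok).
    """
    matches = sum(argmax(o) == argmax(c) for o, c in zip(oracle, candidate))
    oracle_ok = all(top2_margin(o) >= min_margin for o in oracle)
    return matches, len(oracle), oracle_ok

# Toy data: two positions, three-token vocabulary.
oracle = [[2.0, 1.0, 0.0], [0.1, 0.9, 0.3]]
candidate = [[1.9, 1.1, 0.0], [0.0, 1.0, 0.2]]
print(gate(oracle, candidate))
```

Requiring a minimum oracle margin keeps near-tie positions out of the comparison, so a top-1 mismatch reflects a real numerical difference rather than a coin flip between two almost equal tokens.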

## Bundles

- `gpu-pipelined/granite_4_0_h_1b_decode_int8lin/`: full LanguageBundle (`metadata.json` +
`tokenizer/` + `.aimodel`), **int8 linear per-block-32** (scale-multiply dequant, no LUT),
fp16 embed + tied lm_head in-graph, 1.63 GB. The ship configuration for both Mac and iPhone.
- `gpu-pipelined/granite_4_0_h_350m_decode_fp16/`: the 350m as fp16, 0.66 GB. At this size
the model is overhead-bound, not bandwidth-bound: int8 measured *slower* than fp16 (185.8
vs 191.1 tok/s) **and** fails the oracle gate (`shared_mlp.output_linear` is block-32-sensitive),
so the 350m ships fp16.
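
As a rough picture of what "int8 linear per-block-32 (scale-multiply dequant, no LUT)" means, the sketch below quantizes a weight row in blocks of 32 values, each with its own scale, and dequantizes by a single multiply. The details are assumptions for illustration: symmetric quantization with `scale = max|w| / 127` and no zero point is a common choice, but the exact scheme used by the conversion script is not confirmed here, and the function names are hypothetical.

```python
BLOCK = 32  # values sharing one scale

def quantize_block(w):
    """Symmetric int8 for one block: q = round(w / scale), scale = max|w| / 127."""
    amax = max(abs(v) for v in w)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in w]
    return q, scale

def quantize_row(row):
    """Split a row into blocks of BLOCK values and quantize each independently."""
    assert len(row) % BLOCK == 0, "row length must be a multiple of the block size"
    return [quantize_block(row[i:i + BLOCK]) for i in range(0, len(row), BLOCK)]

def dequantize_row(qblocks):
    """Scale-multiply dequant: w ~= q * scale, no lookup table involved."""
    return [q * scale for qs, scale in qblocks for q in qs]

# Toy row with one outlier in the first block.
row = [0.01 * ((i % 7) - 3) for i in range(64)]
row[5] = 4.0
qblocks = quantize_row(row)
restored = dequantize_row(qblocks)
worst = [max(abs(a - b) for a, b in zip(row[i:i + BLOCK], restored[i:i + BLOCK]))
         for i in range(0, 64, BLOCK)]
print(worst)  # per-block worst-case error
```

In the toy row, the outlier inflates the scale of its own block only, so the small values sharing that block lose precision while the other block is unaffected. A layer whose weights behave like the first block is one plausible reading of "block-32-sensitive", though the zoo's post-mortem is the authoritative account.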

`input_ids` is static `[1,1]` (the selective scan at S=1 ≡ one loop-free recurrence step);
`position_ids` and the KV sequence length stay dynamic, so `EngineFactory` classifies the bundle as
dynamic and selects the pipelined engine. States: growing KV (4 attention layers) + conv columns `[36,1,conv_dim,3]`
+ SSM state `[36,1,48,64,128]` (both fixed-shape, carried by the extra-states patch).
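
For a sense of scale, the fixed-shape states can be sized with simple arithmetic. The sketch below is an estimate under assumptions: the state dtype is not stated above, so both fp16 and fp32 are shown, and `conv_dim` is left as a parameter because its value is not given.

```python
from math import prod

def state_bytes(shape, bytes_per_elem):
    """Total bytes for a dense tensor of the given shape."""
    return prod(shape) * bytes_per_elem

# SSM state shape from the text above; the dtype is an assumption.
SSM_SHAPE = (36, 1, 48, 64, 128)
ssm_elems = prod(SSM_SHAPE)
ssm_mib_fp16 = state_bytes(SSM_SHAPE, 2) / 2**20
ssm_mib_fp32 = state_bytes(SSM_SHAPE, 4) / 2**20
print(ssm_elems, ssm_mib_fp16, ssm_mib_fp32)

def conv_state_bytes(conv_dim, bytes_per_elem):
    """Conv columns [36, 1, conv_dim, 3]; conv_dim must be supplied by the caller."""
    return state_bytes((36, 1, conv_dim, 3), bytes_per_elem)
```

Unlike the KV cache, these sizes do not depend on how many tokens have been generated, which is what lets them be carried as fixed-shape extra states.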

## Run

Apply the engine patch stack from the
[zoo](https://github.com/john-rocky/coreai-model-zoo) (`apps/coreai-shared-product.patch` →
`apps/coreai-pipelined-extra-states.patch`; Apple's repo is issues-only, so capabilities ship
as patches), then run:

```bash
COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model granite_4_0_h_1b_decode_int8lin -p 128 -g 256 -n 3
```

- Set `COREAI_CHUNK_THRESHOLD=1` **before engine creation** so that prefill runs as pipelined S=1 steps
(prompt tok/s ≈ decode tok/s).
- **Never call `engine.warmup()`**: it warms query length 256, which the static `[1,1]` graph
rejects. A 1-token generate after load serves as the warmup (`llm-runner` needs
`--warmup exact --warmup-length 1`).
- Benchmark **Release** builds only (a Debug engine measures ~3× slower).

## iPhone

On an iPhone 17 Pro the 1b int8lin runs at **~31 tok/s, about 84% of the naive bandwidth ceiling**
(~60 GB/s ÷ 1.6 GB/token ≈ 37 tok/s), so decode is effectively memory-bandwidth saturated; the SSM
scan costs nothing extra at S=1. Cold GPU specialization takes **5.7 s** and warm loads take **1.9 s**
(content-keyed cache: no AOT compile is needed, and at 1.6 GB no increased-memory entitlement
is required, unlike 2B-class bundles).
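
The bandwidth-ceiling estimate above is a one-line division, spelled out here as a small sketch. It assumes, as the text does, that each decoded token reads the whole bundle once, and it uses the table's 1.63 GB figure rather than the rounded 1.6 GB; the ~60 GB/s bandwidth is the approximate value quoted above, not a measured one.

```python
def bandwidth_ceiling_tok_s(bandwidth_gb_s, gb_read_per_token):
    """Naive upper bound: every token streams all weights from memory once."""
    return bandwidth_gb_s / gb_read_per_token

ceiling = bandwidth_ceiling_tok_s(60.0, 1.63)  # a little under 37 tok/s
utilization = 31.0 / ceiling                   # measured ~31 tok/s vs the ceiling
print(round(ceiling, 1), round(utilization, 2))
```

The estimate ignores state reads and writes, KV traffic, and any caching effects, which is why it is called naive; it is useful mainly as a sanity check that decode is limited by memory bandwidth rather than compute.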

## Reproduce

Conversion script (self-contained) + method page in the zoo:
[`conversion/export_granite4h_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_granite4h_decode_pipelined.py)
(`int8lin`, or `fp16 --hf-id ibm-granite/granite-4.0-h-350m`) ·
[`zoo/granite-4.0-h.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/granite-4.0-h.md)
(includes the 350m int8 post-mortem and the oracle-margin rule) ·
[`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)

## License

Model weights: **Apache-2.0** (IBM Granite; `LICENSE` included). Conversion code: BSD-3-Clause
(see the zoo).