int8hu device-ship bundle row (+17-21% iPhone); per-channel delegate-bug note

eaa1888 verified 16 days ago

5.29 kB

	---
	license: apache-2.0
	base_model:
	- ibm-granite/granite-4.0-h-1b
	- ibm-granite/granite-4.0-h-350m
	tags:
	- apple
	- coreai
	- aimodel
	- on-device
	- granite
	- mamba2
	---

	# Granite 4.0-H 1B / 350M — Apple Core AI (`.aimodel`)

	IBM Granite 4.0-H (Mamba2 + attention hybrid; 1b: 36 Mamba2 mixers + 4 NoPE GQA attention
	layers) converted to Apple Core AI for iOS 27 / macOS 27 (beta), riding Apple's
	`coreai-pipelined` GPU engine via the decode-only loop-free export — async encode,
	on-GPU argmax sampling, on-device KV growth, zero custom kernels.

	The first SSM-scan architecture on this path: at S=1 the Mamba2 selective scan is a
	single recurrence step (no `while_loop` in the graph), and the conv/SSM states ride as two
	fixed-shape extra states — the same shape-class as Qwen3.5's GDN, so no engine changes
	beyond the existing patch stack.

	\| surface \| bundle \| prefill (S=1) \| decode \|
	\|---\|---\|---:\|---:\|
	\| 1b int8hu (int8 head), iPhone 17 Pro (one-shot runner) — device ship \| 1.79 GB \| 35.1–37.0 \| 35.4–37.1 tok/s \|
	\| 1b int8hu (int8 head), M4 Max \| 1.79 GB \| 134.9 \| 134.2 tok/s \|
	\| 1b int8lin, M4 Max (release `llm-benchmark`, p=128 g=256) — Mac ship \| 1.63 GB \| 136.7 \| 136.5 tok/s \|
	\| 1b int8lin, iPhone 17 Pro (one-shot runner) \| 1.63 GB \| 30.1–32.2 \| 30.2–31.3 tok/s \|
	\| 350m fp16, M4 Max \| 0.66 GB \| 193.2 \| 191.1 tok/s \|

	Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded
	decode step** (the [zoo](https://github.com/john-rocky/coreai-model-zoo) ship gate, on an
	oracle whose top-2 margin is ≥ 0.1 at every position), and the iPhone greedy sequences are
	24/24 token-identical to the Mac GPU on both fixed prompts, both runs.

	## Bundles

	- `gpu-pipelined/granite_4_0_h_1b_decode_int8hu_block32_sym/` — int8lin + the tied lm_head
	untied and quantized absmax per-block-32 int8 (`symmetric`, no clipping — clipping
	corrupts big-vocab heads), 1.79 GB. The device ship: +17–21% decode on iPhone (the head
	was ~10% of the per-token read on the bandwidth-saturated surface; on the Mac it is ~flat,
	the engine pipeline hides the head there). Oracle gate 16/16 + decode step; device numerics
	24/24 ≡ Mac-GPU on all 3 runs. (Do not re-quantize heads per-channel: per-channel axis-0
	int8 is broken on the current beta GPU delegate — garbage logits.)
	- `gpu-pipelined/granite_4_0_h_1b_decode_int8lin/` — full LanguageBundle (`metadata.json` +
	`tokenizer/` + `.aimodel`), int8 linear per-block-32 (scale-multiply dequant, no LUT),
	fp16 embed + tied lm_head in-graph, 1.63 GB. The Mac ship configuration (136.5 vs 134.2)
	and the lighter iPhone alternative.
	- `gpu-pipelined/granite_4_0_h_350m_decode_fp16/` — the 350m as fp16, 0.66 GB. At this size
	the model is overhead-bound, not bandwidth-bound: int8 measured slower than fp16 (185.8
	vs 191.1) and fails the oracle gate (`shared_mlp.output_linear` is block-32-sensitive),
	so the 350m ships fp16.

	`input_ids` is STATIC `[1,1]` (the selective scan at S=1 ≡ one loop-free recurrence step);
	position_ids + KV seq stay dynamic, so `EngineFactory` classifies the bundle dynamic →
	pipelined engine. States: growing KV (4 attention layers) + conv columns `[36,1,conv_dim,3]`
	+ SSM state `[36,1,48,64,128]` (fixed shape, carried by the extra-states patch).

	## Run

	Needs the engine patch stack from the
	[zoo](https://github.com/john-rocky/coreai-model-zoo) (`apps/coreai-shared-product.patch` →
	`apps/coreai-pipelined-extra-states.patch`; Apple's repo is issues-only, so capabilities ship
	as patches), then:

	```bash
	COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model granite_4_0_h_1b_decode_int8lin -p 128 -g 256 -n 3
	```

	- `COREAI_CHUNK_THRESHOLD=1` before engine creation — prefill runs as pipelined S=1 steps
	(prompt tok/s ≈ decode tok/s).
	- Never call `engine.warmup()` — it warms query length 256 and the static `[1,1]` graph
	rejects it. A 1-token generate after load is the warmup (`llm-runner` needs
	`--warmup exact --warmup-length 1`).
	- Benchmark Release builds only (a Debug engine measures ~3× slow).

	## iPhone

	The 1b int8lin runs at ~31 tok/s ≈ ~84% of the naive bandwidth ceiling (~60 GB/s ÷
	1.6 GB/token ≈ 37) on an iPhone 17 Pro — effectively memory-bandwidth saturated; the SSM
	scan costs nothing extra at S=1. Cold GPU specialization 5.7 s, warm loads 1.9 s
	(content-keyed cache — no AOT compile needed, and at 1.6 GB no increased-memory entitlement
	is required, unlike 2B-class bundles).

	## Reproduce

	Conversion script (self-contained) + method page in the zoo:
	[`conversion/export_granite4h_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_granite4h_decode_pipelined.py)
	(`int8lin`, or `fp16 --hf-id ibm-granite/granite-4.0-h-350m`) ·
	[`zoo/granite-4.0-h.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/granite-4.0-h.md)
	(includes the 350m int8 post-mortem and the oracle-margin rule) ·
	[`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)

	## License

	Model weights: Apache-2.0 (IBM Granite; `LICENSE` included). Conversion code: BSD-3-Clause
	(see the zoo).