mlboydaisuke
/

granite-4.0-h-CoreAI

@@ -26,8 +26,10 @@ beyond the existing patch stack.
 | surface | bundle | prefill (S=1) | decode |
 |---|---|---:|---:|
-| **1b int8lin, M4 Max** (release `llm-benchmark`, p=128 g=256) | 1.63 GB | 136.7 | **136.5 tok/s** |
-| **1b int8lin, iPhone 17 Pro** (one-shot runner) | 1.63 GB | 30.1–32.2 | **30.2–31.3 tok/s** |
 | 350m fp16, M4 Max | 0.66 GB | 193.2 | **191.1 tok/s** |
 Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded
@@ -37,9 +39,17 @@ oracle whose top-2 margin is ≥ 0.1 at every position), and the iPhone greedy s
 ## Bundles
 - `gpu-pipelined/granite_4_0_h_1b_decode_int8lin/` — full LanguageBundle (`metadata.json` +
   `tokenizer/` + `.aimodel`), **int8 linear per-block-32** (scale-multiply dequant, no LUT),
-  fp16 embed + tied lm_head in-graph, 1.63 GB. The ship configuration for both Mac and iPhone.
 - `gpu-pipelined/granite_4_0_h_350m_decode_fp16/` — the 350m as fp16, 0.66 GB. At this size
   the model is overhead-bound, not bandwidth-bound: int8 measured *slower* than fp16 (185.8
   vs 191.1) **and** fails the oracle gate (`shared_mlp.output_linear` is block-32-sensitive),

 | surface | bundle | prefill (S=1) | decode |
 |---|---|---:|---:|
+| **1b int8hu (int8 head), iPhone 17 Pro** (one-shot runner) — device ship | 1.79 GB | 35.1–37.0 | **35.4–37.1 tok/s** |
+| 1b int8hu (int8 head), M4 Max | 1.79 GB | 134.9 | 134.2 tok/s |
+| **1b int8lin, M4 Max** (release `llm-benchmark`, p=128 g=256) — Mac ship | 1.63 GB | 136.7 | **136.5 tok/s** |
+| 1b int8lin, iPhone 17 Pro (one-shot runner) | 1.63 GB | 30.1–32.2 | **30.2–31.3 tok/s** |
 | 350m fp16, M4 Max | 0.66 GB | 193.2 | **191.1 tok/s** |
 Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded
 ## Bundles
+- `gpu-pipelined/granite_4_0_h_1b_decode_int8hu_block32_sym/` — int8lin + the tied lm_head
+  untied and quantized **absmax per-block-32 int8** (`symmetric`, no clipping — clipping
+  corrupts big-vocab heads), 1.79 GB. **The device ship: +17–21% decode on iPhone** (the head
+  was ~10% of the per-token read on the bandwidth-saturated surface; on the Mac it is ~flat,
+  the engine pipeline hides the head there). Oracle gate 16/16 + decode step; device numerics
+  24/24 ≡ Mac-GPU on all 3 runs. (Do not re-quantize heads per-channel: per-channel axis-0
+  int8 is broken on the current beta GPU delegate — garbage logits.)
 - `gpu-pipelined/granite_4_0_h_1b_decode_int8lin/` — full LanguageBundle (`metadata.json` +
   `tokenizer/` + `.aimodel`), **int8 linear per-block-32** (scale-multiply dequant, no LUT),
+  fp16 embed + tied lm_head in-graph, 1.63 GB. The Mac ship configuration (136.5 vs 134.2)
+  and the lighter iPhone alternative.
 - `gpu-pipelined/granite_4_0_h_350m_decode_fp16/` — the 350m as fp16, 0.66 GB. At this size
   the model is overhead-bound, not bandwidth-bound: int8 measured *slower* than fp16 (185.8
   vs 191.1) **and** fails the oracle gate (`shared_mlp.output_linear` is block-32-sensitive),