int8hu device-ship bundle row (+17-21% iPhone); per-channel delegate-bug note
README.md
CHANGED
|
@@ -26,8 +26,10 @@ beyond the existing patch stack.
|
|
| 26 |
|
| 27 |
| surface | bundle | prefill (S=1) | decode |
|
| 28 |
|---|---|---:|---:|
|
| 29 |
-
| **1b
|
| 30 |
-
|
|
|
|
|
|
|
|
| 31 |
| 350m fp16, M4 Max | 0.66 GB | 193.2 | **191.1 tok/s** |
|
| 32 |
|
| 33 |
Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded
|
|
@@ -37,9 +39,17 @@ oracle whose top-2 margin is ≥ 0.1 at every position), and the iPhone greedy s
|
|
| 37 |
|
| 38 |
## Bundles
|
| 39 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 40 |
- `gpu-pipelined/granite_4_0_h_1b_decode_int8lin/`: full LanguageBundle (`metadata.json` +
|
| 41 |
`tokenizer/` + `.aimodel`), **int8 linear per-block-32** (scale-multiply dequant, no LUT),
|
| 42 |
-
fp16 embed + tied lm_head in-graph, 1.63 GB. The ship configuration
|
|
|
|
| 43 |
- `gpu-pipelined/granite_4_0_h_350m_decode_fp16/`: the 350m as fp16, 0.66 GB. At this size
|
| 44 |
the model is overhead-bound, not bandwidth-bound: int8 measured *slower* than fp16 (185.8
|
| 45 |
vs 191.1) **and** fails the oracle gate (`shared_mlp.output_linear` is block-32-sensitive),
|
|
|
|
| 26 |
|
| 27 |
| surface | bundle | prefill (S=1) | decode |
|
| 28 |
|---|---|---:|---:|
|
| 29 |
+
| **1b int8hu (int8 head), iPhone 17 Pro** (one-shot runner), device ship | 1.79 GB | 35.1–37.0 | **35.4–37.1 tok/s** |
|
| 30 |
+
| 1b int8hu (int8 head), M4 Max | 1.79 GB | 134.9 | 134.2 tok/s |
|
| 31 |
+
| **1b int8lin, M4 Max** (release `llm-benchmark`, p=128 g=256), Mac ship | 1.63 GB | 136.7 | **136.5 tok/s** |
|
| 32 |
+
| 1b int8lin, iPhone 17 Pro (one-shot runner) | 1.63 GB | 30.1–32.2 | **30.2–31.3 tok/s** |
|
| 33 |
| 350m fp16, M4 Max | 0.66 GB | 193.2 | **191.1 tok/s** |
|
| 34 |
|
| 35 |
Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded
|
|
|
|
| 39 |
|
| 40 |
## Bundles
|
| 41 |
|
| 42 |
+
- `gpu-pipelined/granite_4_0_h_1b_decode_int8hu_block32_sym/`: int8lin + the tied lm_head
|
| 43 |
+
untied and quantized **absmax per-block-32 int8** (`symmetric`, no clipping; clipping
|
| 44 |
+
corrupts big-vocab heads), 1.79 GB. **The device ship: +17–21% decode on iPhone** (the head
|
| 45 |
+
was ~10% of the per-token read on the bandwidth-saturated surface; on the Mac it is ~flat,
|
| 46 |
+
the engine pipeline hides the head there). Oracle gate 16/16 + decode step; device numerics
|
| 47 |
+
24/24 ≡ Mac-GPU on all 3 runs. (Do not re-quantize heads per-channel: per-channel axis-0
|
| 48 |
+
int8 is broken on the current beta GPU delegate: garbage logits.)
|
| 49 |
- `gpu-pipelined/granite_4_0_h_1b_decode_int8lin/`: full LanguageBundle (`metadata.json` +
|
| 50 |
`tokenizer/` + `.aimodel`), **int8 linear per-block-32** (scale-multiply dequant, no LUT),
|
| 51 |
+
fp16 embed + tied lm_head in-graph, 1.63 GB. The Mac ship configuration (136.5 vs 134.2)
|
| 52 |
+
and the lighter iPhone alternative.
|
| 53 |
- `gpu-pipelined/granite_4_0_h_350m_decode_fp16/`: the 350m as fp16, 0.66 GB. At this size
|
| 54 |
the model is overhead-bound, not bandwidth-bound: int8 measured *slower* than fp16 (185.8
|
| 55 |
vs 191.1) **and** fails the oracle gate (`shared_mlp.output_linear` is block-32-sensitive),
|