Commit eaa1888 by mlboydaisuke · verified · 1 parent: 0fe4178

int8hu device-ship bundle row (+17-21% iPhone); per-channel delegate-bug note

Files changed (1): README.md (+13 -3)
@@ -26,8 +26,10 @@ beyond the existing patch stack.
  
  | surface | bundle | prefill (S=1) | decode |
  |---|---|---:|---:|
- | **1b int8lin, M4 Max** (release `llm-benchmark`, p=128 g=256) | 1.63 GB | 136.7 | **136.5 tok/s** |
- | **1b int8lin, iPhone 17 Pro** (one-shot runner) | 1.63 GB | 30.1–32.2 | **30.2–31.3 tok/s** |
+ | **1b int8hu (int8 head), iPhone 17 Pro** (one-shot runner) — device ship | 1.79 GB | 35.1–37.0 | **35.4–37.1 tok/s** |
+ | 1b int8hu (int8 head), M4 Max | 1.79 GB | 134.9 | 134.2 tok/s |
+ | **1b int8lin, M4 Max** (release `llm-benchmark`, p=128 g=256) — Mac ship | 1.63 GB | 136.7 | **136.5 tok/s** |
+ | 1b int8lin, iPhone 17 Pro (one-shot runner) | 1.63 GB | 30.1–32.2 | **30.2–31.3 tok/s** |
  | 350m fp16, M4 Max | 0.66 GB | 193.2 | **191.1 tok/s** |
  
  Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded
@@ -37,9 +39,17 @@ oracle whose top-2 margin is ≥ 0.1 at every position), and the iPhone greedy s
  
  ## Bundles
  
+ - `gpu-pipelined/granite_4_0_h_1b_decode_int8hu_block32_sym/` — int8lin + the tied lm_head
+ untied and quantized **absmax per-block-32 int8** (`symmetric`, no clipping — clipping
+ corrupts big-vocab heads), 1.79 GB. **The device ship: +17–21% decode on iPhone** (the head
+ was ~10% of the per-token read on the bandwidth-saturated surface; on the Mac it is ~flat,
+ the engine pipeline hides the head there). Oracle gate 16/16 + decode step; device numerics
+ 24/24 ≡ Mac-GPU on all 3 runs. (Do not re-quantize heads per-channel: per-channel axis-0
+ int8 is broken on the current beta GPU delegate — garbage logits.)
  - `gpu-pipelined/granite_4_0_h_1b_decode_int8lin/` — full LanguageBundle (`metadata.json` +
  `tokenizer/` + `.aimodel`), **int8 linear per-block-32** (scale-multiply dequant, no LUT),
- fp16 embed + tied lm_head in-graph, 1.63 GB. The ship configuration for both Mac and iPhone.
+ fp16 embed + tied lm_head in-graph, 1.63 GB. The Mac ship configuration (136.5 vs 134.2)
+ and the lighter iPhone alternative.
  - `gpu-pipelined/granite_4_0_h_350m_decode_fp16/` — the 350m as fp16, 0.66 GB. At this size
  the model is overhead-bound, not bandwidth-bound: int8 measured *slower* than fp16 (185.8
  vs 191.1) **and** fails the oracle gate (`shared_mlp.output_linear` is block-32-sensitive),
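
For readers unfamiliar with the scheme named in the int8hu bullet, the following is a minimal sketch of absmax, symmetric, per-block-32 int8 quantization with scale-multiply dequantization. It is not the converter code behind these bundles. The function names `quantize_row` and `dequantize_row` are hypothetical, and the convention `scale = absmax / 127` is an assumption (a common choice for symmetric int8), not something the README states.

```python
BLOCK = 32  # block size named in the README ("per-block-32")


def quantize_row(row, block=BLOCK):
    """Quantize one weight row to int8 codes with one absmax scale per block."""
    if len(row) % block != 0:
        raise ValueError("row length must be a multiple of the block size")
    codes, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        absmax = max(abs(w) for w in chunk)
        # Symmetric, no clipping: the largest magnitude in the block maps to +/-127.
        # Assumed convention; an all-zero block gets a dummy scale of 1.0.
        scale = absmax / 127.0 if absmax > 0.0 else 1.0
        scales.append(scale)
        codes.extend(int(round(w / scale)) for w in chunk)
    return codes, scales


def dequantize_row(codes, scales, block=BLOCK):
    """Scale-multiply dequant: each weight is code * (scale of its block), no lookup table."""
    return [c * scales[i // block] for i, c in enumerate(codes)]


# Toy row: two blocks whose magnitudes differ by a factor of 200.
row = [0.01 * (i - 16) for i in range(32)] + [2.0 * (i - 16) for i in range(32)]
codes, scales = quantize_row(row)
restored = dequantize_row(codes, scales)
worst_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(scales), min(codes), max(codes), worst_error)
```

Because each block carries its own scale, the small-magnitude block in the toy row is rounded on a grid 200 times finer than the large one. A single scale for the whole row would tie both blocks to the coarser grid.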