mlboydaisuke committed on
Commit d97bd51 · verified · 1 Parent(s): b647bda

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +8 -1
README.md CHANGED
@@ -31,13 +31,20 @@ published numbers, nothing experimental.
31
 
32
  | Category | File | Precision | Size | Speed |
33
  |---|---|---|---|---|
34
- | **GPU pipelined ★★** (iOS + macOS) | `gpu-pipelined/qwen3_5_0_8b_decode_int8lin/` (full bundle: `.aimodel` + tokenizer + metadata) | int8 linear per-block-32 (no LUT), decode-only loop-free, dynamic KV | 1.0 GB | **50.3–51.5 tok/s** iPhone 17 Pro · **204 tok/s** M4 Max |
 
35
  | **iOS GPU ★** | `ios-gpu/qwen3_5_0_8b_ios_hc0_int8v3.aimodel` | int8 fused Metal kernels (k-means LUT, fp32 accumulate) + GPU argmax head, static ctx-2048 | 1.3 GB | **42.5–45.4 tok/s** decode |
36
  | **iOS GPU ★ companion** | `ios-gpu/qwen3_5_0_8b_ios_hc_prefill_q16_b2048_int8.aimodel` | chunked-prefill graph (q=16 blocks, int8 LUT) | 1.0 GB | **147 tok/s prefill** (185-tok prompt: 4.2 s → 1.26 s) |
37
  | iOS GPU (previous) | `ios-gpu/qwen3_5_0_8b_ios_hc0.aimodel` | fp16, static ctx-2048 | 1.4 GB | 27.7 tok/s |
38
  | **iOS ANE** | `ios-ane/qwen3_5_0_8b_decode_int8.aimodel` | int8 k-means (fp16 embed), dynamic | 969 MB | **14.7 tok/s** |
39
  | **macOS GPU** | `macos/qwen3_5_0_8b_decode_int8.aimodel` | same bundle as iOS ANE | 969 MB | **58.5 tok/s** (release build) |
40
 
41
  - The **★★ pipelined bundle** is the fastest decode on BOTH platforms, with zero custom
42
  kernels: a decode-only loop-free graph (static `[1,1]` query, dynamic KV) that rides Apple's
43
  `coreai-pipelined` engine (`CoreAILanguageModels` / `EngineFactory`; async non-blocking
 
31
 
32
  | Category | File | Precision | Size | Speed |
33
  |---|---|---|---|---|
34
+ | **GPU pipelined ★★★** (iOS + macOS, NEW ship) | `gpu-pipelined/qwen3_5_0_8b_decode_int8hu_perchan_sym/` (full bundle: `.aimodel` + tokenizer + metadata) | int8 linear per-block-32 + **per-channel absmax int8 lm_head** (untied), decode-only loop-free, dynamic KV | 1.3 GB | **69.7–74.0 tok/s** iPhone 17 Pro · **210 tok/s** M4 Max |
35
+ | **GPU pipelined ★★** (iOS + macOS) | `gpu-pipelined/qwen3_5_0_8b_decode_int8lin/` (full bundle: `.aimodel` + tokenizer + metadata) | int8 linear per-block-32 (no LUT), fp16 tied head, decode-only loop-free, dynamic KV | 1.0 GB | **50.3–51.5 tok/s** iPhone 17 Pro · **204 tok/s** M4 Max |
36
  | **iOS GPU ★** | `ios-gpu/qwen3_5_0_8b_ios_hc0_int8v3.aimodel` | int8 fused Metal kernels (k-means LUT, fp32 accumulate) + GPU argmax head, static ctx-2048 | 1.3 GB | **42.5–45.4 tok/s** decode |
37
  | **iOS GPU ★ companion** | `ios-gpu/qwen3_5_0_8b_ios_hc_prefill_q16_b2048_int8.aimodel` | chunked-prefill graph (q=16 blocks, int8 LUT) | 1.0 GB | **147 tok/s prefill** (185-tok prompt: 4.2 s → 1.26 s) |
38
  | iOS GPU (previous) | `ios-gpu/qwen3_5_0_8b_ios_hc0.aimodel` | fp16, static ctx-2048 | 1.4 GB | 27.7 tok/s |
39
  | **iOS ANE** | `ios-ane/qwen3_5_0_8b_decode_int8.aimodel` | int8 k-means (fp16 embed), dynamic | 969 MB | **14.7 tok/s** |
40
  | **macOS GPU** | `macos/qwen3_5_0_8b_decode_int8.aimodel` | same bundle as iOS ANE | 969 MB | **58.5 tok/s** (release build) |
41
 
42
+ - The **★★★ ship bundle** adds an untied lm_head quantized as **per-channel absmax int8**
43
+ (`int8hu --head-quant perchan --head-sym`): the fp16 head was 54% of the per-token weight
44
+ read on the bandwidth-bound phone, so quantizing it gives +40% on iPhone (and +3% on M4 Max).
45
+ Quantize big-vocab heads with plain absmax `symmetric`; the default
46
+ `symmetric_with_clipping` clips outlier head rows and corrupts top-1s. Greedy rollouts are
47
+ token-identical to the ★★ bundle; same run contract.
48
  - The **★★ pipelined bundle** is the fastest decode on BOTH platforms, with zero custom
49
  kernels: a decode-only loop-free graph (static `[1,1]` query, dynamic KV) that rides Apple's
50
  `coreai-pipelined` engine (`CoreAILanguageModels` / `EngineFactory`; async non-blocking
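
The ★★★ note in this diff makes two claims that a toy sketch can make concrete: plain per-channel absmax int8 keeps outlier head rows intact where a clipped range does not, and shrinking the head's bytes on a bandwidth-bound device yields a speedup in the ballpark of the reported +40%. The Python below is only an illustration under stated assumptions. It is not the `int8hu` tool and not the real `symmetric_with_clipping` implementation: every function name, the clipping rule (cap each row at 2x its mean absolute weight), and the 3x4 head are hypothetical, and the speedup estimate assumes decode time is proportional to weight bytes read and that int8 halves the head's bytes (scale overhead ignored).

```python
def quantize_rows(weight, clip_mult=None):
    """Symmetric int8 quantization with one scale per output row (channel).

    clip_mult=None  -> plain absmax: the largest |w| in the row maps to 127.
    clip_mult=k     -> hypothetical clipping rule: the range is capped at
                       k * mean(|w|), so outlier weights saturate at +/-127.
    """
    q_rows, scales = [], []
    for row in weight:
        absmax = max(abs(w) for w in row)
        limit = absmax
        if clip_mult is not None:
            mean_abs = sum(abs(w) for w in row) / len(row)
            limit = min(absmax, clip_mult * mean_abs)
        scale = limit / 127.0 if limit > 0 else 1.0  # guard all-zero rows
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def head_logits(q_rows, scales, hidden):
    """Dequantize on the fly: logit[v] = scale[v] * dot(q[v], hidden)."""
    return [s * sum(q * h for q, h in zip(row, hidden))
            for row, s in zip(q_rows, scales)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


def bandwidth_speedup(head_fraction, head_bytes_ratio):
    """Estimated decode speedup if time is proportional to bytes read."""
    return 1.0 / ((1.0 - head_fraction) + head_fraction * head_bytes_ratio)


# Toy lm_head: 3 vocabulary rows, hidden size 4. Row 0 wins only because of
# one large (outlier) weight.
head = [
    [4.0, 0.1, 0.1, 0.1],
    [1.0, 1.0, 1.0, 1.0],
    [0.5, -0.5, 0.5, -0.5],
]
hidden = [1.0, 1.0, 1.0, 1.0]

top1_fp = argmax([sum(w * h for w, h in zip(row, hidden)) for row in head])
top1_absmax = argmax(head_logits(*quantize_rows(head), hidden))
top1_clipped = argmax(head_logits(*quantize_rows(head, clip_mult=2.0), hidden))

# Head at 54% of per-token bytes, fp16 -> int8 assumed to halve those bytes.
estimate = bandwidth_speedup(0.54, 0.5)
```

In this toy case the absmax-quantized head picks the same token as the float head, while the clipped variant saturates the outlier weight in row 0 and hands the top-1 to row 1. The bandwidth estimate comes out at roughly 1.37x, which is the same order as the iPhone gain listed in the table.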