mlboydaisuke
/

qwen3.5-0.8B-CoreAI

@@ -31,13 +31,20 @@ published numbers, nothing experimental.
 | Category | File | Precision | Size | Speed |
 |---|---|---|---|---|
-| **GPU pipelined ★★** (iOS + macOS) | `gpu-pipelined/qwen3_5_0_8b_decode_int8lin/` — full bundle (`.aimodel` + tokenizer + metadata) | int8 linear per-block-32 (no LUT), decode-only loop-free, dynamic KV | 1.0 GB | **50.3–51.5 tok/s** iPhone 17 Pro · **204 tok/s** M4 Max |
 | **iOS GPU ★** | `ios-gpu/qwen3_5_0_8b_ios_hc0_int8v3.aimodel` | int8 fused Metal kernels (k-means LUT, fp32 accumulate) + GPU argmax head, static ctx-2048 | 1.3 GB | **42.5–45.4 tok/s** decode |
 | **iOS GPU ★ companion** | `ios-gpu/qwen3_5_0_8b_ios_hc_prefill_q16_b2048_int8.aimodel` | chunked-prefill graph (q=16 blocks, int8 LUT) | 1.0 GB | **147 tok/s prefill** (185-tok prompt: 4.2 s → 1.26 s) |
 | iOS GPU (previous) | `ios-gpu/qwen3_5_0_8b_ios_hc0.aimodel` | fp16, static ctx-2048 | 1.4 GB | 27.7 tok/s |
 | **iOS ANE** | `ios-ane/qwen3_5_0_8b_decode_int8.aimodel` | int8 k-means (fp16 embed), dynamic | 969 MB | **14.7 tok/s** |
 | **macOS GPU** | `macos/qwen3_5_0_8b_decode_int8.aimodel` | same bundle as iOS ANE | 969 MB | **58.5 tok/s** (release build) |
 - The **★★ pipelined bundle** is the fastest decode on BOTH platforms, with zero custom
   kernels: a decode-only loop-free graph (static `[1,1]` query, dynamic KV) that rides Apple's
   `coreai-pipelined` engine (`CoreAILanguageModels` / `EngineFactory` — async non-blocking

 | Category | File | Precision | Size | Speed |
 |---|---|---|---|---|
+| **GPU pipelined ★★★** (iOS + macOS, NEW ship) | `gpu-pipelined/qwen3_5_0_8b_decode_int8hu_perchan_sym/` — full bundle (`.aimodel` + tokenizer + metadata) | int8 linear per-block-32 + **per-channel absmax int8 lm_head** (untied), decode-only loop-free, dynamic KV | 1.3 GB | **69.7–74.0 tok/s** iPhone 17 Pro · **210 tok/s** M4 Max |
+| **GPU pipelined ★★** (iOS + macOS) | `gpu-pipelined/qwen3_5_0_8b_decode_int8lin/` — full bundle (`.aimodel` + tokenizer + metadata) | int8 linear per-block-32 (no LUT), fp16 tied head, decode-only loop-free, dynamic KV | 1.0 GB | **50.3–51.5 tok/s** iPhone 17 Pro · **204 tok/s** M4 Max |
 | **iOS GPU ★** | `ios-gpu/qwen3_5_0_8b_ios_hc0_int8v3.aimodel` | int8 fused Metal kernels (k-means LUT, fp32 accumulate) + GPU argmax head, static ctx-2048 | 1.3 GB | **42.5–45.4 tok/s** decode |
 | **iOS GPU ★ companion** | `ios-gpu/qwen3_5_0_8b_ios_hc_prefill_q16_b2048_int8.aimodel` | chunked-prefill graph (q=16 blocks, int8 LUT) | 1.0 GB | **147 tok/s prefill** (185-tok prompt: 4.2 s → 1.26 s) |
 | iOS GPU (previous) | `ios-gpu/qwen3_5_0_8b_ios_hc0.aimodel` | fp16, static ctx-2048 | 1.4 GB | 27.7 tok/s |
 | **iOS ANE** | `ios-ane/qwen3_5_0_8b_decode_int8.aimodel` | int8 k-means (fp16 embed), dynamic | 969 MB | **14.7 tok/s** |
 | **macOS GPU** | `macos/qwen3_5_0_8b_decode_int8.aimodel` | same bundle as iOS ANE | 969 MB | **58.5 tok/s** (release build) |
+- The **★★★ ship bundle** adds an untied lm_head quantized as **per-channel absmax int8**
+  (`int8hu --head-quant perchan --head-sym`): the fp16 head was 54% of the per-token weight
+  read on the bandwidth-bound phone — quantizing it is +40% on iPhone (and +3% on M4 Max).
+  Quantize big-vocab heads with plain absmax `symmetric`; the default
+  `symmetric_with_clipping` clips outlier head rows and corrupts top-1s. Greedy rollouts are
+  token-identical to the ★★ bundle; same run contract.
 - The **★★ pipelined bundle** is the fastest decode on BOTH platforms, with zero custom
   kernels: a decode-only loop-free graph (static `[1,1]` query, dynamic KV) that rides Apple's
   `coreai-pipelined` engine (`CoreAILanguageModels` / `EngineFactory` — async non-blocking