Upload README.md with huggingface_hub
README.md
CHANGED
@@ -31,13 +31,20 @@ published numbers, nothing experimental.
31
32   | Category | File | Precision | Size | Speed |
33   |---|---|---|---|---|
34 - | **GPU pipelined ★★** (iOS + macOS) | `gpu-pipelined/
35   | **iOS GPU ★** | `ios-gpu/qwen3_5_0_8b_ios_hc0_int8v3.aimodel` | int8 fused Metal kernels (k-means LUT, fp32 accumulate) + GPU argmax head, static ctx-2048 | 1.3 GB | **42.5–45.4 tok/s** decode |
36   | **iOS GPU ★ companion** | `ios-gpu/qwen3_5_0_8b_ios_hc_prefill_q16_b2048_int8.aimodel` | chunked-prefill graph (q=16 blocks, int8 LUT) | 1.0 GB | **147 tok/s prefill** (185-tok prompt: 4.2 s → 1.26 s) |
37   | iOS GPU (previous) | `ios-gpu/qwen3_5_0_8b_ios_hc0.aimodel` | fp16, static ctx-2048 | 1.4 GB | 27.7 tok/s |
38   | **iOS ANE** | `ios-ane/qwen3_5_0_8b_decode_int8.aimodel` | int8 k-means (fp16 embed), dynamic | 969 MB | **14.7 tok/s** |
39   | **macOS GPU** | `macos/qwen3_5_0_8b_decode_int8.aimodel` | same bundle as iOS ANE | 969 MB | **58.5 tok/s** (release build) |
40
41   - The **★★ pipelined bundle** is the fastest decode on BOTH platforms, with zero custom
42   kernels: a decode-only loop-free graph (static `[1,1]` query, dynamic KV) that rides Apple's
43   `coreai-pipelined` engine (`CoreAILanguageModels` / `EngineFactory` – async non-blocking

31
32   | Category | File | Precision | Size | Speed |
33   |---|---|---|---|---|
34 + | **GPU pipelined ★★★** (iOS + macOS, NEW ship) | `gpu-pipelined/qwen3_5_0_8b_decode_int8hu_perchan_sym/` – full bundle (`.aimodel` + tokenizer + metadata) | int8 linear per-block-32 + **per-channel absmax int8 lm_head** (untied), decode-only loop-free, dynamic KV | 1.3 GB | **69.7–74.0 tok/s** iPhone 17 Pro · **210 tok/s** M4 Max |
35 + | **GPU pipelined ★★** (iOS + macOS) | `gpu-pipelined/qwen3_5_0_8b_decode_int8lin/` – full bundle (`.aimodel` + tokenizer + metadata) | int8 linear per-block-32 (no LUT), fp16 tied head, decode-only loop-free, dynamic KV | 1.0 GB | **50.3–51.5 tok/s** iPhone 17 Pro · **204 tok/s** M4 Max |
36   | **iOS GPU ★** | `ios-gpu/qwen3_5_0_8b_ios_hc0_int8v3.aimodel` | int8 fused Metal kernels (k-means LUT, fp32 accumulate) + GPU argmax head, static ctx-2048 | 1.3 GB | **42.5–45.4 tok/s** decode |
37   | **iOS GPU ★ companion** | `ios-gpu/qwen3_5_0_8b_ios_hc_prefill_q16_b2048_int8.aimodel` | chunked-prefill graph (q=16 blocks, int8 LUT) | 1.0 GB | **147 tok/s prefill** (185-tok prompt: 4.2 s → 1.26 s) |
38   | iOS GPU (previous) | `ios-gpu/qwen3_5_0_8b_ios_hc0.aimodel` | fp16, static ctx-2048 | 1.4 GB | 27.7 tok/s |
39   | **iOS ANE** | `ios-ane/qwen3_5_0_8b_decode_int8.aimodel` | int8 k-means (fp16 embed), dynamic | 969 MB | **14.7 tok/s** |
40   | **macOS GPU** | `macos/qwen3_5_0_8b_decode_int8.aimodel` | same bundle as iOS ANE | 969 MB | **58.5 tok/s** (release build) |
41
42 + - The **★★★ ship bundle** adds an untied lm_head quantized as **per-channel absmax int8**
43 + (`int8hu --head-quant perchan --head-sym`): the fp16 head was 54% of the per-token weight
44 + read on the bandwidth-bound phone – quantizing it is +40% on iPhone (and +3% on M4 Max).
45 + Quantize big-vocab heads with plain absmax `symmetric`; the default
46 + `symmetric_with_clipping` clips outlier head rows and corrupts top-1s. Greedy rollouts are
47 + token-identical to the ★★ bundle; same run contract.
48   - The **★★ pipelined bundle** is the fastest decode on BOTH platforms, with zero custom
49   kernels: a decode-only loop-free graph (static `[1,1]` query, dynamic KV) that rides Apple's
50   `coreai-pipelined` engine (`CoreAILanguageModels` / `EngineFactory` – async non-blocking