mlboydaisuke committed · Commit 6cb3f9d · verified · 1 Parent(s): 726837a

Upload README.md with huggingface_hub

  1. README.md +30 -21
README.md CHANGED
@@ -16,24 +16,33 @@ Apple **Core AI** for iOS 27 / macOS 27 (beta), riding Apple's **`coreai-pipelin
  GPU engine** via the decode-only loop-free export: async encode, on-GPU argmax
  sampling, on-device KV growth, zero custom kernels.

- | surface | prefill (S=1) | decode |
  |---|---:|---:|
- | **M4 Max** (release `llm-benchmark`, p=128 g=256) | 127.0 | **127.2 tok/s** |
- | iPhone 17 Pro (one-shot runner) | 19.6–22.4 | 19.1–21.3 tok/s |

  Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded
- decode step** (the [zoo](https://github.com/john-rocky/coreai-model-zoo) ship gate), and the
- iPhone greedy sequences are **24/24 token-identical to the Mac GPU** on both fixed prompts.

- ## Bundle

- `gpu-pipelined/qwen3_5_2b_decode_int8lin/`: full LanguageBundle (`metadata.json` +
- `tokenizer/` + `.aimodel`), **int8 linear per-block-32** (no LUT; 256-entry LUT dequant is
- slow on the GPU delegate), fp16 embed + tied lm_head in-graph, 2.39 GB.
- `input_ids` is STATIC `[1,1]` (loop-free single-step GDN ≡ the scan at S=1); position_ids +
- KV seq stay dynamic, so `EngineFactory` classifies it dynamic → pipelined engine.

- ## Run (macOS, recommended surface)

  Needs the engine patch stack from the
  [zoo](https://github.com/john-rocky/coreai-model-zoo) (`apps/coreai-shared-product.patch` →
@@ -41,7 +50,7 @@ Needs the engine patch stack from the
  as patches), then:

  ```bash
- COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model qwen3_5_2b_decode_int8lin -p 128 -g 256 -n 3
  ```

  - `COREAI_CHUNK_THRESHOLD=1` **before engine creation**: prefill runs as pipelined S=1 steps
@@ -51,23 +60,23 @@ COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model qwen3_5_2b_decode_int8lin -p 128
  `--warmup exact --warmup-length 1`).
  - Benchmark **Release** builds only (a Debug engine measures ~3× slower).

- ## iPhone (runs, but not the fastest phone option)

- The same bundle runs on iPhone 17 Pro at 19–21 tok/s with exact numerics. Know before you ship:

  - Requires the **`com.apple.developer.kernel.increased-memory-limit`** entitlement: cold GPU
  specialization dies with `std::bad_alloc` at the default jetsam limit without it.
- - Cold specialization 29.1 s (then 3.0 s warm loads, content-keyed cache). Keep **≥4 GB free
- disk**: the spec cache is ~3.5 GB, and a failed cold spec leaves partial caches that make
  later attempts fail with `NSPOSIXErrorDomain code=2` at engine create; uninstall the app to
  reclaim.
- - Decode is fully bandwidth-bound (~2.4 GB weight read/token); the CoreML Qwen3.5-2B port
- (~27 tok/s) is still the faster phone option, and the
- [0.8B pipelined bundle](https://huggingface.co/mlboydaisuke/qwen3.5-0.8B-CoreAI) does 50+.

  ## Reproduce

  Conversion script (self-contained) + method page in the zoo:
  [`conversion/export_qwen3_5_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_qwen3_5_decode_pipelined.py)
- (`int8lin --hf-id Qwen/Qwen3.5-2B`) ·
  [`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)
 
  GPU engine** via the decode-only loop-free export: async encode, on-GPU argmax
  sampling, on-device KV growth, zero custom kernels.

+ | surface (ship bundle) | prefill (S=1) | decode |
  |---|---:|---:|
+ | **M4 Max** (release `llm-benchmark`, p=128 g=256) | 161.2 | **160.8 tok/s** |
+ | **iPhone 17 Pro** (one-shot runner, 2 runs × 2 trials) | 29.7–30.3 | **28–30 tok/s**, ≥ the CoreML qwen3.5-2B port (~27) |

  Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded
+ decode step** (the [zoo](https://github.com/john-rocky/coreai-model-zoo) ship gate), greedy
+ rollouts token-identical to the fp16-head bundle, and the iPhone sequences are **24/24
+ token-identical to the Mac GPU** on both fixed prompts.

+ ## Bundles

+ - **`gpu-pipelined/qwen3_5_2b_decode_int8hu_perchan_sym/` (the ship config, 2.9 GB)**:
+ transformer int8 linear per-block-32 + **untied lm_head in per-channel absmax int8**
+ (`int8hu --head-quant perchan --head-sym`). The head trick is what unlocks the speed: the
+ 248 K-vocab fp16 head was ~1.0 GB of the ~2.4 GB per-token read. Crucial detail: the head
+ must be quantized with plain **absmax `symmetric`**; the default
+ `symmetric_with_clipping` clips outlier head rows and flips oracle top-1s (full story in
+ the zoo's [pipelined-engine notes](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)).
+ - `gpu-pipelined/qwen3_5_2b_decode_int8lin/`: fp16-head variant (2.4 GB), 127 tok/s Mac /
+ 19–21 tok/s iPhone. Smaller; keep it if you want the head at full precision.
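
To make the absmax-versus-clipping point above concrete, here is a minimal NumPy sketch of per-row symmetric int8 quantization. It is not the zoo's conversion code: the function names, the toy matrix, and the percentile-based clip (a stand-in for whatever `symmetric_with_clipping` actually does) are assumptions made purely for illustration.

```python
import numpy as np


def quantize_rows_absmax(w: np.ndarray):
    """Per-row symmetric int8: each row's scale is max|w_row| / 127, nothing is clipped."""
    absmax = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / 127.0, 1.0)
    q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
    return q, scale


def quantize_rows_clipped(w: np.ndarray, clip_percentile: float = 99.0):
    """Hypothetical clipping variant: the scale comes from a percentile of |w_row|,
    so weights above that bound saturate at +/-127."""
    bound = np.percentile(np.abs(w), clip_percentile, axis=1, keepdims=True)
    scale = np.where(bound > 0, bound / 127.0, 1.0)
    q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale


# Toy "head": 8 vocab rows x 64 hidden dims, with one outlier weight in row 3.
rng = np.random.default_rng(0)
head = rng.normal(0.0, 0.02, size=(8, 64)).astype(np.float32)
head[3, 5] = 1.0

q_abs, s_abs = quantize_rows_absmax(head)
q_clip, s_clip = quantize_rows_clipped(head)

# Worst-case reconstruction error of each scheme over the whole matrix.
err_absmax = float(np.abs(dequantize(q_abs, s_abs) - head).max())
err_clipped = float(np.abs(dequantize(q_clip, s_clip) - head).max())
print(f"absmax max error:  {err_absmax:.4f}")
print(f"clipped max error: {err_clipped:.4f}")
```

In this toy case the clipped variant saturates the outlier weight, so its worst-case reconstruction error is much larger, while absmax keeps every row within half a quantization step. A distorted row of that kind is the sort of change that could alter a logit enough to flip a top-1 token.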

+ Both are full LanguageBundles (`metadata.json` + `tokenizer/` + `.aimodel`), `input_ids`
+ STATIC `[1,1]` (loop-free single-step GDN), position_ids + KV seq dynamic → `EngineFactory`
+ classifies them dynamic → pipelined engine.
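
The speedup from shrinking the head can be sanity-checked with a back-of-envelope bandwidth model. The sketch below assumes decode is purely bound by weight reads, that the int8 head costs half the bytes of the fp16 head (scales ignored), and that effective bandwidth is the same for both bundles. The function names are hypothetical, and the only inputs are figures quoted in this README and its previous revision.

```python
def implied_bandwidth(tok_per_s: float, gb_per_token: float) -> float:
    """Effective GB/s if every decoded token streams gb_per_token of weights."""
    return tok_per_s * gb_per_token


def projected_tok_per_s(bandwidth_gb_s: float, gb_per_token: float) -> float:
    """Decode rate implied by a fixed bandwidth and a per-token read size."""
    return bandwidth_gb_s / gb_per_token


FP16_READ_GB = 2.4   # per-token read of the fp16-head bundle (from the README)
FP16_HEAD_GB = 1.0   # share of that read taken by the fp16 head (from the README)
INT8_HEAD_GB = FP16_HEAD_GB / 2  # assumption: int8 halves the head bytes
INT8_READ_GB = FP16_READ_GB - FP16_HEAD_GB + INT8_HEAD_GB

# M4 Max: 127.2 tok/s measured with the fp16-head bundle.
mac_bw = implied_bandwidth(127.2, FP16_READ_GB)
mac_projected = projected_tok_per_s(mac_bw, INT8_READ_GB)

print(f"implied M4 Max bandwidth: {mac_bw:.1f} GB/s")
print(f"projected int8-head decode: {mac_projected:.1f} tok/s")
```

Under those assumptions the M4 Max projection lands near 161 tok/s, in line with the measured 160.8 tok/s. Treat it as a plausibility check rather than a performance model.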
+
+ ## Run (macOS)

  Needs the engine patch stack from the
  [zoo](https://github.com/john-rocky/coreai-model-zoo) (`apps/coreai-shared-product.patch` →
@@ -41,7 +50,7 @@ Needs the engine patch stack from the
  as patches), then:

  ```bash
+ COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model qwen3_5_2b_decode_int8hu_perchan_sym -p 128 -g 256 -n 3
  ```

  - `COREAI_CHUNK_THRESHOLD=1` **before engine creation**: prefill runs as pipelined S=1 steps
@@ -51,23 +60,23 @@ COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model qwen3_5_2b_decode_int8lin -p 128
  `--warmup exact --warmup-length 1`).
  - Benchmark **Release** builds only (a Debug engine measures ~3× slower).
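
If the benchmark is driven from a script rather than a shell, one way to guarantee the variable is set before the engine is created is to put it in the child process environment at launch. The wrapper below is a hypothetical sketch: `benchmark_invocation` is not part of any tool, and it only reuses the flags shown in the command above, without asserting what each flag means.

```python
import os
import shlex
import subprocess


def benchmark_invocation(model: str, p: int = 128, g: int = 256, n: int = 3):
    """Build the environment and argv for the README's llm-benchmark command."""
    # Copy the parent environment and add the variable, so the benchmark
    # process sees it from its very first instruction.
    env = dict(os.environ, COREAI_CHUNK_THRESHOLD="1")
    argv = ["llm-benchmark", "--model", model,
            "-p", str(p), "-g", str(g), "-n", str(n)]
    return env, argv


if __name__ == "__main__":
    env, argv = benchmark_invocation("qwen3_5_2b_decode_int8hu_perchan_sym")
    print("COREAI_CHUNK_THRESHOLD=1", shlex.join(argv))
    # Uncomment on a machine where a Release build of llm-benchmark is on PATH:
    # subprocess.run(argv, env=env, check=True)
```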

+ ## iPhone

+ The ship bundle decodes at 28–30 tok/s on iPhone 17 Pro with exact numerics. Know before you ship:

  - Requires the **`com.apple.developer.kernel.increased-memory-limit`** entitlement: cold GPU
  specialization dies with `std::bad_alloc` at the default jetsam limit without it.
+ - Cold specialization takes 22.3 s (then ~5.6 s warm loads, content-keyed cache). Keep **≥4 GB free
+ disk**: the spec cache is ~3 GB, and a failed cold spec leaves partial caches that make
  later attempts fail with `NSPOSIXErrorDomain code=2` at engine create; uninstall the app to
  reclaim.
+ - For smaller phones / tighter RAM, the
+ [0.8B pipelined bundle](https://huggingface.co/mlboydaisuke/qwen3.5-0.8B-CoreAI) does 50+
+ tok/s in 1 GB.

  ## Reproduce

  Conversion script (self-contained) + method page in the zoo:
  [`conversion/export_qwen3_5_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_qwen3_5_decode_pipelined.py)
+ (`int8hu --head-quant perchan --head-sym --hf-id Qwen/Qwen3.5-2B`) ·
  [`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)