Upload README.md with huggingface_hub
README.md
CHANGED
@@ -16,24 +16,33 @@ Apple **Core AI** for iOS 27 / macOS 27 (beta), riding Apple's **`coreai-pipelin
 GPU engine** via the decode-only loop-free export: async encode, on-GPU argmax
 sampling, on-device KV growth, zero custom kernels.

-| surface | prefill (S=1) | decode |
 |---|---:|---:|
-| **M4 Max** (release `llm-benchmark`, p=128 g=256) |
-| iPhone 17 Pro (one-shot runner) |

 Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded
-decode step** (the [zoo](https://github.com/john-rocky/coreai-model-zoo) ship gate),
-

-##

-`gpu-pipelined/
-
-
-
-

-

 Needs the engine patch stack from the
 [zoo](https://github.com/john-rocky/coreai-model-zoo) (`apps/coreai-shared-product.patch` -
@@ -41,7 +50,7 @@ Needs the engine patch stack from the
 as patches), then:

 ```bash
-COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model
 ```

 - `COREAI_CHUNK_THRESHOLD=1` **before engine creation**: prefill runs as pipelined S=1 steps
@@ -51,23 +60,23 @@ COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model qwen3_5_2b_decode_int8lin -p 128
 `--warmup exact --warmup-length 1`).
 - Benchmark **Release** builds only (a Debug engine measures ~3× slow).

-## iPhone

-The

 - Requires the **`com.apple.developer.kernel.increased-memory-limit`** entitlement: cold GPU
 specialization dies with `std::bad_alloc` at the default jetsam limit without it.
-- Cold specialization
-disk**: the spec cache is ~3
 later attempts fail with `NSPOSIXErrorDomain code=2` at engine create; uninstall the app to
 reclaim.
--
-
-

 ## Reproduce

 Conversion script (self-contained) + method page in the zoo:
 [`conversion/export_qwen3_5_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_qwen3_5_decode_pipelined.py)
-(`
 [`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)
@@ -16,24 +16,33 @@ Apple **Core AI** for iOS 27 / macOS 27 (beta), riding Apple's **`coreai-pipelin
 GPU engine** via the decode-only loop-free export: async encode, on-GPU argmax
 sampling, on-device KV growth, zero custom kernels.

+| surface (ship bundle) | prefill (S=1) | decode |
 |---|---:|---:|
+| **M4 Max** (release `llm-benchmark`, p=128 g=256) | 161.2 | **160.8 tok/s** |
+| **iPhone 17 Pro** (one-shot runner, 2 runs × 2 trials) | 29.7–30.3 | **28–30 tok/s**, ≥ the CoreML qwen3.5-2B port (~27) |

 Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded
+decode step** (the [zoo](https://github.com/john-rocky/coreai-model-zoo) ship gate), greedy
+rollouts token-identical to the fp16-head bundle, and the iPhone sequences are **24/24
+token-identical to the Mac GPU** on both fixed prompts.
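The teacher-forced top-1 gate in the Numerics note can be sketched roughly as below. This is an illustration only, not the zoo's gate script: `argmax`, `top1_agreement`, and the random stand-in logits are hypothetical names and data. A real check would feed the same reference tokens to the fp32 HF model and to the exported bundle at every step.

```python
import random


def argmax(row):
    """Index of the largest logit in one row."""
    return max(range(len(row)), key=row.__getitem__)


def top1_agreement(oracle_logits, candidate_logits):
    """Count steps where the candidate's top-1 token equals the oracle's.

    Each argument is a list of per-step logit rows. Under teacher forcing,
    every step is conditioned on the same reference prefix, so one early
    mismatch cannot cascade into the later steps.
    """
    matches = sum(
        argmax(o) == argmax(c) for o, c in zip(oracle_logits, candidate_logits)
    )
    return matches, len(oracle_logits)


# Stand-in data (hypothetical): 16 steps over a small vocabulary, with tiny
# perturbations playing the role of quantization error.
random.seed(0)
oracle = [[random.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(16)]
candidate = [[x + random.gauss(0.0, 1e-4) for x in row] for row in oracle]

matches, total = top1_agreement(oracle, candidate)
print(f"{matches}/{total} top-1 matches")
```

A gate built this way passes only when `matches == total`, which is how a "16/16" result reads.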

+## Bundles

+- **`gpu-pipelined/qwen3_5_2b_decode_int8hu_perchan_sym/`, the ship config (2.9 GB)**:
+transformer int8 linear per-block-32 + **untied lm_head in per-channel absmax int8**
+(`int8hu --head-quant perchan --head-sym`). The head trick is what unlocks the speed: the
+248 K-vocab fp16 head was ~1.0 GB of the ~2.4 GB per-token read. Crucial detail: the head
+must be quantized with plain **absmax `symmetric`**; the default
+`symmetric_with_clipping` clips outlier head rows and flips oracle top-1s (full story in
+the zoo's [pipelined-engine notes](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)).
+- `gpu-pipelined/qwen3_5_2b_decode_int8lin/`, the fp16-head variant (2.4 GB): 127 tok/s Mac /
+19–21 iPhone. Smaller; keep if you want the head at full precision.
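To see why clipping can flip a top-1 while plain absmax does not, here is a toy sketch of per-channel (one scale per vocabulary row) symmetric int8 quantization. It is not the converter's code: the function names, the tiny 2×4 head, and the fixed clip threshold are all assumptions, and the real `symmetric_with_clipping` mode may choose its threshold differently.

```python
def quantize_row_absmax(row):
    """Symmetric int8 with scale = max|w| / 127: the largest weight maps to +/-127."""
    absmax = max(abs(w) for w in row)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def quantize_row_clipped(row, clip):
    """Symmetric int8 with a clip threshold: weights beyond +/-clip saturate."""
    scale = clip / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def head_logits(head, hidden, quantize):
    """Quantize each head row, dequantize it, and take its dot product with hidden."""
    logits = []
    for row in head:
        q, scale = quantize(row)
        weights = [v * scale for v in q]
        logits.append(sum(w * h for w, h in zip(weights, hidden)))
    return logits


def top1(logits):
    return max(range(len(logits)), key=logits.__getitem__)


# Toy head (hypothetical): token 0 wins because of a single outlier weight.
head = [
    [4.0, 0.1, 0.1, 0.1],  # outlier row
    [0.9, 0.9, 0.9, 0.9],  # ordinary row
]
hidden = [1.0, 1.0, 1.0, 1.0]

fp_top1 = top1([sum(w * h for w, h in zip(row, hidden)) for row in head])
absmax_top1 = top1(head_logits(head, hidden, quantize_row_absmax))
clipped_top1 = top1(
    head_logits(head, hidden, lambda r: quantize_row_clipped(r, clip=1.0))
)
print(fp_top1, absmax_top1, clipped_top1)
```

In this toy case absmax keeps the outlier at full magnitude, so the winning token is unchanged, while the clipped variant saturates it and the ordinary row takes over. That is the failure mode the bullet above describes for outlier head rows.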

+Both are full LanguageBundles (`metadata.json` + `tokenizer/` + `.aimodel`), `input_ids`
+STATIC `[1,1]` (loop-free single-step GDN), position_ids + KV seq dynamic → `EngineFactory`
+classifies them dynamic → pipelined engine.
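The head-size figures quoted in the Bundles list can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a hidden size of 2048, which the README does not state, and assumes decode speed scales with bytes read per token. Treat it as a plausibility check, not a performance model.

```python
# Figures quoted in the README.
VOCAB = 248_000            # "248 K-vocab"
PER_TOKEN_READ_GB = 2.4    # fp16-head bundle, approximate read per decoded token
FP16_HEAD_TOK_S = 127.0    # Mac decode, fp16-head bundle
INT8_HEAD_TOK_S = 160.8    # Mac decode, ship bundle

# Assumption (not in the README): hidden size of the 2B model.
HIDDEN = 2048

GB = 1e9
fp16_head_gb = VOCAB * HIDDEN * 2 / GB  # 2 bytes per fp16 weight
int8_head_gb = VOCAB * HIDDEN * 1 / GB  # 1 byte per int8 weight; per-row scales ignored

# If decode is bandwidth-bound, speed scales inversely with bytes read per token.
new_read_gb = PER_TOKEN_READ_GB - fp16_head_gb + int8_head_gb
predicted_speedup = PER_TOKEN_READ_GB / new_read_gb
measured_speedup = INT8_HEAD_TOK_S / FP16_HEAD_TOK_S

print(f"fp16 head: {fp16_head_gb:.2f} GB, int8 head: {int8_head_gb:.2f} GB")
print(f"predicted speedup: {predicted_speedup:.2f}x, measured: {measured_speedup:.2f}x")
```

Under these assumptions the fp16 head comes out at about 1.0 GB, matching the README, and the predicted and measured Mac speedups both land near 1.27×.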
+
+## Run (macOS)

 Needs the engine patch stack from the
 [zoo](https://github.com/john-rocky/coreai-model-zoo) (`apps/coreai-shared-product.patch` -
@@ -41,7 +50,7 @@ Needs the engine patch stack from the
 as patches), then:

 ```bash
+COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model qwen3_5_2b_decode_int8hu_perchan_sym -p 128 -g 256 -n 3
 ```

 - `COREAI_CHUNK_THRESHOLD=1` **before engine creation**: prefill runs as pipelined S=1 steps
@@ -51,23 +60,23 @@ COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model qwen3_5_2b_decode_int8lin -p 128
 `--warmup exact --warmup-length 1`).
 - Benchmark **Release** builds only (a Debug engine measures ~3× slow).
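When the benchmark is driven from a script, the "before engine creation" requirement amounts to putting the variable in the child process environment before launch. A minimal sketch, assuming `llm-benchmark` is on `PATH`: the helper names are hypothetical, and the flags are only the ones shown in the command above.

```python
import os
import shutil
import subprocess


def build_benchmark_command(model, p=128, g=256, n=3):
    """Assemble the README's llm-benchmark invocation.

    p and g mirror the "p=128 g=256" setting in the results table. The README
    does not spell out what -n means, so it is passed through unchanged.
    """
    return ["llm-benchmark", "--model", model, "-p", str(p), "-g", str(g), "-n", str(n)]


def benchmark_env(base_env=None):
    """Copy the environment and set COREAI_CHUNK_THRESHOLD=1 for the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["COREAI_CHUNK_THRESHOLD"] = "1"
    return env


cmd = build_benchmark_command("qwen3_5_2b_decode_int8hu_perchan_sym")
env = benchmark_env()

if shutil.which(cmd[0]):
    # The variable is already in the environment when the process starts,
    # so it is visible before any engine is created.
    subprocess.run(cmd, env=env, check=True)
else:
    print("llm-benchmark not found on PATH; would run:", " ".join(cmd))
```

Setting the variable from inside an already-running process after the engine exists would presumably have no effect, which is why the environment is prepared up front.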

+## iPhone

+The ship bundle decodes 28–30 tok/s on iPhone 17 Pro with exact numerics. Know before you ship:

 - Requires the **`com.apple.developer.kernel.increased-memory-limit`** entitlement: cold GPU
 specialization dies with `std::bad_alloc` at the default jetsam limit without it.
+- Cold specialization 22.3 s (then ~5.6 s warm loads, content-keyed cache). Keep **≥4 GB free
+disk**: the spec cache is ~3 GB, and a failed cold spec leaves partial caches that make
 later attempts fail with `NSPOSIXErrorDomain code=2` at engine create; uninstall the app to
 reclaim.
+- For smaller phones / tighter RAM, the
+[0.8B pipelined bundle](https://huggingface.co/mlboydaisuke/qwen3.5-0.8B-CoreAI) does 50+
+tok/s in 1 GB.
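The entitlement named in the list above is a boolean key in the app's entitlements plist. As a sketch, the snippet below generates such a file body with Python's standard `plistlib`. The helper name is hypothetical, and in practice the capability is usually enabled through Xcode's signing settings, not written by hand.

```python
import plistlib

ENTITLEMENT_KEY = "com.apple.developer.kernel.increased-memory-limit"


def entitlements_plist(extra=None):
    """Return XML plist bytes with the increased-memory-limit entitlement set to true.

    `extra` lets a caller merge in other entitlements the app already declares.
    """
    entitlements = {ENTITLEMENT_KEY: True}
    entitlements.update(extra or {})
    return plistlib.dumps(entitlements)  # XML format is the default


data = entitlements_plist()
print(data.decode("utf-8"))
```

The output contains the key followed by `<true/>`, which is the shape the code-signing step expects for a boolean entitlement.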

 ## Reproduce

 Conversion script (self-contained) + method page in the zoo:
 [`conversion/export_qwen3_5_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_qwen3_5_decode_pipelined.py)
+(`int8hu --head-quant perchan --head-sym --hf-id Qwen/Qwen3.5-2B`) ·
 [`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)