head-quant docs: per-block-32 absmax ship shape (per-channel = beta delegate bug; naming note)
cab8e6b verified | license: apache-2.0 | |
| base_model: Qwen/Qwen3.5-2B | |
| tags: | |
| - apple | |
| - coreai | |
| - aimodel | |
| - on-device | |
| - qwen3.5 | |
| # Qwen3.5-2B β Apple Core AI (`.aimodel`) | |
| Qwen3.5-2B (GDN hybrid: 18 linear-attention + 6 full-attention layers) converted to | |
| Apple **Core AI** for iOS 27 / macOS 27 (beta), riding Apple's **`coreai-pipelined` | |
| GPU engine** via the decode-only loop-free export β async encode, on-GPU argmax | |
| sampling, on-device KV growth, zero custom kernels. | |
| | surface (ship bundle) | prefill (S=1) | decode | | |
| |---|---:|---:| | |
| | **M4 Max** (release `llm-benchmark`, p=128 g=256) | 161.2 | **160.8 tok/s** | | |
| | **iPhone 17 Pro** (one-shot runner, 2 runs Γ 2 trials) | 29.7β30.3 | **28β30 tok/s** β β₯ the CoreML qwen3.5-2B port (~27) | | |
| Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded | |
| decode step** (the [zoo](https://github.com/john-rocky/coreai-model-zoo) ship gate), greedy | |
| rollouts token-identical to the fp16-head bundle, and the iPhone sequences are **24/24 | |
| token-identical to the Mac GPU** on both fixed prompts. | |
| ## Bundles | |
| - **`gpu-pipelined/qwen3_5_2b_decode_int8hu_perchan_sym/` β the ship config (2.9 GB)**: | |
| transformer int8 linear per-block-32 + **untied lm_head in per-block-32 absmax int8** | |
| (`int8hu --head-sym`). The head trick is what unlocks the speed: the | |
| 248 K-vocab fp16 head was ~1.0 GB of the ~2.4 GB per-token read. Crucial detail: the head | |
| must be quantized with plain **absmax `symmetric`** β the default | |
| `symmetric_with_clipping` clips outlier head rows and flips oracle top-1s (full story in | |
| the zoo's [pipelined-engine notes](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)). | |
| **Naming note (2026-06-11):** the directory says `_perchan_sym`, but its head is | |
| per-block-32 β the export script of the day parsed the granularity flag without applying | |
| it (since fixed); byte-identical bundle sizes confirmed it. All numbers were measured on | |
| exactly these bytes and stand. Genuinely per-channel (axis-0) int8 weights are **broken | |
| on the current beta GPU delegate** (garbage logits), so per-block-32 + `symmetric` IS the | |
| correct ship shape. The dir name is kept to avoid breaking download paths. | |
| - `gpu-pipelined/qwen3_5_2b_decode_int8lin/` β fp16-head variant (2.4 GB): 127 tok/s Mac / | |
| 19β21 iPhone. Smaller; keep if you want the head at full precision. | |
| Both are full LanguageBundles (`metadata.json` + `tokenizer/` + `.aimodel`), `input_ids` | |
| STATIC `[1,1]` (loop-free single-step GDN), position_ids + KV seq dynamic β `EngineFactory` | |
| classifies them dynamic β pipelined engine. | |
| ## Run (macOS) | |
| Needs the engine patch stack from the | |
| [zoo](https://github.com/john-rocky/coreai-model-zoo) (`apps/coreai-shared-product.patch` β | |
| `apps/coreai-pipelined-extra-states.patch`; Apple's repo is issues-only, so capabilities ship | |
| as patches), then: | |
| ```bash | |
| COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model qwen3_5_2b_decode_int8hu_perchan_sym -p 128 -g 256 -n 3 | |
| ``` | |
| - `COREAI_CHUNK_THRESHOLD=1` **before engine creation** β prefill runs as pipelined S=1 steps | |
| (prompt tok/s β decode tok/s). | |
| - **Never call `engine.warmup()`** β it warms query length 256 and the static `[1,1]` graph | |
| rejects it. A 1-token generate after load is the warmup (`llm-runner` needs | |
| `--warmup exact --warmup-length 1`). | |
| - Benchmark **Release** builds only (a Debug engine measures ~3Γ slow). | |
| ## iPhone | |
| The ship bundle decodes 28β30 tok/s on iPhone 17 Pro with exact numerics. Know before you ship: | |
| - Requires the **`com.apple.developer.kernel.increased-memory-limit`** entitlement β cold GPU | |
| specialization dies with `std::bad_alloc` at the default jetsam limit without it. | |
| - Cold specialization 22.3 s (then ~5.6 s warm loads, content-keyed cache). Keep **β₯4 GB free | |
| disk**: the spec cache is ~3 GB, and a failed cold spec leaves partial caches that make | |
| later attempts fail with `NSPOSIXErrorDomain code=2` at engine create β uninstall the app to | |
| reclaim. | |
| - For smaller phones / tighter RAM, the | |
| [0.8B pipelined bundle](https://huggingface.co/mlboydaisuke/qwen3.5-0.8B-CoreAI) does 50+ | |
| tok/s in 1 GB. | |
| ## Reproduce | |
| Conversion script (self-contained) + method page in the zoo: | |
| [`conversion/export_qwen3_5_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_qwen3_5_decode_pipelined.py) | |
| (`int8hu --head-sym --hf-id Qwen/Qwen3.5-2B`) Β· | |
| [`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md) | |