head-quant docs: per-block-32 absmax ship shape (per-channel = beta delegate bug; naming note)
4447ec6 verified | license: apache-2.0 | |
| base_model: Qwen/Qwen3.5-0.8B | |
| tags: | |
| - coreai | |
| - aimodel | |
| - apple-silicon | |
| - ane | |
| - on-device | |
| - qwen3.5 | |
| - hybrid-ssm | |
| - gated-deltanet | |
| pipeline_tag: text-generation | |
| # Qwen3.5-0.8B β Apple Core AI (`.aimodel`) | |
| **Qwen3.5-0.8B converted to Apple's Core AI** (the Core ML successor announced at WWDC26), | |
| ready to run on iOS 27 / macOS 27. A hybrid linear-attention model β 3 gated-delta (Mamba-style) | |
| layers per full-attention layer β running through Core AI's runtime, **greedy top-1 exact vs the | |
| Hugging Face reference**. | |
| This repo publishes **one bundle per platform Γ compute-unit: the best verified configuration** | |
| (plus the cross-platform `gpu-pipelined/` bundle) β each file is the exact artifact behind the | |
| published numbers, nothing experimental. | |
| > Requires the iOS 27 / macOS 27 beta (Core AI ships with the OS). Conversion code, knowledge | |
| > base, and the Swift runner: **[coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo)**. | |
| ## Pick your platform (measured: iPhone 17 Pro / M4 Max, greedy, top-1 exact vs HF) | |
| | Category | File | Precision | Size | Speed | | |
| |---|---|---|---|---| | |
| | **GPU pipelined β β β ** (iOS + macOS, NEW ship) | `gpu-pipelined/qwen3_5_0_8b_decode_int8hu_perchan_sym/` β full bundle (`.aimodel` + tokenizer + metadata) | int8 linear per-block-32 + **per-block-32 absmax int8 lm_head** (untied; the dir name says `perchan` for historical reasons β see note below) | 1.3 GB | **69.7β74.0 tok/s** iPhone 17 Pro Β· **210 tok/s** M4 Max | | |
| | **GPU pipelined β β ** (iOS + macOS) | `gpu-pipelined/qwen3_5_0_8b_decode_int8lin/` β full bundle (`.aimodel` + tokenizer + metadata) | int8 linear per-block-32 (no LUT), fp16 tied head, decode-only loop-free, dynamic KV | 1.0 GB | **50.3β51.5 tok/s** iPhone 17 Pro Β· **204 tok/s** M4 Max | | |
| | **iOS GPU β ** | `ios-gpu/qwen3_5_0_8b_ios_hc0_int8v3.aimodel` | int8 fused Metal kernels (k-means LUT, fp32 accumulate) + GPU argmax head, static ctx-2048 | 1.3 GB | **42.5β45.4 tok/s** decode | | |
| | **iOS GPU β companion** | `ios-gpu/qwen3_5_0_8b_ios_hc_prefill_q16_b2048_int8.aimodel` | chunked-prefill graph (q=16 blocks, int8 LUT) | 1.0 GB | **147 tok/s prefill** (185-tok prompt: 4.2 s β 1.26 s) | | |
| | iOS GPU (previous) | `ios-gpu/qwen3_5_0_8b_ios_hc0.aimodel` | fp16, static ctx-2048 | 1.4 GB | 27.7 tok/s | | |
| | **iOS ANE** | `ios-ane/qwen3_5_0_8b_decode_int8.aimodel` | int8 k-means (fp16 embed), dynamic | 969 MB | **14.7 tok/s** | | |
| | **macOS GPU** | `macos/qwen3_5_0_8b_decode_int8.aimodel` | same bundle as iOS ANE | 969 MB | **58.5 tok/s** (release build) | | |
| - The **β β β ship bundle** adds an untied lm_head quantized as **per-block-32 absmax int8** | |
| (`int8hu --head-sym`): the fp16 head was 54% of the per-token weight | |
| read on the bandwidth-bound phone β quantizing it is +40% on iPhone (and +3% on M4 Max). | |
| Quantize big-vocab heads with plain absmax `symmetric`; the default | |
| `symmetric_with_clipping` clips outlier head rows and corrupts top-1s. Greedy rollouts are | |
| token-identical to the β β bundle; same run contract. | |
| **Naming note (2026-06-11):** the directory is named `_perchan_sym`, but its head is | |
| per-block-32 β the export script of the day parsed the granularity flag without applying | |
| it (since fixed). The numbers above were measured on exactly these bytes and stand. | |
| Genuinely per-channel (axis-0) int8 weights turned out to be **broken on the current beta | |
| GPU delegate** (garbage logits β delegate lowering bug, minimal repro in the zoo), so | |
| per-block-32 + `symmetric` IS the correct ship shape, not a stand-in. The dir name is kept | |
| to avoid breaking download paths. | |
| - The **β β pipelined bundle** is the fastest decode on BOTH platforms, with zero custom | |
| kernels: a decode-only loop-free graph (static `[1,1]` query, dynamic KV) that rides Apple's | |
| `coreai-pipelined` engine (`CoreAILanguageModels` / `EngineFactory` β async non-blocking | |
| encode, on-GPU argmax sampling, on-device KV growth) instead of a per-token run loop. | |
| Token-for-token == the fp16-GPU sequence; 16/16 single-step top-1 vs the fp32 HF oracle. | |
| It needs two things from the [zoo](https://github.com/john-rocky/coreai-model-zoo): | |
| the [engine extra-states patch](https://github.com/john-rocky/coreai-model-zoo/blob/main/apps/coreai-pipelined-extra-states.patch) | |
| (the stock engine carries exactly 2 states; the SSM conv/rec states ride as fixed-shape | |
| extras) and `COREAI_CHUNK_THRESHOLD=1` at run time (prefill = pipelined S=1 steps β decode | |
| speed β so for LONG prompts the β static pair below still wins time-to-first-token). | |
| Export: [conversion/export_qwen3_5_decode_pipelined.py](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_qwen3_5_decode_pipelined.py). | |
| - The **β int8 fused-kernel monolith** is the custom-kernel static config (~3Γ dynamic, ~1.6Γ | |
| the fp16 static path; the current app-release config): the device GPU is | |
| weight-bandwidth-bound, so fused dequant-in-matvec Metal kernels (embedded in the `.aimodel` | |
| β 100% Core AI, WWDC26 session 325) halve the per-token weight stream; the 248320-token tied | |
| head runs as a fused matvec + two-level **GPU argmax** (greedy). Pair it with the **prefill | |
| companion**: the prompt is consumed 16 tokens per pass (in-graph unrolled SSM scan, fp32 | |
| recurrence; full blocks only, remainder + generation on the decode graph). Decode output is | |
| byte-identical with and without it. | |
| - The static monoliths are **GPU-only** β the fp32-SSM form does not produce correct output on | |
| the ANE on current betas (and custom Metal kernels are GPU-only); use the ANE bundle there. | |
| - The **dynamic int8 bundle** (one graph, prefill+decode, 4 states | |
| `keyCache/valueCache/convState/recState`) is the proven Neural-Engine path β and the same file | |
| is the best macOS config (the `ios-ane/` and `macos/` files are identical content; pick by | |
| folder for clarity). | |
| - The SSM `while_loop` doesn't lower on device delegates β these bundles use the **loop-free | |
| single-step decode** (bit-identical at query_len=1; the prefill graph unrolls the same scan | |
| 16Γ with the state held fp32). Story + gotchas: | |
| [knowledge base](https://github.com/john-rocky/coreai-model-zoo/tree/main/knowledge). | |
| - int8 in the static/dynamic bundles is k-means palettization, gated 8/8 vs the HF oracle in | |
| PyTorch before export; the pipelined bundle uses **linear** per-block int8 (scale-multiply | |
| dequant β 256-entry k-means LUTs are slow on the GPU delegate, 204 vs 113 tok/s on M4 Max). | |
| **int4 does not survive on this model** (head/MLP/SSM all degrade; k-means g8 with int8 | |
| rescue layers also fails the oracle gate; unlike Gemma 4). | |
| ## Run it (pipelined, Swift, macOS 27) | |
| ```bash | |
| git clone https://github.com/apple/coreai-models && cd coreai-models | |
| git apply <(curl -sL https://github.com/john-rocky/coreai-model-zoo/raw/main/apps/coreai-pipelined-extra-states.patch) | |
| COREAI_CHUNK_THRESHOLD=1 swift run -c release llm-benchmark \ | |
| --model <path-to>/gpu-pipelined/qwen3_5_0_8b_decode_int8lin -p 128 -g 256 -n 3 | |
| ``` | |
| In an app, load the bundle via `LanguageBundle` + `EngineFactory.createEngine` (set | |
| `COREAI_CHUNK_THRESHOLD=1` before engine creation; never call `warmup()` β it warms shape 256, | |
| the S=1 graph rejects it; a 1-token generate is the warmup). | |
| ## Run it (Python, macOS 27) | |
| ```python | |
| import coreai.runtime as rt | |
| model = await rt.AIModel.load(Path("qwen3_5_0_8b_decode_int8.aimodel"), | |
| rt.SpecializationOptions.from_preferred_compute_unit_kind(rt.ComputeUnitKind.gpu())) | |
| fn = model.load_function("main") | |
| out = await fn({"input_ids": rt.NDArray(ids), "position_ids": rt.NDArray(pos)}, state=state) | |
| ``` | |
| On device, push the bundle into your app sandbox | |
| (`xcrun devicectl device copy to --domain-type appDataContainer`) β see the | |
| [Swift runtime notes](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/swift-runtime.md). | |
| Tokenizer: use the original [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) tokenizer | |
| (swift-transformers loads it directly). | |
| ## Parity | |
| Greedy decode matches the HF eager reference **8/8 tokens, top-1 exact** (prompt-level cosine | |
| 0.9999+), verified on macOS conversion and re-verified end-to-end on the iPhone per compute unit. | |
| The int8-kernel monolith additionally passes the chained Mac-GPU greedy 8/8 vs the oracle, and the | |
| prefill companion passes both an oracle gate and a chunked-vs-q=1 parity gate (identical tokens). | |
| β οΈ Known beta issue affecting all Core AI LLMs (and how these bundles dodge it): | |
| [the KV-write bug page](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/coreai-beta-mpsgraph-kvwrite-bug.md). | |
| CoreML (iOS 18+) variant of this model: [qwen3.5-0.8B-CoreML](https://huggingface.co/mlboydaisuke/qwen3.5-0.8B-CoreML). | |