head-quant docs: per-block-32 absmax ship shape (per-channel = beta delegate bug; naming note)

4447ec6 verified 14 days ago

8.81 kB

	---
	license: apache-2.0
	base_model: Qwen/Qwen3.5-0.8B
	tags:
	- coreai
	- aimodel
	- apple-silicon
	- ane
	- on-device
	- qwen3.5
	- hybrid-ssm
	- gated-deltanet
	pipeline_tag: text-generation
	---

	# Qwen3.5-0.8B — Apple Core AI (`.aimodel`)

	Qwen3.5-0.8B converted to Apple's Core AI (the Core ML successor announced at WWDC26),
	ready to run on iOS 27 / macOS 27. A hybrid linear-attention model — 3 gated-delta (Mamba-style)
	layers per full-attention layer — running through Core AI's runtime, **greedy top-1 exact vs the
	Hugging Face reference**.

	This repo publishes one bundle per platform × compute-unit: the best verified configuration
	(plus the cross-platform `gpu-pipelined/` bundle) — each file is the exact artifact behind the
	published numbers, nothing experimental.

	> Requires the iOS 27 / macOS 27 beta (Core AI ships with the OS). Conversion code, knowledge
	> base, and the Swift runner: [coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo).

	## Pick your platform (measured: iPhone 17 Pro / M4 Max, greedy, top-1 exact vs HF)

	\| Category \| File \| Precision \| Size \| Speed \|
	\|---\|---\|---\|---\|---\|
	\| GPU pipelined ★★★ (iOS + macOS, NEW ship) \| `gpu-pipelined/qwen3_5_0_8b_decode_int8hu_perchan_sym/` — full bundle (`.aimodel` + tokenizer + metadata) \| int8 linear per-block-32 + per-block-32 absmax int8 lm_head (untied; the dir name says `perchan` for historical reasons — see note below) \| 1.3 GB \| 69.7–74.0 tok/s iPhone 17 Pro · 210 tok/s M4 Max \|
	\| GPU pipelined ★★ (iOS + macOS) \| `gpu-pipelined/qwen3_5_0_8b_decode_int8lin/` — full bundle (`.aimodel` + tokenizer + metadata) \| int8 linear per-block-32 (no LUT), fp16 tied head, decode-only loop-free, dynamic KV \| 1.0 GB \| 50.3–51.5 tok/s iPhone 17 Pro · 204 tok/s M4 Max \|
	\| iOS GPU ★ \| `ios-gpu/qwen3_5_0_8b_ios_hc0_int8v3.aimodel` \| int8 fused Metal kernels (k-means LUT, fp32 accumulate) + GPU argmax head, static ctx-2048 \| 1.3 GB \| 42.5–45.4 tok/s decode \|
	\| iOS GPU ★ companion \| `ios-gpu/qwen3_5_0_8b_ios_hc_prefill_q16_b2048_int8.aimodel` \| chunked-prefill graph (q=16 blocks, int8 LUT) \| 1.0 GB \| 147 tok/s prefill (185-tok prompt: 4.2 s → 1.26 s) \|
	\| iOS GPU (previous) \| `ios-gpu/qwen3_5_0_8b_ios_hc0.aimodel` \| fp16, static ctx-2048 \| 1.4 GB \| 27.7 tok/s \|
	\| iOS ANE \| `ios-ane/qwen3_5_0_8b_decode_int8.aimodel` \| int8 k-means (fp16 embed), dynamic \| 969 MB \| 14.7 tok/s \|
	\| macOS GPU \| `macos/qwen3_5_0_8b_decode_int8.aimodel` \| same bundle as iOS ANE \| 969 MB \| 58.5 tok/s (release build) \|

	- The ★★★ ship bundle adds an untied lm_head quantized as per-block-32 absmax int8
	(`int8hu --head-sym`): the fp16 head was 54% of the per-token weight
	read on the bandwidth-bound phone — quantizing it is +40% on iPhone (and +3% on M4 Max).
	Quantize big-vocab heads with plain absmax `symmetric`; the default
	`symmetric_with_clipping` clips outlier head rows and corrupts top-1s. Greedy rollouts are
	token-identical to the ★★ bundle; same run contract.
	Naming note (2026-06-11): the directory is named `_perchan_sym`, but its head is
	per-block-32 — the export script of the day parsed the granularity flag without applying
	it (since fixed). The numbers above were measured on exactly these bytes and stand.
	Genuinely per-channel (axis-0) int8 weights turned out to be **broken on the current beta
	GPU delegate** (garbage logits — delegate lowering bug, minimal repro in the zoo), so
	per-block-32 + `symmetric` IS the correct ship shape, not a stand-in. The dir name is kept
	to avoid breaking download paths.
	- The ★★ pipelined bundle is the fastest decode on BOTH platforms, with zero custom
	kernels: a decode-only loop-free graph (static `[1,1]` query, dynamic KV) that rides Apple's
	`coreai-pipelined` engine (`CoreAILanguageModels` / `EngineFactory` — async non-blocking
	encode, on-GPU argmax sampling, on-device KV growth) instead of a per-token run loop.
	Token-for-token == the fp16-GPU sequence; 16/16 single-step top-1 vs the fp32 HF oracle.
	It needs two things from the [zoo](https://github.com/john-rocky/coreai-model-zoo):
	the [engine extra-states patch](https://github.com/john-rocky/coreai-model-zoo/blob/main/apps/coreai-pipelined-extra-states.patch)
	(the stock engine carries exactly 2 states; the SSM conv/rec states ride as fixed-shape
	extras) and `COREAI_CHUNK_THRESHOLD=1` at run time (prefill = pipelined S=1 steps ≈ decode
	speed — so for LONG prompts the ★ static pair below still wins time-to-first-token).
	Export: [conversion/export_qwen3_5_decode_pipelined.py](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_qwen3_5_decode_pipelined.py).
	- The ★ int8 fused-kernel monolith is the custom-kernel static config (~3× dynamic, ~1.6×
	the fp16 static path; the current app-release config): the device GPU is
	weight-bandwidth-bound, so fused dequant-in-matvec Metal kernels (embedded in the `.aimodel`
	— 100% Core AI, WWDC26 session 325) halve the per-token weight stream; the 248320-token tied
	head runs as a fused matvec + two-level GPU argmax (greedy). Pair it with the **prefill
	companion**: the prompt is consumed 16 tokens per pass (in-graph unrolled SSM scan, fp32
	recurrence; full blocks only, remainder + generation on the decode graph). Decode output is
	byte-identical with and without it.
	- The static monoliths are GPU-only — the fp32-SSM form does not produce correct output on
	the ANE on current betas (and custom Metal kernels are GPU-only); use the ANE bundle there.
	- The dynamic int8 bundle (one graph, prefill+decode, 4 states
	`keyCache/valueCache/convState/recState`) is the proven Neural-Engine path — and the same file
	is the best macOS config (the `ios-ane/` and `macos/` files are identical content; pick by
	folder for clarity).
	- The SSM `while_loop` doesn't lower on device delegates — these bundles use the **loop-free
	single-step decode** (bit-identical at query_len=1; the prefill graph unrolls the same scan
	16× with the state held fp32). Story + gotchas:
	[knowledge base](https://github.com/john-rocky/coreai-model-zoo/tree/main/knowledge).
	- int8 in the static/dynamic bundles is k-means palettization, gated 8/8 vs the HF oracle in
	PyTorch before export; the pipelined bundle uses linear per-block int8 (scale-multiply
	dequant — 256-entry k-means LUTs are slow on the GPU delegate, 204 vs 113 tok/s on M4 Max).
	int4 does not survive on this model (head/MLP/SSM all degrade; k-means g8 with int8
	rescue layers also fails the oracle gate; unlike Gemma 4).

	## Run it (pipelined, Swift, macOS 27)

	```bash
	git clone https://github.com/apple/coreai-models && cd coreai-models
	git apply <(curl -sL https://github.com/john-rocky/coreai-model-zoo/raw/main/apps/coreai-pipelined-extra-states.patch)
	COREAI_CHUNK_THRESHOLD=1 swift run -c release llm-benchmark \
	--model <path-to>/gpu-pipelined/qwen3_5_0_8b_decode_int8lin -p 128 -g 256 -n 3
	```

	In an app, load the bundle via `LanguageBundle` + `EngineFactory.createEngine` (set
	`COREAI_CHUNK_THRESHOLD=1` before engine creation; never call `warmup()` — it warms shape 256,
	the S=1 graph rejects it; a 1-token generate is the warmup).

	## Run it (Python, macOS 27)

	```python
	import coreai.runtime as rt
	model = await rt.AIModel.load(Path("qwen3_5_0_8b_decode_int8.aimodel"),
	rt.SpecializationOptions.from_preferred_compute_unit_kind(rt.ComputeUnitKind.gpu()))
	fn = model.load_function("main")
	out = await fn({"input_ids": rt.NDArray(ids), "position_ids": rt.NDArray(pos)}, state=state)
	```

	On device, push the bundle into your app sandbox
	(`xcrun devicectl device copy to --domain-type appDataContainer`) — see the
	[Swift runtime notes](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/swift-runtime.md).
	Tokenizer: use the original [Qwen/Qwen3.5-0.8B](https://huggingface.co/Qwen/Qwen3.5-0.8B) tokenizer
	(swift-transformers loads it directly).

	## Parity

	Greedy decode matches the HF eager reference 8/8 tokens, top-1 exact (prompt-level cosine
	0.9999+), verified on macOS conversion and re-verified end-to-end on the iPhone per compute unit.
	The int8-kernel monolith additionally passes the chained Mac-GPU greedy 8/8 vs the oracle, and the
	prefill companion passes both an oracle gate and a chunked-vs-q=1 parity gate (identical tokens).
	⚠️ Known beta issue affecting all Core AI LLMs (and how these bundles dodge it):
	[the KV-write bug page](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/coreai-beta-mpsgraph-kvwrite-bug.md).

	CoreML (iOS 18+) variant of this model: [qwen3.5-0.8B-CoreML](https://huggingface.co/mlboydaisuke/qwen3.5-0.8B-CoreML).