head-quant docs: per-block-32 absmax ship shape (per-channel = beta delegate bug; naming note)

cab8e6b verified 14 days ago

4.57 kB

	---
	license: apache-2.0
	base_model: Qwen/Qwen3.5-2B
	tags:
	- apple
	- coreai
	- aimodel
	- on-device
	- qwen3.5
	---

	# Qwen3.5-2B — Apple Core AI (`.aimodel`)

	Qwen3.5-2B (GDN hybrid: 18 linear-attention + 6 full-attention layers) converted to
	Apple Core AI for iOS 27 / macOS 27 (beta), riding Apple's **`coreai-pipelined`
	GPU engine** via the decode-only loop-free export — async encode, on-GPU argmax
	sampling, on-device KV growth, zero custom kernels.

	\| surface (ship bundle) \| prefill (S=1) \| decode \|
	\|---\|---:\|---:\|
	\| M4 Max (release `llm-benchmark`, p=128 g=256) \| 161.2 \| 160.8 tok/s \|
	\| iPhone 17 Pro (one-shot runner, 2 runs × 2 trials) \| 29.7–30.3 \| 28–30 tok/s — ≥ the CoreML qwen3.5-2B port (~27) \|

	Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded
	decode step** (the [zoo](https://github.com/john-rocky/coreai-model-zoo) ship gate), greedy
	rollouts token-identical to the fp16-head bundle, and the iPhone sequences are **24/24
	token-identical to the Mac GPU** on both fixed prompts.

	## Bundles

	- `gpu-pipelined/qwen3_5_2b_decode_int8hu_perchan_sym/` — the ship config (2.9 GB):
	transformer int8 linear per-block-32 + untied lm_head in per-block-32 absmax int8
	(`int8hu --head-sym`). The head trick is what unlocks the speed: the
	248 K-vocab fp16 head was ~1.0 GB of the ~2.4 GB per-token read. Crucial detail: the head
	must be quantized with plain absmax `symmetric` — the default
	`symmetric_with_clipping` clips outlier head rows and flips oracle top-1s (full story in
	the zoo's [pipelined-engine notes](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)).
	Naming note (2026-06-11): the directory says `_perchan_sym`, but its head is
	per-block-32 — the export script of the day parsed the granularity flag without applying
	it (since fixed); byte-identical bundle sizes confirmed it. All numbers were measured on
	exactly these bytes and stand. Genuinely per-channel (axis-0) int8 weights are **broken
	on the current beta GPU delegate** (garbage logits), so per-block-32 + `symmetric` IS the
	correct ship shape. The dir name is kept to avoid breaking download paths.
	- `gpu-pipelined/qwen3_5_2b_decode_int8lin/` — fp16-head variant (2.4 GB): 127 tok/s Mac /
	19–21 iPhone. Smaller; keep if you want the head at full precision.

	Both are full LanguageBundles (`metadata.json` + `tokenizer/` + `.aimodel`), `input_ids`
	STATIC `[1,1]` (loop-free single-step GDN), position_ids + KV seq dynamic → `EngineFactory`
	classifies them dynamic → pipelined engine.

	## Run (macOS)

	Needs the engine patch stack from the
	[zoo](https://github.com/john-rocky/coreai-model-zoo) (`apps/coreai-shared-product.patch` →
	`apps/coreai-pipelined-extra-states.patch`; Apple's repo is issues-only, so capabilities ship
	as patches), then:

	```bash
	COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model qwen3_5_2b_decode_int8hu_perchan_sym -p 128 -g 256 -n 3
	```

	- `COREAI_CHUNK_THRESHOLD=1` before engine creation — prefill runs as pipelined S=1 steps
	(prompt tok/s ≈ decode tok/s).
	- Never call `engine.warmup()` — it warms query length 256 and the static `[1,1]` graph
	rejects it. A 1-token generate after load is the warmup (`llm-runner` needs
	`--warmup exact --warmup-length 1`).
	- Benchmark Release builds only (a Debug engine measures ~3× slow).

	## iPhone

	The ship bundle decodes 28–30 tok/s on iPhone 17 Pro with exact numerics. Know before you ship:

	- Requires the `com.apple.developer.kernel.increased-memory-limit` entitlement — cold GPU
	specialization dies with `std::bad_alloc` at the default jetsam limit without it.
	- Cold specialization 22.3 s (then ~5.6 s warm loads, content-keyed cache). Keep **≥4 GB free
	disk**: the spec cache is ~3 GB, and a failed cold spec leaves partial caches that make
	later attempts fail with `NSPOSIXErrorDomain code=2` at engine create — uninstall the app to
	reclaim.
	- For smaller phones / tighter RAM, the
	[0.8B pipelined bundle](https://huggingface.co/mlboydaisuke/qwen3.5-0.8B-CoreAI) does 50+
	tok/s in 1 GB.

	## Reproduce

	Conversion script (self-contained) + method page in the zoo:
	[`conversion/export_qwen3_5_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_qwen3_5_decode_pipelined.py)
	(`int8hu --head-sym --hf-id Qwen/Qwen3.5-2B`) ·
	[`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)