File size: 4,086 Bytes
17dfb84
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6cb3f9d
17dfb84
6cb3f9d
 
17dfb84
 
6cb3f9d
 
 
17dfb84
6cb3f9d
17dfb84
6cb3f9d
 
 
 
 
 
 
 
 
17dfb84
6cb3f9d
 
 
 
 
17dfb84
 
 
 
 
 
 
6cb3f9d
17dfb84
 
 
 
 
 
 
 
 
6cb3f9d
17dfb84
6cb3f9d
17dfb84
 
 
6cb3f9d
 
17dfb84
 
6cb3f9d
 
 
17dfb84
 
 
 
 
6cb3f9d
17dfb84
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
---
license: apache-2.0
base_model: Qwen/Qwen3.5-2B
tags:
  - apple
  - coreai
  - aimodel
  - on-device
  - qwen3.5
---

# Qwen3.5-2B β€” Apple Core AI (`.aimodel`)

Qwen3.5-2B (GDN hybrid: 18 linear-attention + 6 full-attention layers) converted to
Apple **Core AI** for iOS 27 / macOS 27 (beta), riding Apple's **`coreai-pipelined`
GPU engine** via the decode-only loop-free export β€” async encode, on-GPU argmax
sampling, on-device KV growth, zero custom kernels.

| surface (ship bundle) | prefill (S=1) | decode |
|---|---:|---:|
| **M4 Max** (release `llm-benchmark`, p=128 g=256) | 161.2 | **160.8 tok/s** |
| **iPhone 17 Pro** (one-shot runner, 2 runs Γ— 2 trials) | 29.7–30.3 | **28–30 tok/s** β€” β‰₯ the CoreML qwen3.5-2B port (~27) |

Numerics: **16/16 teacher-forced single-step top-1 vs the fp32 HF oracle + HF-cache-seeded
decode step** (the [zoo](https://github.com/john-rocky/coreai-model-zoo) ship gate), greedy
rollouts token-identical to the fp16-head bundle, and the iPhone sequences are **24/24
token-identical to the Mac GPU** on both fixed prompts.

## Bundles

- **`gpu-pipelined/qwen3_5_2b_decode_int8hu_perchan_sym/` β€” the ship config (2.9 GB)**:
  transformer int8 linear per-block-32 + **untied lm_head in per-channel absmax int8**
  (`int8hu --head-quant perchan --head-sym`). The head trick is what unlocks the speed: the
  248 K-vocab fp16 head was ~1.0 GB of the ~2.4 GB per-token read. Crucial detail: the head
  must be quantized with plain **absmax `symmetric`** β€” the default
  `symmetric_with_clipping` clips outlier head rows and flips oracle top-1s (full story in
  the zoo's [pipelined-engine notes](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)).
- `gpu-pipelined/qwen3_5_2b_decode_int8lin/` β€” fp16-head variant (2.4 GB): 127 tok/s Mac /
  19–21 iPhone. Smaller; keep if you want the head at full precision.

Both are full LanguageBundles (`metadata.json` + `tokenizer/` + `.aimodel`), `input_ids`
STATIC `[1,1]` (loop-free single-step GDN), position_ids + KV seq dynamic β†’ `EngineFactory`
classifies them dynamic β†’ pipelined engine.

## Run (macOS)

Needs the engine patch stack from the
[zoo](https://github.com/john-rocky/coreai-model-zoo) (`apps/coreai-shared-product.patch` β†’
`apps/coreai-pipelined-extra-states.patch`; Apple's repo is issues-only, so capabilities ship
as patches), then:

```bash
COREAI_CHUNK_THRESHOLD=1 llm-benchmark --model qwen3_5_2b_decode_int8hu_perchan_sym -p 128 -g 256 -n 3
```

- `COREAI_CHUNK_THRESHOLD=1` **before engine creation** β€” prefill runs as pipelined S=1 steps
  (prompt tok/s β‰ˆ decode tok/s).
- **Never call `engine.warmup()`** β€” it warms query length 256 and the static `[1,1]` graph
  rejects it. A 1-token generate after load is the warmup (`llm-runner` needs
  `--warmup exact --warmup-length 1`).
- Benchmark **Release** builds only (a Debug engine measures ~3Γ— slow).

## iPhone

The ship bundle decodes 28–30 tok/s on iPhone 17 Pro with exact numerics. Know before you ship:

- Requires the **`com.apple.developer.kernel.increased-memory-limit`** entitlement β€” cold GPU
  specialization dies with `std::bad_alloc` at the default jetsam limit without it.
- Cold specialization 22.3 s (then ~5.6 s warm loads, content-keyed cache). Keep **β‰₯4 GB free
  disk**: the spec cache is ~3 GB, and a failed cold spec leaves partial caches that make
  later attempts fail with `NSPOSIXErrorDomain code=2` at engine create β€” uninstall the app to
  reclaim.
- For smaller phones / tighter RAM, the
  [0.8B pipelined bundle](https://huggingface.co/mlboydaisuke/qwen3.5-0.8B-CoreAI) does 50+
  tok/s in 1 GB.

## Reproduce

Conversion script (self-contained) + method page in the zoo:
[`conversion/export_qwen3_5_decode_pipelined.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_qwen3_5_decode_pipelined.py)
(`int8hu --head-quant perchan --head-sym --hf-id Qwen/Qwen3.5-2B`) Β·
[`knowledge/pipelined-engine.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md)