---
license: apache-2.0
base_model: Qwen/Qwen3-VL-2B-Instruct
tags:
- coreai
- apple
- ios
- macos
- on-device
- vision-language
- vlm
- qwen3-vl
---
# Qwen3-VL 2B: Core AI (`.aimodel`)
**The first vision-language model on Apple's Core AI framework** (iOS 27 /
macOS 27): `Qwen/Qwen3-VL-2B-Instruct` converted to `.aimodel`, running
image+text → text fully on the GPU via Apple's `coreai-pipelined` engine,
with zero custom kernels.
Part of the [CoreAI-Model-Zoo](https://github.com/john-rocky/coreai-model-zoo);
full card with the conversion design:
[zoo/qwen3-vl.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/qwen3-vl.md).
![CoreAIChat Qwen3-VL demo](demo.gif)
## Measured
| platform | prefill tok/s | decode tok/s | numerics |
|---|---:|---:|---|
| M4 Max (macOS 27 beta) | **191.0** | **187.6** | full multimodal oracle gates vs fp32-HF PASS |
| iPhone 17 Pro (iOS 27 beta, settled) | **33.9** | **33.3** | text + image prompts 24/24 × 8 runs, token-identical to Mac (~92% of the naive BW ceiling) |
Vision encode: ~60-80 ms/image (Mac GPU). Device cold load 12.3 s
(on-device GPU specialization, no AOT), warm 0.6–5 s. The 2.3 GB decoder
wants the increased-memory entitlement on iPhone.
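
The "naive BW ceiling" in the table is not defined on this card. A common reading, assumed here, is that each decoded token streams every decoder weight from memory once, so the ceiling is memory bandwidth divided by weight size. The sketch below shows that arithmetic only; the function names are illustrative, and using the 2.3 GB bundle size as the bytes read per token is an assumption.

```python
def naive_decode_ceiling(bandwidth_gb_s, weights_gb):
    """Upper bound on decode tok/s if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


def implied_bandwidth(decode_tok_s, fraction_of_ceiling, weights_gb):
    """Memory bandwidth (GB/s) that a measured rate and ceiling fraction imply."""
    ceiling_tok_s = decode_tok_s / fraction_of_ceiling
    return ceiling_tok_s * weights_gb


# Inputs from the table above: 33.3 tok/s at ~92% of the ceiling, 2.3 GB decoder.
bandwidth = implied_bandwidth(33.3, 0.92, 2.3)
print(f"implied bandwidth: {bandwidth:.1f} GB/s")
print(f"ceiling: {naive_decode_ceiling(bandwidth, 2.3):.1f} tok/s")
```

The result is only as good as the assumption: KV-cache reads, activations, and the GB/GiB distinction all shift the real number.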
## Files
| path | what | size |
|---|---|---:|
| `gpu-pipelined/qwen3_vl_2b_instruct_decode_int8hu_s1/` | text decoder LanguageBundle (SHIP: int8 per-block-32 body + untied absmax int8 head; tokenizer + metadata included) | 2.3 GB |
| `gpu-pipelined/qwen3_vl_2b_instruct_vision/` | fixed-grid vision encoder (448×448 → 196 tokens + DeepStack), fp16 | 0.77 GB |
| `gpu-pipelined/qwen3_vl_2b_instruct_decode_int8lin_s1/` | decoder alt: tied fp16 head (slower, smaller-RAM-spike option) | 2.0 GB |
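
The 448×448 → 196 token figure for the vision encoder can be checked with a few lines of arithmetic. This sketch assumes a 16-pixel patch and a 2×2 spatial merge; neither value is stated on this card, they are chosen because they reproduce the 196 tokens listed above.

```python
def vision_token_count(image_size, patch_size=16, merge_size=2):
    """Tokens emitted for a square image after patching and spatial merging."""
    if image_size % (patch_size * merge_size) != 0:
        raise ValueError("image_size must be a multiple of patch_size * merge_size")
    patches_per_side = image_size // patch_size        # 448 // 16 = 28
    merged_per_side = patches_per_side // merge_size   # 28 // 2 = 14
    return merged_per_side * merged_per_side


print(vision_token_count(448))  # 196
```

Because the grid is fixed, the token count is a constant of the bundle rather than something computed per image.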
## How it works (short version)
The text-only pipelined engine carries the VLM through an id-space trick,
with no engine code changes beyond the published
[static-inputs patch](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps):
- the vision encoder runs once per image; its embeddings ride **4 static
graph inputs** (rewritable owned `MTLBuffer`s, ~3 MB),
- the prompt's `<|image_pad|>` ids become **extension ids `vocab + slot`**;
the graph selects text-table vs image-embed rows per token and applies
the three DeepStack adds the same way,
- **interleaved M-RoPE is derived in-graph from (ids, position) alone**:
image tokens self-locate, text tokens use a host-set shift; with zero
embeds the same bundle is a plain Qwen3 text LLM.
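
The id-space trick in the first two bullets can be illustrated in plain Python. This is a minimal sketch, not the graph the zoo exports: the function names, the toy vocabulary, and the pad id are hypothetical, and the DeepStack adds and M-RoPE derivation are left out. `static_input_bytes` reads the four static inputs as one image-embed buffer plus three DeepStack buffers and assumes fp16 elements and a hidden size of 2048, none of which this card states.

```python
def remap_image_pads(ids, image_pad_id, vocab):
    """Rewrite each image-pad id to the extension id vocab + slot (slot = 0, 1, ...)."""
    out, slot = [], 0
    for tok in ids:
        if tok == image_pad_id:
            out.append(vocab + slot)
            slot += 1
        else:
            out.append(tok)
    return out


def select_rows(ids, text_table, image_embeds, vocab):
    """Pick a text-table row for ids < vocab, an image-embed row otherwise."""
    return [
        text_table[tok] if tok < vocab else image_embeds[tok - vocab]
        for tok in ids
    ]


def static_input_bytes(slots, hidden, n_inputs=4, bytes_per_elem=2):
    """Total size of n_inputs buffers, each holding slots x hidden elements."""
    return slots * hidden * n_inputs * bytes_per_elem


# Toy sizes, not the real model: vocab of 10, hidden size 2, two image slots.
VOCAB, IMAGE_PAD = 10, 7
text_table = [[float(i), 0.0] for i in range(VOCAB)]
image_embeds = [[0.5, 0.5], [0.25, 0.75]]

ids = remap_image_pads([1, IMAGE_PAD, IMAGE_PAD, 2], IMAGE_PAD, VOCAB)
rows = select_rows(ids, text_table, image_embeds, VOCAB)
print(ids)   # [1, 10, 11, 2]
print(rows)  # [[1.0, 0.0], [0.5, 0.5], [0.25, 0.75], [2.0, 0.0]]
```

Under those assumptions, `static_input_bytes(196, 2048)` returns 3,211,264 bytes, which is in line with the ~3 MB quoted above.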
Numerics are gated the zoo way: fp32-HF oracle → torch ladder (position
formula exact vs `get_rope_index`, 28/28 layers) → `.aimodel` GPU gates →
engine ≡ python 24/24 → device 24/24.
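
A token-identical gate such as "engine ≡ python 24/24" reduces to comparing generated token ids prompt by prompt. The sketch below shows one way to express that check; the function name, prompt names, and token ids are made up for illustration and are not taken from the zoo's test harness.

```python
def token_identical_gate(reference, candidate):
    """Compare two runs keyed by prompt name; return (passed, total, failed names)."""
    if reference.keys() != candidate.keys():
        raise ValueError("both runs must cover the same prompts")
    failed = [name for name in reference if reference[name] != candidate[name]]
    return len(reference) - len(failed), len(reference), failed


# Hypothetical outputs: prompt name -> generated token ids.
python_run = {"text-01": [9, 4, 2], "image-01": [7, 7, 3]}
engine_run = {"text-01": [9, 4, 2], "image-01": [7, 7, 3]}

passed, total, failed = token_identical_gate(python_run, engine_run)
print(f"{passed}/{total}")  # 2/2
```

Exact equality is stricter than a logit tolerance: a single diverging token fails the prompt.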
## Run it
The zoo's `apps/CoreAIChat` (iOS) has a Qwen3-VL mode with a photo picker
and downloads this repo in-app. For the run contract (S=1 prefill,
`COREAI_CHUNK_THRESHOLD=1`, never `engine.warmup()`), see
[knowledge/pipelined-engine.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).
Conversion is reproducible from the zoo:
`conversion/export_qwen3_vl_pipelined.py int8hu`.
## License
Apache-2.0 (inherited from Qwen3-VL-2B-Instruct). Conversion code BSD-3-Clause
(zoo repo).