| --- |
| license: apache-2.0 |
| base_model: Qwen/Qwen3-VL-4B-Instruct |
| tags: |
| - coreai |
| - apple |
| - ios |
| - macos |
| - on-device |
| - vision-language |
| - vlm |
| - qwen3-vl |
| --- |
| |
| # Qwen3-VL 4B β Core AI (`.aimodel`) |
|
|
| `Qwen/Qwen3-VL-4B-Instruct` converted to Apple **Core AI** (`.aimodel`, iOS 27 / |
| macOS 27): image+text β text fully on the GPU via Apple's `coreai-pipelined` |
| engine, zero custom kernels. The 4B sibling of the |
| [Qwen3-VL 2B](https://huggingface.co/mlboydaisuke/Qwen3-VL-2B-CoreAI) port β it |
| drops onto the **same recipe with zero code changes** (the model overlay and |
| exporter are fully config-driven). |
|
|
| Part of the [CoreAI-Model-Zoo](https://github.com/john-rocky/coreai-model-zoo); |
| full card with the conversion design: |
| [zoo/qwen3-vl.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/qwen3-vl.md). |
|
|
| ## Measured |
|
|
| | platform | prefill tok/s | decode tok/s | numerics | |
| |---|---:|---:|---| |
| | M4 Max (macOS 27 beta) | **93.3** | **92.2** | torch ladder vs fp32-HF (positions exact, vision cos 1.000, 36/36 layers cos 1.000, decode 16/16) + engine β‘ python 24/24 on the 211-tok multimodal prompt | |
| | iPhone 17 Pro (iOS 27 beta) | 10β15 | **14.0 cool β ~8.5 sustained** | nat 24/24 + multimodal oracle 24/24 Γ 3 runs, token-identical to Mac | |
|
|
| Decode is bandwidth-bound: the 4.7 GB int8hu decoder reads ~4.7 GB/token, so |
| it runs at roughly half the 2B's rate. On iPhone the read is heavy enough to |
| **thermally throttle** β ~14 tok/s from a cool start, settling to ~8.5 under |
| sustained decode. Device cold load 52.7 s (on-device GPU specialization, no |
| AOT), warm 8β9 s; needs the increased-memory entitlement (4.7 GB class). |
|
|
| ## Files |
|
|
| | path | what | size | |
| |---|---|---:| |
| | `gpu-pipelined/qwen3_vl_4b_instruct_decode_int8hu_s1/` | text decoder LanguageBundle (SHIP: int8 per-block-32 body + untied absmax int8 head; tokenizer + metadata included) | 4.7 GB | |
| | `gpu-pipelined/qwen3_vl_4b_instruct_vision/` | fixed-grid vision encoder (448Γ448 β 196 tokens + DeepStack), fp16 | 0.79 GB | |
|
|
| ## How it works (short version) |
|
|
| The text-only pipelined engine carries the VLM through an id-space trick β |
| no engine code changes beyond the published |
| [static-inputs patch](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps): |
|
|
| - the vision encoder runs once per image; its embeddings ride **4 static |
| graph inputs** (rewritable owned `MTLBuffer`s), |
| - the prompt's `<|image_pad|>` ids become **extension ids `vocab + slot`**; |
| the graph selects text-table vs image-embed rows per token and applies the |
| three DeepStack adds the same way, |
| - **interleaved M-RoPE is derived in-graph from (ids, position) alone** β |
| image tokens self-locate, text tokens use a host-set shift; with zero |
| embeds the same bundle is a plain Qwen3 text LLM. |
|
|
| Numerics are gated the zoo way: fp32-HF oracle β torch ladder (position |
| formula exact vs `get_rope_index`, 36/36 layers) β `.aimodel` GPU β engine β‘ |
| python 24/24 β device 24/24. |
|
|
| ## Run it |
|
|
| See the zoo's `apps/CoreAIChat` (iOS) Qwen3-VL mode and the run contract |
| (S=1 prefill, `COREAI_CHUNK_THRESHOLD=1`, never `engine.warmup()`) in |
| [knowledge/pipelined-engine.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md). |
|
|
| Conversion is reproducible from the zoo: |
| `conversion/export_qwen3_vl_pipelined.py int8hu --hf-id Qwen/Qwen3-VL-4B-Instruct`. |
|
|
| ## License |
|
|
| Apache-2.0 (inherited from Qwen3-VL-4B-Instruct). Conversion code BSD-3-Clause |
| (zoo repo). |
|
|