---
license: apache-2.0
base_model: Qwen/Qwen3-VL-4B-Instruct
tags:
- coreai
- apple
- ios
- macos
- on-device
- vision-language
- vlm
- qwen3-vl
---

# Qwen3-VL 4B — Core AI (`.aimodel`)

`Qwen/Qwen3-VL-4B-Instruct` converted to Apple **Core AI** (`.aimodel`, iOS 27 /
macOS 27): image+text → text fully on the GPU via Apple's `coreai-pipelined`
engine, zero custom kernels. The 4B sibling of the
[Qwen3-VL 2B](https://huggingface.co/mlboydaisuke/Qwen3-VL-2B-CoreAI) port — it
drops onto the **same recipe with zero code changes** (the model overlay and
exporter are fully config-driven).

Part of the [CoreAI-Model-Zoo](https://github.com/john-rocky/coreai-model-zoo);
full card with the conversion design:
[zoo/qwen3-vl.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/qwen3-vl.md).

## Measured

| platform | prefill tok/s | decode tok/s | numerics |
|---|---:|---:|---|
| M4 Max (macOS 27 beta) | **93.3** | **92.2** | torch ladder vs fp32-HF (positions exact, vision cos 1.000, 36/36 layers cos 1.000, decode 16/16) + engine ≡ python 24/24 on the 211-tok multimodal prompt |
| iPhone 17 Pro (iOS 27 beta) | 10–15 | **14.0 cool → ~8.5 sustained** | nat 24/24 + multimodal oracle 24/24 × 3 runs, token-identical to Mac |

Decode is bandwidth-bound: the 4.7 GB int8hu decoder reads ~4.7 GB/token, so
it runs at roughly half the 2B's rate. On iPhone the read is heavy enough to
**thermally throttle** — ~14 tok/s from a cool start, settling to ~8.5 under
sustained decode. Device cold load 52.7 s (on-device GPU specialization, no
AOT), warm 8–9 s; needs the increased-memory entitlement (4.7 GB class).

## Files

| path | what | size |
|---|---|---:|
| `gpu-pipelined/qwen3_vl_4b_instruct_decode_int8hu_s1/` | text decoder LanguageBundle (SHIP: int8 per-block-32 body + untied absmax int8 head; tokenizer + metadata included) | 4.7 GB |
| `gpu-pipelined/qwen3_vl_4b_instruct_vision/` | fixed-grid vision encoder (448×448 → 196 tokens + DeepStack), fp16 | 0.79 GB |

## How it works (short version)

The text-only pipelined engine carries the VLM through an id-space trick —
no engine code changes beyond the published
[static-inputs patch](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps):

- the vision encoder runs once per image; its embeddings ride **4 static
  graph inputs** (rewritable owned `MTLBuffer`s),
- the prompt's `<|image_pad|>` ids become **extension ids `vocab + slot`**;
  the graph selects text-table vs image-embed rows per token and applies the
  three DeepStack adds the same way,
- **interleaved M-RoPE is derived in-graph from (ids, position) alone** —
  image tokens self-locate, text tokens use a host-set shift; with zero
  embeds the same bundle is a plain Qwen3 text LLM.

Numerics are gated the zoo way: fp32-HF oracle → torch ladder (position
formula exact vs `get_rope_index`, 36/36 layers) → `.aimodel` GPU → engine ≡
python 24/24 → device 24/24.

## Run it

See the zoo's `apps/CoreAIChat` (iOS) Qwen3-VL mode and the run contract
(S=1 prefill, `COREAI_CHUNK_THRESHOLD=1`, never `engine.warmup()`) in
[knowledge/pipelined-engine.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).

Conversion is reproducible from the zoo:
`conversion/export_qwen3_vl_pipelined.py int8hu --hf-id Qwen/Qwen3-VL-4B-Instruct`.

## License

Apache-2.0 (inherited from Qwen3-VL-4B-Instruct). Conversion code BSD-3-Clause
(zoo repo).