---
license: apache-2.0
base_model: Qwen/Qwen3-VL-8B-Instruct
tags:
- coreai
- apple
- macos
- on-device
- vision-language
- vlm
- qwen3-vl
---
# Qwen3-VL 8B: Core AI (`.aimodel`)
`Qwen/Qwen3-VL-8B-Instruct` converted to Apple **Core AI** (`.aimodel`,
macOS 27): image+text → text, fully on the GPU via Apple's `coreai-pipelined`
engine, with zero custom kernels. This is the 8B sibling of the
[Qwen3-VL 2B](https://huggingface.co/mlboydaisuke/Qwen3-VL-2B-CoreAI) port:
**same recipe**, plus a one-line loader change for its **untied** LM head.
> **Mac-only.** The 8.7 GB int8hu decoder exceeds the iPhone increased-memory
> jetsam ceiling (~6.4 GB class). For on-device iPhone use, see the
> [4B](https://huggingface.co/mlboydaisuke/Qwen3-VL-4B-CoreAI) or
> [2B](https://huggingface.co/mlboydaisuke/Qwen3-VL-2B-CoreAI) ports.
Part of the [CoreAI-Model-Zoo](https://github.com/john-rocky/coreai-model-zoo);
full card with the conversion design:
[zoo/qwen3-vl.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/qwen3-vl.md).
## Measured
| platform | prefill tok/s | decode tok/s | numerics |
|---|---:|---:|---|
| M4 Max (macOS 27 beta) | **54.4** | **54.3** | torch ladder vs fp32-HF incl. untied head + depth-27 ViT (vision cos 1.0001, 36/36 layers cos 1.000, decode 16/16) + engine ≡ python 24/24 on the 211-tok multimodal prompt |
Decode is bandwidth-bound: each generated token streams the full 8.7 GB
int8hu decoder from memory. Vision encode runs once per image. Cold GPU
specialization takes ~16.5 s; a warm load takes a few seconds.
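
The bandwidth-bound claim can be sanity-checked with back-of-the-envelope arithmetic. The sketch below uses only the two figures from this card (8.7 GB decoder, 54.3 tok/s decode); it assumes each token reads every weight exactly once and ignores KV-cache and activation traffic. The function names are illustrative, not part of any tool.

```python
# Figures taken from this card.
DECODER_GB = 8.7         # int8hu decoder size, assumed read once per token
DECODE_TOK_PER_S = 54.3  # measured decode rate on M4 Max

def implied_bandwidth_gb_s(weights_gb: float, tok_per_s: float) -> float:
    """Weight traffic per second if every token reads all weights once."""
    return weights_gb * tok_per_s

def decode_ceiling_tok_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on decode rate for a given sustained memory bandwidth."""
    return bandwidth_gb_s / weights_gb

bw = implied_bandwidth_gb_s(DECODER_GB, DECODE_TOK_PER_S)
print(f"implied weight traffic: {bw:.0f} GB/s")  # 472 GB/s
```

Comparing that figure with the sustained memory bandwidth of the target machine shows how close decode runs to the ceiling; `decode_ceiling_tok_s` gives the same bound from the other direction.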
## Files
| path | what | size |
|---|---|---:|
| `gpu-pipelined/qwen3_vl_8b_instruct_decode_int8hu_s1/` | text decoder LanguageBundle (SHIP: int8 per-block-32 body + untied absmax int8 head; tokenizer + metadata included) | 8.7 GB |
| `gpu-pipelined/qwen3_vl_8b_instruct_vision/` | fixed-grid vision encoder (448×448 → 196 tokens + DeepStack), fp16 | 1.1 GB |
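
The 196-token figure for the fixed 448×448 grid is consistent with a 16-pixel patch followed by a 2×2 spatial merge. Both values are assumptions here (this card does not state them), so check them against the model's vision config. A minimal sketch of the arithmetic:

```python
def vision_tokens(image_side: int, patch: int = 16, merge: int = 2) -> int:
    """Tokens for a square image: patches per side, then merge x merge pooling.

    patch=16 and merge=2 are assumed defaults, not values stated in this card.
    """
    if image_side % (patch * merge) != 0:
        raise ValueError("image side must be a multiple of patch * merge")
    per_side = image_side // patch // merge
    return per_side * per_side

print(vision_tokens(448))  # 14 * 14 = 196
```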
## How it works (short version)
The text-only pipelined engine carries the VLM through an id-space trick, with
no engine code changes beyond the published
[static-inputs patch](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps):
- the vision encoder runs once per image; its embeddings ride **4 static
graph inputs** (rewritable owned `MTLBuffer`s),
- the prompt's `<|image_pad|>` ids become **extension ids `vocab + slot`**;
the graph selects text-table vs image-embed rows per token and applies the
three DeepStack adds the same way,
- **interleaved M-RoPE is derived in-graph from (ids, position) alone**:
image tokens self-locate and text tokens use a host-set shift. With zero
embeds, the same bundle is a plain Qwen3 text LLM.
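
The extension-id idea from the second bullet can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the zoo's graph code: the function names, the tiny vocabulary, and the `IMAGE_PAD_ID` value are hypothetical, and the DeepStack adds and M-RoPE are left out.

```python
VOCAB_SIZE = 6    # hypothetical tiny vocabulary (the real one is far larger)
IMAGE_PAD_ID = 5  # hypothetical id standing in for <|image_pad|>

def to_extension_ids(ids):
    """Rewrite each image-pad id to vocab + slot, slots counted in prompt order."""
    out, slot = [], 0
    for tok in ids:
        if tok == IMAGE_PAD_ID:
            out.append(VOCAB_SIZE + slot)
            slot += 1
        else:
            out.append(tok)
    return out

def embed(ext_ids, text_table, image_embeds):
    """Use the text-table row for ids < vocab, else the image row at id - vocab."""
    return [
        image_embeds[tok - VOCAB_SIZE] if tok >= VOCAB_SIZE else text_table[tok]
        for tok in ext_ids
    ]

text_table = [[float(i)] * 2 for i in range(VOCAB_SIZE)]  # row i = [i, i]
image_embeds = [[100.0, 100.0], [101.0, 101.0]]           # two image slots

ext = to_extension_ids([1, 5, 5, 2])
print(ext)                                   # [1, 6, 7, 2]
print(embed(ext, text_table, image_embeds))
```

A prompt with no image-pad ids never leaves the text table, which matches the note that the bundle then behaves as a plain text LLM.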
The 8B differs from 2B/4B only in configuration: its LM head is **untied**
(a separate `lm_head.weight`, quantized int8 absmax like the body) and its
ViT is larger (depth 27, hidden 1152). Both are absorbed by the config-driven
overlay. Numerics are gated the zoo way: fp32-HF oracle → torch ladder
(36/36 layers) → engine ≡ python 24/24.
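
As a rough illustration of absmax int8 quantization over 32-element blocks, the sketch below quantizes one weight row in plain Python. It assumes symmetric scaling (scale = max |w| / 127 per block) and round-to-nearest. The exact int8hu layout used by the exporter is not described in this card, and the function names are hypothetical.

```python
BLOCK = 32  # block size, from the "per-block-32" description

def quantize_absmax(row, block=BLOCK):
    """Return (int8 codes, per-block scales) using symmetric absmax scaling."""
    codes, scales = [], []
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        amax = max(abs(w) for w in chunk)
        scale = amax / 127.0 if amax > 0 else 1.0  # all-zero block: any scale works
        scales.append(scale)
        codes.extend(max(-127, min(127, round(w / scale))) for w in chunk)
    return codes, scales

def dequantize(codes, scales, block=BLOCK):
    """Rebuild approximate weights: code times the scale of its block."""
    return [c * scales[i // block] for i, c in enumerate(codes)]

row = [((i * 37) % 64 - 32) / 64.0 for i in range(64)]  # deterministic toy weights
codes, scales = quantize_absmax(row)
restored = dequantize(codes, scales)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(len(scales), max_err)
```

Under these assumptions the round-trip error per weight is bounded by half the block's scale, which is why smaller blocks track outlier-heavy rows more closely at the cost of storing more scales.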
## Run it
Conversion is reproducible from the zoo:
`conversion/export_qwen3_vl_pipelined.py int8hu --hf-id Qwen/Qwen3-VL-8B-Instruct`.
For the run contract (S=1 prefill, `COREAI_CHUNK_THRESHOLD=1`), see
[knowledge/pipelined-engine.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).
## License
Apache-2.0 (inherited from Qwen3-VL-8B-Instruct). The conversion code in the
zoo repo is BSD-3-Clause.