mlboydaisuke
/

Qwen3-VL-8B-CoreAI

vision-language

Model card Files Files and versions

Qwen3-VL-8B-CoreAI / README.md

mlboydaisuke's picture

Upload README.md with huggingface_hub

a020061 verified 15 days ago

|

History Blame Contribute Delete

3.42 kB

	---
	license: apache-2.0
	base_model: Qwen/Qwen3-VL-8B-Instruct
	tags:
	- coreai
	- apple
	- macos
	- on-device
	- vision-language
	- vlm
	- qwen3-vl
	---

	# Qwen3-VL 8B — Core AI (`.aimodel`)

	`Qwen/Qwen3-VL-8B-Instruct` converted to Apple Core AI (`.aimodel`,
	macOS 27): image+text → text fully on the GPU via Apple's `coreai-pipelined`
	engine, zero custom kernels. The 8B sibling of the
	[Qwen3-VL 2B](https://huggingface.co/mlboydaisuke/Qwen3-VL-2B-CoreAI) port —
	same recipe, with one one-line loader change for its untied LM head.

	> Mac-only. The 8.7 GB int8hu decoder exceeds the iPhone increased-memory
	> jetsam ceiling (~6.4 GB class). For on-device iPhone use, see the
	> [4B](https://huggingface.co/mlboydaisuke/Qwen3-VL-4B-CoreAI) or
	> [2B](https://huggingface.co/mlboydaisuke/Qwen3-VL-2B-CoreAI) ports.

	Part of the [CoreAI-Model-Zoo](https://github.com/john-rocky/coreai-model-zoo);
	full card with the conversion design:
	[zoo/qwen3-vl.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/qwen3-vl.md).

	## Measured

	\| platform \| prefill tok/s \| decode tok/s \| numerics \|
	\|---\|---:\|---:\|---\|
	\| M4 Max (macOS 27 beta) \| 54.4 \| 54.3 \| torch ladder vs fp32-HF incl. untied head + depth-27 ViT (vision cos 1.0001, 36/36 layers cos 1.000, decode 16/16) + engine ≡ python 24/24 on the 211-tok multimodal prompt \|

	Decode is bandwidth-bound: the 8.7 GB int8hu decoder reads ~8.7 GB/token.
	Vision encode runs once per image. Cold GPU specialization ~16.5 s, warm load
	a few seconds.

	## Files

	\| path \| what \| size \|
	\|---\|---\|---:\|
	\| `gpu-pipelined/qwen3_vl_8b_instruct_decode_int8hu_s1/` \| text decoder LanguageBundle (SHIP: int8 per-block-32 body + untied absmax int8 head; tokenizer + metadata included) \| 8.7 GB \|
	\| `gpu-pipelined/qwen3_vl_8b_instruct_vision/` \| fixed-grid vision encoder (448×448 → 196 tokens + DeepStack), fp16 \| 1.1 GB \|

	## How it works (short version)

	The text-only pipelined engine carries the VLM through an id-space trick —
	no engine code changes beyond the published
	[static-inputs patch](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps):

	- the vision encoder runs once per image; its embeddings ride **4 static
	graph inputs** (rewritable owned `MTLBuffer`s),
	- the prompt's `<\|image_pad\|>` ids become extension ids `vocab + slot`;
	the graph selects text-table vs image-embed rows per token and applies the
	three DeepStack adds the same way,
	- interleaved M-RoPE is derived in-graph from (ids, position) alone —
	image tokens self-locate, text tokens use a host-set shift; with zero
	embeds the same bundle is a plain Qwen3 text LLM.

	The 8B differs from 2B/4B only in configuration: its LM head is untied
	(a separate `lm_head.weight`, quantized int8 absmax like the body) and its
	ViT is larger (depth 27, hidden 1152) — both absorbed by the config-driven
	overlay. Numerics are gated the zoo way: fp32-HF oracle → torch ladder
	(36/36 layers) → engine ≡ python 24/24.

	## Run it

	Conversion is reproducible from the zoo:
	`conversion/export_qwen3_vl_pipelined.py int8hu --hf-id Qwen/Qwen3-VL-8B-Instruct`.
	For the run contract (S=1 prefill, `COREAI_CHUNK_THRESHOLD=1`), see
	[knowledge/pipelined-engine.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).

	## License

	Apache-2.0 (inherited from Qwen3-VL-8B-Instruct). Conversion code BSD-3-Clause
	(zoo repo).