Commit abc2d84 by mlboydaisuke · verified · 1 parent: a8c7035

Upload README.md with huggingface_hub

  1. README.md +79 -0
README.md ADDED
@@ -0,0 +1,79 @@
+ ---
+ license: apache-2.0
+ base_model: Qwen/Qwen3-VL-4B-Instruct
+ tags:
+ - coreai
+ - apple
+ - ios
+ - macos
+ - on-device
+ - vision-language
+ - vlm
+ - qwen3-vl
+ ---
+
+ # Qwen3-VL 4B - Core AI (`.aimodel`)
+
+ `Qwen/Qwen3-VL-4B-Instruct` converted to Apple **Core AI** (`.aimodel`, iOS 27 /
+ macOS 27): image+text → text, fully on the GPU via Apple's `coreai-pipelined`
+ engine with zero custom kernels. This is the 4B sibling of the
+ [Qwen3-VL 2B](https://huggingface.co/mlboydaisuke/Qwen3-VL-2B-CoreAI) port; it
+ drops onto the **same recipe with zero code changes** (the model overlay and
+ exporter are fully config-driven).
+
+ Part of the [CoreAI-Model-Zoo](https://github.com/john-rocky/coreai-model-zoo);
+ full card with the conversion design:
+ [zoo/qwen3-vl.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/qwen3-vl.md).
+
+ ## Measured
+
+ | platform | prefill tok/s | decode tok/s | numerics |
+ |---|---:|---:|---|
+ | M4 Max (macOS 27 beta) | **93.3** | **92.2** | torch ladder vs fp32-HF (positions exact, vision cos 1.000, 36/36 layers cos 1.000, decode 16/16) + engine ≡ python 24/24 on the 211-tok multimodal prompt |
+ | iPhone 17 Pro (iOS 27 beta) | 10–15 | **14.0 cool → ~8.5 sustained** | nat 24/24 + multimodal oracle 24/24 × 3 runs, token-identical to Mac |
+
+ Decode is bandwidth-bound: the 4.7 GB int8hu decoder reads ~4.7 GB/token, so
+ it runs at roughly half the 2B's rate. On iPhone the read is heavy enough to
+ **thermally throttle**: ~14 tok/s from a cool start, settling to ~8.5 under
+ sustained decode. Cold load on device takes 52.7 s (on-device GPU
+ specialization, no AOT) and warm load 8–9 s; the app needs the
+ increased-memory entitlement (4.7 GB class).
+
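The bandwidth-bound claim can be checked with back-of-envelope arithmetic: if every decoded token streams the whole decoder once, the effective weight-read bandwidth is roughly the decode rate times the decoder size. The sketch below applies that to the figures in the table above. The helper name is hypothetical, and the estimate ignores KV-cache and activation traffic, so it is a rough lower bound and not a measurement.

```python
DECODER_GB = 4.7  # int8hu decoder size quoted in this README


def implied_bandwidth_gbps(decode_tok_per_s: float, weights_gb: float = DECODER_GB) -> float:
    """Effective weight-read bandwidth (GB/s), assuming each token reads all weights once."""
    return decode_tok_per_s * weights_gb


# Decode rates taken from the "Measured" table.
measured = {
    "M4 Max": 92.2,
    "iPhone 17 Pro (cool)": 14.0,
    "iPhone 17 Pro (sustained)": 8.5,
}

for name, rate in measured.items():
    print(f"{name}: {rate} tok/s -> ~{implied_bandwidth_gbps(rate):.0f} GB/s of weight reads")
```

On the table's numbers this works out to roughly 433 GB/s on the M4 Max, and about 66 GB/s falling to about 40 GB/s on the iPhone as it throttles.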
+ ## Files
+
+ | path | what | size |
+ |---|---|---:|
+ | `gpu-pipelined/qwen3_vl_4b_instruct_decode_int8hu_s1/` | text decoder LanguageBundle (SHIP: int8 per-block-32 body + untied absmax int8 head; tokenizer + metadata included) | 4.7 GB |
+ | `gpu-pipelined/qwen3_vl_4b_instruct_vision/` | fixed-grid vision encoder (448×448 → 196 tokens + DeepStack), fp16 | 0.79 GB |
+
+ ## How it works (short version)
+
+ The text-only pipelined engine carries the VLM through an id-space trick,
+ with no engine code changes beyond the published
+ [static-inputs patch](https://github.com/john-rocky/coreai-model-zoo/tree/main/apps):
+
+ - the vision encoder runs once per image; its embeddings ride **4 static
+ graph inputs** (rewritable owned `MTLBuffer`s),
+ - the prompt's `<|image_pad|>` ids become **extension ids `vocab + slot`**;
+ the graph selects text-table vs image-embed rows per token and applies the
+ three DeepStack adds the same way,
+ - **interleaved M-RoPE is derived in-graph from (ids, position) alone**:
+ image tokens self-locate, text tokens use a host-set shift; with zero
+ embeds the same bundle is a plain Qwen3 text LLM.
+
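The id-space trick in the list above can be illustrated with a small host-side sketch. This is not the zoo's graph code: `VOCAB`, `IMAGE_PAD`, and every function name are hypothetical, the 14×14 grid assumes 16-pixel patches with 2×2 merging (which reproduces the 196 tokens quoted above), and the position rule assumes the Qwen2-VL-style `get_rope_index` convention for a single image. The frequency interleaving of M-RoPE and the DeepStack adds are left out.

```python
VOCAB = 1_000                 # toy vocabulary size; the real value comes from the model config
IMAGE_PAD = 999               # hypothetical id standing in for <|image_pad|>
GRID = 448 // 16 // 2         # 14, assuming 16-px patches and 2x2 spatial merging
N_IMG = GRID * GRID           # 196 image tokens


def to_extension_ids(ids):
    """Rewrite each image-pad id as VOCAB + slot, numbering image tokens in order."""
    out, slot = [], 0
    for tok in ids:
        if tok == IMAGE_PAD:
            out.append(VOCAB + slot)
            slot += 1
        else:
            out.append(tok)
    return out


def select_row(tok, text_table, image_embeds):
    """Text-table row for ordinary ids, image-embed row for extension ids."""
    return image_embeds[tok - VOCAB] if tok >= VOCAB else text_table[tok]


def mrope_position(tok, index, text_shift):
    """(t, h, w) position from the id and sequence index alone (single-image assumption)."""
    if tok >= VOCAB:
        slot = tok - VOCAB
        start = index - slot              # the image token recovers where the image began
        row, col = divmod(slot, GRID)
        return (start, start + row, start + col)
    p = index + text_shift                # text: plain index plus a host-chosen shift
    return (p, p, p)


prompt = [5, 6] + [IMAGE_PAD] * N_IMG + [7]
ids = to_extension_ids(prompt)

# Text before the image uses shift 0; text after it uses GRID - N_IMG,
# because the image advances positions by GRID while occupying N_IMG slots.
after_image_shift = GRID - N_IMG
print(mrope_position(ids[2], 2, 0))                      # first image token
print(mrope_position(ids[197], 197, 0))                  # last image token
print(mrope_position(ids[198], 198, after_image_shift))  # first text token after the image
```

With all image embeddings zero and no extension ids in the prompt, `select_row` only ever reads the text table, which mirrors the text-only fallback the list describes.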
+ Numerics are gated the zoo way: fp32-HF oracle → torch ladder (position
+ formula exact vs `get_rope_index`, 36/36 layers) → `.aimodel` GPU → engine ≡
+ python 24/24 → device 24/24.
+
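The two kinds of check named in that ladder, per-layer cosine similarity and token-for-token agreement, can be sketched as follows. The function names and the `min_cos` threshold are hypothetical and not taken from the zoo's scripts; the toy vectors only stand in for real activations.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def layers_passing(ref_layers, test_layers, min_cos=0.9995):
    """Count layers whose output stays within the (assumed) cosine threshold."""
    return sum(cosine(r, t) >= min_cos for r, t in zip(ref_layers, test_layers))


def tokens_matching(ref_tokens, test_tokens):
    """Count positions where the two greedy decodes emit the same token id."""
    return sum(r == t for r, t in zip(ref_tokens, test_tokens))


# Toy data: two "layers" with a tiny perturbation, and a 4-token decode.
ref_layers = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
test_layers = [[1.0, 2.0, 3.001], [0.5, -1.0, 2.0]]
ref_tokens = [11, 42, 7, 3]
test_tokens = [11, 42, 7, 3]

print(f"layers: {layers_passing(ref_layers, test_layers)}/{len(ref_layers)}")
print(f"tokens: {tokens_matching(ref_tokens, test_tokens)}/{len(ref_tokens)}")
```

A gate of this shape reports results in the same `n/n` form used above, such as 36/36 layers or 24/24 tokens.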
+ ## Run it
+
+ See the zoo's `apps/CoreAIChat` (iOS) Qwen3-VL mode and the run contract
+ (S=1 prefill, `COREAI_CHUNK_THRESHOLD=1`, never `engine.warmup()`) in
+ [knowledge/pipelined-engine.md](https://github.com/john-rocky/coreai-model-zoo/blob/main/knowledge/pipelined-engine.md).
+
+ Conversion is reproducible from the zoo:
+ `conversion/export_qwen3_vl_pipelined.py int8hu --hf-id Qwen/Qwen3-VL-4B-Instruct`.
+
+ ## License
+
+ Apache-2.0 (inherited from Qwen3-VL-4B-Instruct). Conversion code BSD-3-Clause
+ (zoo repo).