| --- |
| license: apache-2.0 |
| base_model: |
| - Qwen/Qwen3.6-27B |
| tags: |
| - core-ai |
| - coreai |
| - apple |
| - qwen3_5 |
| - dense |
| - text-generation |
| language: |
| - en |
| pipeline_tag: text-generation |
| library_name: coreai |
| --- |
| |
| # Qwen3.6-27B — Core AI (Apple) bundle |
|
|
| The **dense Mac-class companion** to [Qwen3.6-35B-A3B-CoreAI](https://huggingface.co/mlboydaisuke/Qwen3.6-35B-A3B-CoreAI), |
| converted for Apple's **Core AI** runtime (iOS/macOS 26+ successor to Core ML). |
| Source: [`Qwen/Qwen3.6-27B`](https://huggingface.co/Qwen/Qwen3.6-27B) (text decoder). |
|
|
| Where the 35B-A3B is a sparse MoE, the 27B is the **same Qwen3.5 hybrid decoder run dense** — |
| no experts, no router, just the proven token mixers at scale. 64 layers on a 3:1 interleave of |
| **GatedDeltaNet** linear-attention mixers and **gated full attention**: |
|
|
| - full attention: head_dim 256, **GQA 24 query / 4 KV heads**, partial mRoPE θ=1e7, swish output gate; |
| - linear (GatedDeltaNet): **48 value heads over 16 key heads** (GVA — each k/q head shared across |
| *three* value heads, vs the 35B's two); |
| - every FFN a dense `MLP(17408)` (no MoE); untied 248320-vocab `lm_head`. |
|
|
| **27B parameters, all dense → the entire model is read per token.** Unlike the 35B-A3B |
| (≈3B active), there is no sparsity to hide behind: this is a true 27B-class decode — the quality |
| of a large dense model at the memory-bandwidth speed that implies on a Mac. |
|
|
| ## Bundle |
|
|
| `gpu-pipelined/qwen3_6_27b_decode_int8hu_block32_sym/` — a ready-to-run **Core AI LanguageBundle** |
| (`.aimodel` + `metadata.json` + tokenizer), **28 GB**, decode-only loop-free for Apple's pipelined |
| GPU engine. int8 linear per-block-32 weights + an absmax int8 untied head (`int8hu --head-sym`). |
|
|
| ## Measured (macOS 27 beta, M4 Max 128 GB, `llm-benchmark`, `COREAI_CHUNK_THRESHOLD=1`) |
|
|
| | metric | value | |
| |---|---| |
| | decode | **15.9 tok/s** | |
| | prefill | 15.8 tok/s (pipelined S=1) | |
| | bundle | 28 GB | |
| | numerics | int8 == full precision at every confident position (teacher-forced vs bf16 HF oracle) | |
|
|
| **Numerics in full** — 27B fp32 would need ~111 GB RAM, so the oracle is the checkpoint's native |
| **bf16**; the gate is teacher-forced single-step argmax under an oracle-margin≥0.1 rule. The result |
| is cleaner than the 35B-A3B's: **int8 adds zero confident-margin flips over full precision.** Both |
| `int8hu` and an fp16 control score 15/16 vs the bf16 oracle and fail the *same* position (margin |
| 0.50), where fp16 flips **byte-identically to int8** — a bf16-oracle-resolution artifact, not an |
| int8 defect. The only int8-vs-fp16 difference anywhere is one sub-0.1-margin tie. |
|
|
| **Speed is bandwidth-bound, as a dense 27B at int8 must be:** ~28 GB/token → 15.9 tok/s is ~87 % of |
| the M4 Max memory-bandwidth ceiling. (The 35B-A3B decodes *faster* than this despite more |
| parameters, because only ~3B are active per token — that is the MoE's whole point.) |
|
|
| **int4 is a size/speed option, not the quality ship.** A linear int4 bundle (`int4lin`, ~14 GB, |
| ~2× decode) gates 15/16 too but pays a real cost: it flips a high-confidence position that fp16 and |
| int8 both get right, and its per-position cosine is systematically lower. A **mixed-precision** |
| middle ground (MLP int4 / attention·GatedDeltaNet·head int8) was tested and rejected — keeping the |
| mixers at int8 repairs int4's flip (confirming the attention/GDN path, not the FFN, is the |
| 4-bit-sensitive part), but the int4 MLP then introduces its *own* confident flip that edge-layer |
| int8 cannot fix. So there is no quality-preserving speedup between int8 (clean, 15.9 tok/s) and int4 |
| (borderline, ~30): **int8 is the quality ship**, int4 the size/speed option, nothing useful between. |
|
|
| **Mac-only:** at 28 GB this is a 64/128 GB-Mac model (far past the iPhone memory limit). |
|
|
| ## How to run |
|
|
| This is a Core AI bundle for Apple's pipelined LLM engine (`llm-benchmark` / `llm-runner` from |
| `apple/coreai-models`, plus the community pipelined extra-states patch). The conversion recipe and |
| the full write-up live in the community zoo: |
| [github.com/john-rocky/coreai-model-zoo](https://github.com/john-rocky/coreai-model-zoo) |
| (`zoo/qwen3.6-27b.md`). The decoder reuses the shared `qwen3_5.py` overlay directly — no MoE files. |
|
|
| ```bash |
| COREAI_CHUNK_THRESHOLD=1 llm-benchmark \ |
| --model gpu-pipelined/qwen3_6_27b_decode_int8hu_block32_sym -p 64 -g 128 -n 3 |
| ``` |
|
|