--- license: apache-2.0 language: - en - zh base_model: Qwen/Qwen3.6-27B tags: - mlx - mlx-node - quantized - awq - 2-bit - qwen3.6 - hybrid-attention - gated-delta-net - apple-silicon - unsloth-dynamic library_name: mlx-node quantized_by: mlx-node pipeline_tag: text-generation model_type: qwen3_5 --- # Qwen3.6-27B — UD-Q2_K_XL (mlx-node) 2-bit base mixed-precision quantization of [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) for Apple Silicon, using the [**Unsloth Dynamic** quantization strategy](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) via [mlx-node](https://github.com/mlx-node/mlx-node). | | Original (BF16) | This Model | |---|---|---| | **Size** | ~51 GB | **15 GB** | | **Format** | SafeTensors (sharded) | SafeTensors (sharded) | | **Precision** | BF16 uniform | Mixed 2-bit + BF16 | ## All Variants | Repo | GGUF Equivalent | Size | Decode (tok/s) | |---|---|---|---| | **[Brooooooklyn/Qwen3.6-27B-UD-Q2_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q2_K_XL-mlx) (this model)** | **UD-Q2_K_XL** | **15 GB** | **20.0** | | [Brooooooklyn/Qwen3.6-27B-UD-Q3_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q3_K_XL-mlx) | UD-Q3_K_XL | 18 GB | 16.2 | | [Brooooooklyn/Qwen3.6-27B-UD-NVFP4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-NVFP4_K_XL-mlx) | — | 21 GB | 14.9 | | [Brooooooklyn/Qwen3.6-27B-UD-MXFP4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-MXFP4_K_XL-mlx) | — | 21 GB | 15.9 | | [Brooooooklyn/Qwen3.6-27B-UD-Q4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q4_K_XL-mlx) | UD-Q4_K_XL | 21 GB | 15.3 | | [Brooooooklyn/Qwen3.6-27B-UD-Q5_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q5_K_XL-mlx) | UD-Q5_K_XL | 25 GB | 13.4 | | [Brooooooklyn/Qwen3.6-27B-UD-Q6_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q6_K_XL-mlx) | UD-Q6_K_XL | 27 GB | 12.4 | | [Brooooooklyn/Qwen3.6-27B-UD-MXFP8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-MXFP8_K_XL-mlx) | — | 29 GB | 10.5 | | [Brooooooklyn/Qwen3.6-27B-UD-Q8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q8_K_XL-mlx) | UD-Q8_K_XL | 30 GB | 9.9 | Benchmarked on Apple M3 Max 128GB via [`examples/lm.ts`](https://github.com/mlx-node/mlx-node/blob/main/examples/lm.ts) (best decode tok/s across turns 2–4, steady-state). ## Performance Steady-state decode: **20.0 tok/s** on Apple M3 Max 128GB (best of turns 2–4, `examples/lm.ts` capitals chat with `reasoningEffort: 'low'`). Decode is memory-bandwidth bound on Apple Silicon — fewer bytes per token directly translates to higher throughput. ## Per-Tensor Bit Assignments (N=2) | Weight | Bits | Rationale | |---|---|---| | `embed_tokens` | 4-bit | KLD ~0.15 — very low sensitivity | | `lm_head` | 5-bit | KLD ~0.05 — safest tensor | | `self_attn.q/k/v_proj` | 4-bit + AWQ | KLD ~1.5–2.9, AWQ via layernorm | | `linear_attn.in_proj_qkv/z` | 4-bit + AWQ | KLD ~2.9, AWQ via layernorm | | `self_attn.o_proj` | **bf16** | NOT AWQ-correctable | | `linear_attn.out_proj` | **bf16** | KLD ~6.0 — worst tensor | | `down_proj` | 3-bit | "Slightly more sensitive" | | `gate_proj`, `up_proj` | 2-bit | base bits | | GDN params (A_log, etc) | **bf16** | State-space dynamics | ## Quantization Strategy Based on [Unsloth Dynamic 2.0](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) per-tensor KLD analysis. Sensitive layers get higher bits with AWQ correction, while the bulk of FFN weights are aggressively quantized. imatrix AWQ pre-scaling amplifies important weight channels and fuses inverse scales into preceding layer norms (zero inference overhead). **AWQ-correctable** projections (q/k/v, in_proj_qkv/z) are quantized at 4-bit via `input_layernorm`. **Non-AWQ-correctable** projections (o_proj, out_proj) are kept at bf16 — their inputs come from attention/GDN computation, not from a norm layer. ## Architecture | Parameter | Value | |---|---| | Total parameters | 27.4B (dense — all active) | | Hidden size | 5,120 | | Layers | 64 (48 linear + 16 full attention) | | Attention heads | 24 (4 KV heads, GQA 6:1) | | Head dimension | 256 | | Intermediate size | 17,408 | | Vocab size | 248,320 | | Max context | 262,144 tokens | ## Usage ```typescript import { loadSession } from '@mlx-node/lm'; const session = await loadSession('./Qwen3.6-27B-UD-Q2_K_XL-mlx'); for await (const event of session.sendStream('Explain the hybrid attention mechanism in Qwen3.6.', { config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' }, })) { if (!event.done) process.stdout.write(event.text); } ``` ## How It Was Made ```bash mlx convert \ -i Qwen3.6-27B \ -o Qwen3.6-27B-UD-Q2_K_XL-mlx \ -q --q-bits 2 --q-recipe unsloth \ --imatrix-path imatrix_unsloth.gguf ``` ## Acknowledgments - **[Unsloth](https://unsloth.ai)** — Quantization strategy based on their [per-layer KLD benchmarks](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) and Dynamic 2.0 methodology - **[Qwen Team](https://huggingface.co/Qwen)** — For the Qwen3.6 model family - **[Apple MLX](https://github.com/ml-explore/mlx)** — For the Metal-accelerated ML framework ## License [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) (inherited from base model).