--- license: apache-2.0 language: - en - zh base_model: Qwen/Qwen3.6-27B tags: - mlx - mlx-node - quantized - awq - nvfp4 - micro-scaling-fp - qwen3.6 - hybrid-attention - gated-delta-net - apple-silicon - unsloth-dynamic library_name: mlx-node quantized_by: mlx-node pipeline_tag: text-generation model_type: qwen3_5 --- # Qwen3.6-27B — UD-NVFP4_K_XL (mlx-node) NVFP4 (NVIDIA Blackwell FP4) quantization of [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) for Apple Silicon, using the [**Unsloth Dynamic** quantization strategy](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) via [mlx-node](https://github.com/mlx-node/mlx-node). | | Original (BF16) | UD-Q4_K_XL (affine) | This Model | |---|---|---|---| | **Size** | ~66 GB | 21 GB | **21 GB** | | **Format** | SafeTensors | SafeTensors | SafeTensors | | **Precision** | BF16 uniform | 4-bit affine + BF16 | NVFP4 (E4M3 scales) + mixed affine + BF16 | | **FFN group size** | — | 64 | **16 (nvfp4) / 64 (affine)** | | **Biases** | — | yes | **no (FFN nvfp4); yes (affine layers)** | ## What is NVFP4? NVFP4 is NVIDIA's [Blackwell FP4 micro-scaling format](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/). Each group of **16** elements shares a single 8-bit **E4M3** scale (a full FP8 number with mantissa, not just an exponent), and elements themselves are stored as E2M1 FP4 values. Compared to MXFP4 (OCP): - **Higher fidelity scale**: E4M3 scale has 3 mantissa bits, MXFP4's E8M0 only encodes a power-of-two - **Smaller group**: 16 vs. 32 — fewer outliers per scale, better dynamic range tracking - **Higher scale density**: 2× the scales per weight (16 elements vs 32 per scale), so per-byte overhead is roughly the same as MXFP4 despite the richer scale type - **No biases**: zero-point implicit (FP4 covers ±range) For dense LLM weights, NVFP4 typically beats MXFP4 on perplexity at the same nominal bit budget, with the trade-off of slightly more metadata per weight. ## All Variants | Repo | GGUF Equivalent | Size | Decode (tok/s) | |---|---|---|---| | [Brooooooklyn/Qwen3.6-27B-UD-Q2_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q2_K_XL-mlx) | UD-Q2_K_XL | 15 GB | 20.0 | | [Brooooooklyn/Qwen3.6-27B-UD-Q3_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q3_K_XL-mlx) | UD-Q3_K_XL | 18 GB | 16.2 | | [Brooooooklyn/Qwen3.6-27B-UD-MXFP4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-MXFP4_K_XL-mlx) | — | 21 GB | 15.9 | | **[Brooooooklyn/Qwen3.6-27B-UD-NVFP4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-NVFP4_K_XL-mlx) (this model)** | **—** | **21 GB** | **14.9** | | [Brooooooklyn/Qwen3.6-27B-UD-Q4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q4_K_XL-mlx) | UD-Q4_K_XL | 21 GB | 15.3 | | [Brooooooklyn/Qwen3.6-27B-UD-Q5_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q5_K_XL-mlx) | UD-Q5_K_XL | 25 GB | 13.4 | | [Brooooooklyn/Qwen3.6-27B-UD-Q6_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q6_K_XL-mlx) | UD-Q6_K_XL | 27 GB | 12.4 | | [Brooooooklyn/Qwen3.6-27B-UD-MXFP8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-MXFP8_K_XL-mlx) | — | 29 GB | 10.5 | | [Brooooooklyn/Qwen3.6-27B-UD-Q8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q8_K_XL-mlx) | UD-Q8_K_XL | 30 GB | 9.9 | Benchmarked on Apple M3 Max 128GB via [`examples/lm.ts`](https://github.com/mlx-node/mlx-node/blob/main/examples/lm.ts) (best decode tok/s across turns 2–4, steady-state). ## Performance Steady-state decode: **14.9 tok/s** on Apple M3 Max 128GB (best of turns 2–4, `examples/lm.ts` capitals chat with `reasoningEffort: 'low'`). Decode is memory-bandwidth bound on Apple Silicon — fewer bytes per token directly translates to higher throughput. This NVFP4 build preserves bf16 escape hatches for the most sensitive tensors (`linear_attn.out_proj`, `self_attn.o_proj`), so it trades a few tok/s for measurably better quality vs. a uniform-FP4 layout. ## Per-Tensor Bit Assignments (N=4) | Weight | Mode | Bits | Group | Rationale | |---|---|---|---|---| | `embed_tokens` | 6-bit affine | 6 | 64 | Loader is affine-only; nvfp upgrade skipped (unsloth `base+2`) | | `lm_head` | 8-bit affine | 8 | 64 | Loader is affine-only; nvfp upgrade skipped (unsloth `base+3` snapped 7→8) | | `self_attn.q/k/v_proj` | 6-bit affine + AWQ | 6 | 64 | AWQ via input_layernorm; preserved at higher bits (unsloth `base+2`) | | `linear_attn.in_proj_qkv/z` | 6-bit affine + AWQ | 6 | 64 | AWQ via input_layernorm; preserved at higher bits (unsloth `base+2`) | | `self_attn.o_proj` | **bf16** | — | — | NOT AWQ-correctable | | `linear_attn.out_proj` | **bf16** | — | — | KLD ~6.0 — worst tensor, kept full-precision | | `mlp.down_proj` | 5-bit affine | 5 | 64 | "Slightly more sensitive" (unsloth `base+1`) | | `mlp.gate_proj`, `mlp.up_proj` | **nvfp4** | 4 | 16 | Unsloth UD-Q4 base — promoted to NVFP4 | | GDN params (A_log, etc) | **bf16** | — | — | State-space dynamics | ## Quantization Strategy Built on Unsloth Dynamic 2.0 per-tensor KLD analysis. At `--q-bits 4` the unsloth recipe's per-layer bit offsets become 4-bit FFN gate/up (promoted to NVFP4), 5-bit `down_proj`, 6-bit attn/SSM projections with AWQ pre-scaling, 6-bit `embed_tokens`, and 8-bit `lm_head`. Then `--q-mode nvfp4` orthogonally promotes only the 4-bit affine decisions to NVFP4 (`mode="nvfp4", bits=4, group_size=16`) — non-4-bit decisions stay affine at their original bit width, and AWQ-uncorrectable projections (`o_proj`, `out_proj`) stay bf16. imatrix AWQ pre-scaling amplifies important weight channels and fuses inverse scales into preceding layer norms (zero inference overhead). **AWQ-correctable** projections (q/k/v, in_proj_qkv/z) get the AWQ pass; **non-AWQ-correctable** projections (o_proj, out_proj) stay bf16 — their inputs come from attention/GDN computation, not from a norm layer. ## Architecture | Parameter | Value | |---|---| | Total parameters | 27.4B (dense — all active) | | Hidden size | 5,120 | | Layers | 64 (48 linear + 16 full attention) | | Attention heads | 24 (4 KV heads, GQA 6:1) | | Head dimension | 256 | | Intermediate size | 17,408 | | Vocab size | 248,320 | | Max context | 262,144 tokens | ## Usage ```typescript import { loadSession } from '@mlx-node/lm'; const session = await loadSession('./Qwen3.6-27B-UD-NVFP4_K_XL-mlx'); for await (const event of session.sendStream('Explain NVFP4 vs MXFP4 quantization.', { config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' }, })) { if (!event.done) process.stdout.write(event.text); } ``` ## How It Was Made ```bash mlx convert \ -i Qwen3.6-27B \ -o Qwen3.6-27B-UD-NVFP4_K_XL-mlx \ -q --q-mode nvfp4 --q-recipe unsloth \ --imatrix-path imatrix_unsloth.gguf ``` `--q-mode nvfp4` selects NVIDIA's NVFP4 micro-scaling format as the global dequantizer (`bits=4, group_size=16`). Combined with `--q-recipe unsloth`, the recipe emits per-tensor affine decisions for sensitive layers (q/k/v + AWQ, down_proj, lm_head, embed_tokens at higher bits; o_proj/out_proj/GDN as bf16), and the converter promotes the remaining 4-bit affine decisions (gate_proj/up_proj) to NVFP4. `--q-bits 4` is implicit when `--q-mode nvfp4` is set. ## Acknowledgments - **[Unsloth](https://unsloth.ai)** — Quantization strategy based on their [per-layer KLD benchmarks](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) and Dynamic 2.0 methodology - **[NVIDIA NVFP4](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)** — For the NVFP4 specification - **[Qwen Team](https://huggingface.co/Qwen)** — For the Qwen3.6 model family - **[Apple MLX](https://github.com/ml-explore/mlx)** — For the Metal-accelerated ML framework ## License [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) (inherited from base model).