---
license: apache-2.0
language:
  - en
  - zh
base_model: Qwen/Qwen3.6-27B
tags:
  - mlx
  - mlx-node
  - quantized
  - awq
  - nvfp4
  - micro-scaling-fp
  - qwen3.6
  - hybrid-attention
  - gated-delta-net
  - apple-silicon
  - unsloth-dynamic
library_name: mlx-node
quantized_by: mlx-node
pipeline_tag: text-generation
model_type: qwen3_5
---

# Qwen3.6-27B — UD-NVFP4_K_XL (mlx-node)

NVFP4 (NVIDIA Blackwell FP4) quantization of [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) for Apple Silicon, using the [**Unsloth Dynamic** quantization strategy](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) via [mlx-node](https://github.com/mlx-node/mlx-node).

| | Original (BF16) | UD-Q4_K_XL (affine) | This Model |
|---|---|---|---|
| **Size** | ~66 GB | 21 GB | **21 GB** |
| **Format** | SafeTensors | SafeTensors | SafeTensors |
| **Precision** | BF16 uniform | 4-bit affine + BF16 | NVFP4 (E4M3 scales) + mixed affine + BF16 |
| **FFN group size** | — | 64 | **16 (nvfp4) / 64 (affine)** |
| **Biases** | — | yes | **no (FFN nvfp4); yes (affine layers)** |

## What is NVFP4?

NVFP4 is NVIDIA's [Blackwell FP4 micro-scaling format](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/). Each group of **16** elements shares a single 8-bit **E4M3** scale (a full FP8 number with mantissa, not just an exponent), and elements themselves are stored as E2M1 FP4 values. Compared to MXFP4 (OCP):

- **Higher fidelity scale**: E4M3 scale has 3 mantissa bits, MXFP4's E8M0 only encodes a power-of-two
- **Smaller group**: 16 vs. 32 — fewer outliers per scale, better dynamic range tracking
- **Higher scale density**: 2× the scales per weight (16 elements vs 32 per scale), so per-byte overhead is roughly the same as MXFP4 despite the richer scale type
- **No biases**: zero-point implicit (FP4 covers ±range)

For dense LLM weights, NVFP4 typically beats MXFP4 on perplexity at the same nominal bit budget, with the trade-off of slightly more metadata per weight.

## All Variants

| Repo | GGUF Equivalent | Size | Decode (tok/s) |
|---|---|---|---|
| [Brooooooklyn/Qwen3.6-27B-UD-Q2_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q2_K_XL-mlx) | UD-Q2_K_XL | 15 GB | 20.0 |
| [Brooooooklyn/Qwen3.6-27B-UD-Q3_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q3_K_XL-mlx) | UD-Q3_K_XL | 18 GB | 16.2 |
| [Brooooooklyn/Qwen3.6-27B-UD-MXFP4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-MXFP4_K_XL-mlx) | — | 21 GB | 15.9 |
| **[Brooooooklyn/Qwen3.6-27B-UD-NVFP4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-NVFP4_K_XL-mlx) (this model)** | **—** | **21 GB** | **14.9** |
| [Brooooooklyn/Qwen3.6-27B-UD-Q4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q4_K_XL-mlx) | UD-Q4_K_XL | 21 GB | 15.3 |
| [Brooooooklyn/Qwen3.6-27B-UD-Q5_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q5_K_XL-mlx) | UD-Q5_K_XL | 25 GB | 13.4 |
| [Brooooooklyn/Qwen3.6-27B-UD-Q6_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q6_K_XL-mlx) | UD-Q6_K_XL | 27 GB | 12.4 |
| [Brooooooklyn/Qwen3.6-27B-UD-MXFP8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-MXFP8_K_XL-mlx) | — | 29 GB | 10.5 |
| [Brooooooklyn/Qwen3.6-27B-UD-Q8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q8_K_XL-mlx) | UD-Q8_K_XL | 30 GB | 9.9 |

Benchmarked on Apple M3 Max 128GB via [`examples/lm.ts`](https://github.com/mlx-node/mlx-node/blob/main/examples/lm.ts) (best decode tok/s across turns 2–4, steady-state).

## Performance

Steady-state decode: **14.9 tok/s** on Apple M3 Max 128GB (best of turns 2–4, `examples/lm.ts` capitals chat with `reasoningEffort: 'low'`). Decode is memory-bandwidth bound on Apple Silicon — fewer bytes per token directly translates to higher throughput. This NVFP4 build preserves bf16 escape hatches for the most sensitive tensors (`linear_attn.out_proj`, `self_attn.o_proj`), so it trades a few tok/s for measurably better quality vs. a uniform-FP4 layout.

## Per-Tensor Bit Assignments (N=4)

| Weight | Mode | Bits | Group | Rationale |
|---|---|---|---|---|
| `embed_tokens` | 6-bit affine | 6 | 64 | Loader is affine-only; nvfp upgrade skipped (unsloth `base+2`) |
| `lm_head` | 8-bit affine | 8 | 64 | Loader is affine-only; nvfp upgrade skipped (unsloth `base+3` snapped 7→8) |
| `self_attn.q/k/v_proj` | 6-bit affine + AWQ | 6 | 64 | AWQ via input_layernorm; preserved at higher bits (unsloth `base+2`) |
| `linear_attn.in_proj_qkv/z` | 6-bit affine + AWQ | 6 | 64 | AWQ via input_layernorm; preserved at higher bits (unsloth `base+2`) |
| `self_attn.o_proj` | **bf16** | — | — | NOT AWQ-correctable |
| `linear_attn.out_proj` | **bf16** | — | — | KLD ~6.0 — worst tensor, kept full-precision |
| `mlp.down_proj` | 5-bit affine | 5 | 64 | "Slightly more sensitive" (unsloth `base+1`) |
| `mlp.gate_proj`, `mlp.up_proj` | **nvfp4** | 4 | 16 | Unsloth UD-Q4 base — promoted to NVFP4 |
| GDN params (A_log, etc) | **bf16** | — | — | State-space dynamics |

## Quantization Strategy

Built on Unsloth Dynamic 2.0 per-tensor KLD analysis. At `--q-bits 4` the unsloth recipe's per-layer bit offsets become 4-bit FFN gate/up (promoted to NVFP4), 5-bit `down_proj`, 6-bit attn/SSM projections with AWQ pre-scaling, 6-bit `embed_tokens`, and 8-bit `lm_head`. Then `--q-mode nvfp4` orthogonally promotes only the 4-bit affine decisions to NVFP4 (`mode="nvfp4", bits=4, group_size=16`) — non-4-bit decisions stay affine at their original bit width, and AWQ-uncorrectable projections (`o_proj`, `out_proj`) stay bf16.

imatrix AWQ pre-scaling amplifies important weight channels and fuses inverse scales into preceding layer norms (zero inference overhead). **AWQ-correctable** projections (q/k/v, in_proj_qkv/z) get the AWQ pass; **non-AWQ-correctable** projections (o_proj, out_proj) stay bf16 — their inputs come from attention/GDN computation, not from a norm layer.

## Architecture

| Parameter | Value |
|---|---|
| Total parameters | 27.4B (dense — all active) |
| Hidden size | 5,120 |
| Layers | 64 (48 linear + 16 full attention) |
| Attention heads | 24 (4 KV heads, GQA 6:1) |
| Head dimension | 256 |
| Intermediate size | 17,408 |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |

## Usage

```typescript
import { loadSession } from '@mlx-node/lm';

const session = await loadSession('./Qwen3.6-27B-UD-NVFP4_K_XL-mlx');

for await (const event of session.sendStream('Explain NVFP4 vs MXFP4 quantization.', {
  config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' },
})) {
  if (!event.done) process.stdout.write(event.text);
}
```

## How It Was Made

```bash
mlx convert \
  -i Qwen3.6-27B \
  -o Qwen3.6-27B-UD-NVFP4_K_XL-mlx \
  -q --q-mode nvfp4 --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf
```

`--q-mode nvfp4` selects NVIDIA's NVFP4 micro-scaling format as the global dequantizer (`bits=4, group_size=16`). Combined with `--q-recipe unsloth`, the recipe emits per-tensor affine decisions for sensitive layers (q/k/v + AWQ, down_proj, lm_head, embed_tokens at higher bits; o_proj/out_proj/GDN as bf16), and the converter promotes the remaining 4-bit affine decisions (gate_proj/up_proj) to NVFP4. `--q-bits 4` is implicit when `--q-mode nvfp4` is set.

## Acknowledgments

- **[Unsloth](https://unsloth.ai)** — Quantization strategy based on their [per-layer KLD benchmarks](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) and Dynamic 2.0 methodology
- **[NVIDIA NVFP4](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)** — For the NVFP4 specification
- **[Qwen Team](https://huggingface.co/Qwen)** — For the Qwen3.6 model family
- **[Apple MLX](https://github.com/ml-explore/mlx)** — For the Metal-accelerated ML framework

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) (inherited from base model).