---
license: apache-2.0
language:
  - en
  - zh
base_model: Qwen/Qwen3.6-27B
tags:
  - mlx
  - mlx-node
  - quantized
  - awq
  - 2-bit
  - qwen3.6
  - hybrid-attention
  - gated-delta-net
  - apple-silicon
  - unsloth-dynamic
library_name: mlx-node
quantized_by: mlx-node
pipeline_tag: text-generation
model_type: qwen3_5
---

# Qwen3.6-27B — UD-Q2_K_XL (mlx-node)

Mixed-precision quantization (2-bit base) of [Qwen/Qwen3.6-27B](https://huggingface.co/Qwen/Qwen3.6-27B) for Apple Silicon, using the [**Unsloth Dynamic** quantization strategy](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) via [mlx-node](https://github.com/mlx-node/mlx-node).

| | Original (BF16) | This Model |
|---|---|---|
| **Size** | ~51 GB | **15 GB** |
| **Format** | SafeTensors (sharded) | SafeTensors (sharded) |
| **Precision** | BF16 uniform | Mixed 2-bit + BF16 |

## All Variants

| Repo | GGUF Equivalent | Size | Decode (tok/s) |
|---|---|---|---|
| **[Brooooooklyn/Qwen3.6-27B-UD-Q2_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q2_K_XL-mlx) (this model)** | **UD-Q2_K_XL** | **15 GB** | **20.0** |
| [Brooooooklyn/Qwen3.6-27B-UD-Q3_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q3_K_XL-mlx) | UD-Q3_K_XL | 18 GB | 16.2 |
| [Brooooooklyn/Qwen3.6-27B-UD-NVFP4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-NVFP4_K_XL-mlx) | — | 21 GB | 14.9 |
| [Brooooooklyn/Qwen3.6-27B-UD-MXFP4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-MXFP4_K_XL-mlx) | — | 21 GB | 15.9 |
| [Brooooooklyn/Qwen3.6-27B-UD-Q4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q4_K_XL-mlx) | UD-Q4_K_XL | 21 GB | 15.3 |
| [Brooooooklyn/Qwen3.6-27B-UD-Q5_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q5_K_XL-mlx) | UD-Q5_K_XL | 25 GB | 13.4 |
| [Brooooooklyn/Qwen3.6-27B-UD-Q6_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q6_K_XL-mlx) | UD-Q6_K_XL | 27 GB | 12.4 |
| [Brooooooklyn/Qwen3.6-27B-UD-MXFP8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-MXFP8_K_XL-mlx) | — | 29 GB | 10.5 |
| [Brooooooklyn/Qwen3.6-27B-UD-Q8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.6-27B-UD-Q8_K_XL-mlx) | UD-Q8_K_XL | 30 GB | 9.9 |

Benchmarked on Apple M3 Max 128GB via [`examples/lm.ts`](https://github.com/mlx-node/mlx-node/blob/main/examples/lm.ts) (best decode tok/s across turns 2–4, steady-state).

## Performance

Steady-state decode: **20.0 tok/s** on Apple M3 Max 128GB (best of turns 2–4, `examples/lm.ts` capitals chat with `reasoningEffort: 'low'`). Decode is memory-bandwidth bound on Apple Silicon, so fewer bytes read per token translate directly into higher throughput.
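
The bandwidth argument can be made concrete with a back-of-the-envelope ceiling: if every generated token streams roughly all weights once, decode speed cannot exceed bandwidth divided by model size. The sketch below assumes a nominal 400 GB/s memory bandwidth for this M3 Max configuration (a figure not stated in this card) and treats the table sizes as decimal gigabytes; `decodeCeilingTokPerSec` is a hypothetical helper, not part of mlx-node.

```typescript
// ASSUMPTION: nominal peak memory bandwidth of the benchmark machine.
// This number is not taken from the card; adjust it for your hardware.
const ASSUMED_BANDWIDTH_GB_PER_S = 400;

// Upper bound on decode speed when each token reads ~all weights once.
function decodeCeilingTokPerSec(modelSizeGB: number, bandwidthGBps: number): number {
  return bandwidthGBps / modelSizeGB;
}

// Sizes and measured speeds copied from the "All Variants" table.
const variants = [
  { name: 'UD-Q2_K_XL', sizeGB: 15, measured: 20.0 },
  { name: 'UD-Q4_K_XL', sizeGB: 21, measured: 15.3 },
  { name: 'UD-Q8_K_XL', sizeGB: 30, measured: 9.9 },
];

for (const v of variants) {
  const ceiling = decodeCeilingTokPerSec(v.sizeGB, ASSUMED_BANDWIDTH_GB_PER_S);
  const share = (v.measured / ceiling) * 100;
  console.log(
    `${v.name}: ceiling ${ceiling.toFixed(1)} tok/s, measured ${v.measured} tok/s (${share.toFixed(0)}% of ceiling)`,
  );
}
```

Under these assumptions the measured numbers sit below the ceiling for every variant, which is consistent with decode being bandwidth bound rather than compute bound.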

## Per-Tensor Bit Assignments (N=2)

| Weight | Bits | Rationale |
|---|---|---|
| `embed_tokens` | 4-bit | KLD ~0.15 — very low sensitivity |
| `lm_head` | 5-bit | KLD ~0.05 — safest tensor |
| `self_attn.q/k/v_proj` | 4-bit + AWQ | KLD ~1.5–2.9, AWQ via layernorm |
| `linear_attn.in_proj_qkv/z` | 4-bit + AWQ | KLD ~2.9, AWQ via layernorm |
| `self_attn.o_proj` | **bf16** | NOT AWQ-correctable |
| `linear_attn.out_proj` | **bf16** | KLD ~6.0 — worst tensor |
| `down_proj` | 3-bit | "Slightly more sensitive" |
| `gate_proj`, `up_proj` | 2-bit | Base bits (N); bulk of FFN weights |
| GDN params (A_log, etc) | **bf16** | State-space dynamics |
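
The table above can be read as a lookup from tensor name to precision. The following is a minimal sketch of that lookup, not mlx-node's actual recipe code: `precisionFor` is a hypothetical helper, the dotted tensor paths follow the usual Hugging Face naming and are an assumption, and treating every remaining `linear_attn.*` parameter as bf16 is one reading of the "A_log, etc" row.

```typescript
type Precision = 2 | 3 | 4 | 5 | 'bf16';

// Hypothetical lookup mirroring the bit-assignment table (N=2).
// Returns null for tensors the table does not mention (e.g. norm weights).
function precisionFor(tensorName: string): Precision | null {
  if (tensorName.includes('embed_tokens')) return 4;
  if (tensorName.includes('lm_head')) return 5;
  // AWQ-corrected projections fed by a layer norm.
  if (/self_attn\.(q|k|v)_proj/.test(tensorName)) return 4;
  if (/linear_attn\.in_proj_(qkv|z)/.test(tensorName)) return 4;
  // Output projections are kept in bf16.
  if (tensorName.includes('self_attn.o_proj')) return 'bf16';
  if (tensorName.includes('linear_attn.out_proj')) return 'bf16';
  // FFN weights: down_proj gets one extra bit over the base.
  if (tensorName.includes('down_proj')) return 3;
  if (/(gate|up)_proj/.test(tensorName)) return 2;
  // ASSUMPTION: any other GDN parameter (A_log and similar) stays in bf16.
  if (tensorName.includes('linear_attn.')) return 'bf16';
  return null;
}

console.log(precisionFor('model.layers.0.mlp.gate_proj.weight')); // 2
console.log(precisionFor('model.layers.3.self_attn.o_proj.weight')); // bf16
```

The specific checks come before the generic ones so that `linear_attn.out_proj` and `linear_attn.in_proj_qkv` are matched before the catch-all `linear_attn.` rule.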

## Quantization Strategy

Based on [Unsloth Dynamic 2.0](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) per-tensor KLD analysis. Sensitive layers get higher bits with AWQ correction, while the bulk of FFN weights are aggressively quantized. The imatrix-guided AWQ pre-scaling amplifies important weight channels and fuses the inverse scales into the preceding layer norms, so it adds no inference overhead.

**AWQ-correctable** projections (q/k/v, in_proj_qkv/z) are quantized at 4-bit, with their AWQ inverse scales fused into the preceding `input_layernorm`. **Non-AWQ-correctable** projections (o_proj, out_proj) are kept at bf16: their inputs come from attention/GDN computation rather than from a norm layer, so there is no norm weight to absorb the inverse scales.
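
To illustrate why fusing the inverse scales into a norm is free at inference time, here is a toy numeric sketch. It assumes a plain multiplicative norm gain and made-up values; it skips the quantization step itself and is not taken from the mlx-node implementation. Scaling each input channel of the weight by `s` and dividing the norm gain by `s` leaves the projection output unchanged.

```typescript
// y = W * (gain ⊙ xHat), where xHat is the already-normalized activation.
function matVec(W: number[][], v: number[]): number[] {
  return W.map((row) => row.reduce((acc, w, j) => acc + w * v[j], 0));
}

// Toy values, chosen only for illustration.
const xHat = [0.5, -1.25, 2.0]; // normalized activations
const gain = [1.0, 0.8, 1.2]; // layer-norm gain
const W = [
  [0.1, -0.2, 0.3],
  [0.4, 0.5, -0.6],
];
const s = [2.0, 0.5, 4.0]; // per-input-channel AWQ scales

// Amplify input channel j of W by s[j]; fold 1/s[j] into the norm gain.
const scaledW = W.map((row) => row.map((w, j) => w * s[j]));
const fusedGain = gain.map((g, j) => g / s[j]);

const reference = matVec(W, xHat.map((x, j) => x * gain[j]));
const fused = matVec(scaledW, xHat.map((x, j) => x * fusedGain[j]));

const maxDiff = Math.max(...reference.map((r, i) => Math.abs(r - fused[i])));
console.log({ reference, fused, maxDiff });
```

In a real pipeline `scaledW` would then be quantized, so the amplified channels lose less relative precision. A projection whose input does not come straight from a norm has no gain vector to fold `1/s` into, which matches the reason given above for keeping `o_proj` and `out_proj` in bf16.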

## Architecture

| Parameter | Value |
|---|---|
| Total parameters | 27.4B (dense — all active) |
| Hidden size | 5,120 |
| Layers | 64 (48 linear + 16 full attention) |
| Attention heads | 24 (4 KV heads, GQA 6:1) |
| Head dimension | 256 |
| Intermediate size | 17,408 |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |
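
The architecture numbers also give a rough KV-cache estimate for long contexts. This sketch assumes that only the 16 full-attention layers keep a per-token KV cache (the linear layers are assumed to hold a fixed-size recurrent state) and that the cache is stored in bf16; neither assumption is stated in this card, and `kvCacheBytes` is a hypothetical helper.

```typescript
// Values from the Architecture table.
const fullAttentionLayers = 16;
const kvHeads = 4;
const headDim = 256;

// ASSUMPTION: cache entries are stored as bf16 (2 bytes per element).
const ASSUMED_BYTES_PER_ELEMENT = 2;

// Keys and values (factor 2) for every full-attention layer and KV head.
function kvCacheBytes(tokens: number): number {
  return 2 * fullAttentionLayers * kvHeads * headDim * ASSUMED_BYTES_PER_ELEMENT * tokens;
}

const GiB = 1024 ** 3;
for (const tokens of [8_192, 65_536, 262_144]) {
  console.log(`${tokens} tokens: ${(kvCacheBytes(tokens) / GiB).toFixed(2)} GiB`);
}
```

Under these assumptions the cache costs 64 KiB per token, so the full 262,144-token context would need about 16 GiB on top of the weights.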

## Usage

```typescript
import { loadSession } from '@mlx-node/lm';

const session = await loadSession('./Qwen3.6-27B-UD-Q2_K_XL-mlx');

for await (const event of session.sendStream('Explain the hybrid attention mechanism in Qwen3.6.', {
  config: { maxNewTokens: 2048, temperature: 0.6, reasoningEffort: 'low' },
})) {
  if (!event.done) process.stdout.write(event.text);
}
```

## How It Was Made

```bash
mlx convert \
  -i Qwen3.6-27B \
  -o Qwen3.6-27B-UD-Q2_K_XL-mlx \
  -q --q-bits 2 --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf
```

## Acknowledgments

- **[Unsloth](https://unsloth.ai)** — Quantization strategy based on their [per-layer KLD benchmarks](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) and Dynamic 2.0 methodology
- **[Qwen Team](https://huggingface.co/Qwen)** — For the Qwen3.6 model family
- **[Apple MLX](https://github.com/ml-explore/mlx)** — For the Metal-accelerated ML framework

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) (inherited from base model).