---
license: apache-2.0
language:
  - en
  - zh
base_model: Qwen/Qwen3.5-9B
tags:
  - mlx
  - mlx-node
  - quantized
  - awq
  - 2-bit
  - qwen3.5
  - hybrid-attention
  - gated-delta-net
  - apple-silicon
  - unsloth-dynamic
library_name: mlx-node
quantized_by: mlx-node
pipeline_tag: text-generation
model_type: qwen3_5
---

# Qwen3.5-9B — UD-Q2_K_XL (mlx-node)

Mixed-precision quantization (2-bit base) of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) for Apple Silicon, produced with [mlx-node](https://github.com/mlx-node/mlx-node) using the [**Unsloth Dynamic** quantization strategy](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks).

| | Original (BF16) | This Model |
|---|---|---|
| **Size** | ~18 GB | **5 GB** |
| **Format** | SafeTensors (sharded) | SafeTensors (single file) |
| **Precision** | BF16 uniform | Mixed 2/3/4/5/8-bit + BF16 |
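
The size ratio gives a rough sense of the average storage cost per weight. The sketch below is plain arithmetic on the approximate sizes in the table; `avgBitsPerWeight` is a hypothetical helper, not part of mlx-node, and the result includes quantization scales and the tensors kept at BF16.

```typescript
// Rough average bits per weight, derived only from the file-size ratio.
// Assumes the original checkpoint stores every weight at `origBits` (16 for BF16).
function avgBitsPerWeight(origGb: number, quantGb: number, origBits = 16): number {
  return (origBits * quantGb) / origGb;
}

// Sizes taken from the table above (both are approximate).
const approxBits = avgBitsPerWeight(18, 5);
console.log(approxBits.toFixed(2)); // "4.44"
```

The average lands well above 2 bits because only `gate_proj` and `up_proj` use the 2-bit base; the more sensitive tensors are stored at higher precision.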

## All Variants

| Repo | GGUF Equivalent | Size | Decode (tok/s) | Speedup vs BF16 |
|---|---|---|---|---|
| [Brooooooklyn/Qwen3.5-9B-UD-Q2_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.5-9B-UD-Q2_K_XL-mlx) | UD-Q2_K_XL | 5 GB | TBD | TBD |
| [Brooooooklyn/Qwen3.5-9B-UD-Q3_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.5-9B-UD-Q3_K_XL-mlx) | UD-Q3_K_XL | 6 GB | TBD | TBD |
| [Brooooooklyn/Qwen3.5-9B-UD-Q4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.5-9B-UD-Q4_K_XL-mlx) | UD-Q4_K_XL | 8 GB | TBD | TBD |
| [Brooooooklyn/Qwen3.5-9B-UD-Q5_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.5-9B-UD-Q5_K_XL-mlx) | UD-Q5_K_XL | 9 GB | TBD | TBD |
| [Brooooooklyn/Qwen3.5-9B-UD-Q6_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.5-9B-UD-Q6_K_XL-mlx) | UD-Q6_K_XL | 9 GB | TBD | TBD |
| [Brooooooklyn/Qwen3.5-9B-UD-Q8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.5-9B-UD-Q8_K_XL-mlx) | UD-Q8_K_XL | 10 GB | TBD | TBD |

Decode speed is benchmarked on an Apple M3 Max (128 GB) in multi-turn chat, measuring steady-state decode at turn 4. Entries marked TBD have not been measured yet.

## Per-Tensor Bit Assignments (N=2)

| Weight | Bits | Rationale |
|---|---|---|
| `embed_tokens` | 4-bit | KLD ~0.15 — very low sensitivity |
| `lm_head` | 5-bit | KLD ~0.05 — safest tensor |
| `self_attn.q/k/v_proj` | 4-bit + AWQ | KLD ~1.5–2.9, AWQ via layernorm |
| `linear_attn.in_proj_qkv/z` | 4-bit + AWQ | KLD ~2.9, AWQ via layernorm |
| `self_attn.o_proj` | **bf16** | NOT AWQ-correctable |
| `linear_attn.out_proj` | **bf16** | KLD ~6.0 — worst tensor |
| `down_proj` | 3-bit | "Slightly more sensitive" |
| `gate_proj`, `up_proj` | 2-bit | "Generally ok" at low bits |
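
The table can be read as an ordered set of name-matching rules. The sketch below encodes it that way for illustration; `assignmentFor`, the `Assignment` type, and the example tensor names are assumptions made here and do not reflect mlx-node's actual recipe code.

```typescript
// Hypothetical lookup that mirrors the table above (not mlx-node's implementation).
type Assignment = { bits: number | 'bf16'; awq: boolean };

// Ordered rules: the first pattern that matches the tensor name wins.
const RULES: Array<[RegExp, Assignment]> = [
  [/embed_tokens/, { bits: 4, awq: false }],
  [/lm_head/, { bits: 5, awq: false }],
  [/self_attn\.(q|k|v)_proj/, { bits: 4, awq: true }],
  [/linear_attn\.in_proj_(qkv|z)/, { bits: 4, awq: true }],
  [/self_attn\.o_proj/, { bits: 'bf16', awq: false }],
  [/linear_attn\.out_proj/, { bits: 'bf16', awq: false }],
  [/down_proj/, { bits: 3, awq: false }],
  [/(gate|up)_proj/, { bits: 2, awq: false }],
];

// Returns null for tensors the table does not list (for example, norm weights).
function assignmentFor(tensorName: string): Assignment | null {
  for (const [pattern, assignment] of RULES) {
    if (pattern.test(tensorName)) return assignment;
  }
  return null;
}

// Example tensor names are illustrative; real checkpoint keys may differ.
console.log(assignmentFor('model.layers.0.mlp.gate_proj.weight'));
console.log(assignmentFor('model.layers.3.self_attn.o_proj.weight'));
```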

## Quantization Strategy

Based on the [Unsloth Dynamic 2.0](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) per-tensor KLD analysis. Sensitive tensors get higher bit widths, with AWQ correction where it applies, while FFN weights are quantized aggressively. The imatrix-based AWQ pre-scaling amplifies important weight channels and fuses the inverse scales into the preceding layer norms, so it adds no inference overhead.

**AWQ-correctable** projections (`q/k/v_proj`, `in_proj_qkv/z`) are quantized at 4-bit, with their AWQ scales fused into `input_layernorm`. **Non-AWQ-correctable** projections (`o_proj`, `out_proj`) are kept at bf16.
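
Why the fusion costs nothing at inference time can be checked numerically: multiplying each input column of a projection by a scale and dividing the matching norm gain by the same scale leaves the output unchanged. The sketch below is a minimal illustration in plain TypeScript, assuming an RMSNorm with a simple multiplicative gain; the function names and the scale values are made up for the example, and a real run would derive the scales from imatrix statistics before quantizing the scaled weights.

```typescript
// RMSNorm with a per-channel gain (assumed plain multiplicative form).
function rmsNorm(x: number[], gain: number[], eps = 1e-6): number[] {
  const meanSq = x.reduce((acc, v) => acc + v * v, 0) / x.length;
  const inv = 1 / Math.sqrt(meanSq + eps);
  return x.map((v, i) => v * inv * gain[i]);
}

// y = W h, with W stored as rows of length inFeatures.
function matVec(w: number[][], h: number[]): number[] {
  return w.map((row) => row.reduce((acc, wij, j) => acc + wij * h[j], 0));
}

// Fold per-input-channel scales into the weight columns and the inverse into the norm gain.
function fuseAwqScales(w: number[][], gain: number[], s: number[]) {
  return {
    w: w.map((row) => row.map((wij, j) => wij * s[j])),
    gain: gain.map((g, j) => g / s[j]),
  };
}

const x = [0.5, -1.0, 2.0];
const gain = [1.0, 0.8, 1.2];
const w = [
  [0.1, -0.2, 0.3],
  [0.4, 0.5, -0.6],
];
const s = [2.0, 0.5, 4.0]; // illustrative scales only

const before = matVec(w, rmsNorm(x, gain));
const fused = fuseAwqScales(w, gain, s);
const after = matVec(fused.w, rmsNorm(x, fused.gain));

// Largest absolute difference between the two outputs (floating-point noise only).
const maxDiff = Math.max(...before.map((v, i) => Math.abs(v - after[i])));
console.log(maxDiff < 1e-9);
```

The equivalence holds before quantization; the benefit comes afterwards, when the amplified channels lose less precision to rounding.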

## Usage

```typescript
import { loadModel } from '@mlx-node/lm';

const model = await loadModel('./Qwen3.5-9B-UD-Q2_K_XL-mlx');

const result = await model.chat(
  [{ role: 'user', content: 'Explain the hybrid attention mechanism in Qwen3.5.' }],
  { maxNewTokens: 2048, temperature: 0.6, enableThinking: false },
);
console.log(result.text);
```

## How It Was Made

```bash
mlx convert \
  -i Qwen3.5-9B \
  -o Qwen3.5-9B-UD-Q2_K_XL-mlx \
  -q --q-bits 2 --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf
```

## Acknowledgments

- **[Unsloth](https://unsloth.ai)** — Quantization strategy based on their [per-layer KLD benchmarks](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) and Dynamic 2.0 methodology
- **[Qwen Team](https://huggingface.co/Qwen)** — For the Qwen3.5 model family
- **[Apple MLX](https://github.com/ml-explore/mlx)** — For the Metal-accelerated ML framework

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) (inherited from base model).