---
license: apache-2.0
language:
  - en
  - zh
base_model: Qwen/Qwen3.5-9B
tags:
  - mlx
  - mlx-node
  - quantized
  - awq
  - 3-bit
  - qwen3.5
  - hybrid-attention
  - gated-delta-net
  - apple-silicon
  - unsloth-dynamic
library_name: mlx-node
quantized_by: mlx-node
pipeline_tag: text-generation
model_type: qwen3_5
---

# Qwen3.5-9B — Unsloth Dynamic AWQ (3-bit, mlx-node)

Mixed-precision 3/4/5/6-bit quantization of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) for Apple Silicon, using the [**Unsloth Dynamic** quantization strategy](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) via [mlx-node](https://github.com/mlx-node/mlx-node).

| | Original (BF16) | This Model |
|---|---|---|
| **Parameters** | 9,653,104,368 | 9,653,104,368 |
| **Size** | 18 GB | **6.4 GB** |
| **Format** | SafeTensors (4 shards) | SafeTensors (single file, 1100 tensors) |
| **Precision** | BF16 uniform | Mixed 3/4/5/6-bit + BF16 |
| **Reduction** | — | **64%** |

## Performance

Tested on an Apple M3 Max (128 GB) with [mlx-node](https://github.com/mlx-node/mlx-node):

| Model | Size | Decode (tok/s) | Speedup |
|---|---|---|---|
| BF16 (unquantized) | 18 GB | 20.5–21.0 | baseline |
| **This model (Unsloth, 3-bit base)** | **6.4 GB** | **54.1–54.6** | **~2.6x faster** |

Decode is memory-bandwidth bound on Apple Silicon, so transferring fewer bytes per token translates directly into higher throughput. Embeddings and lm_head stay quantized in memory (5-bit and 6-bit respectively) and use `quantized_matmul` on the forward pass, so there is no dequantize-at-load overhead. Attention q/k/v and SSM input projections are quantized at 5-bit with imatrix AWQ pre-scaling for near-lossless quality. Attention o_proj and SSM out_proj are kept at BF16 because they have no preceding norm into which the AWQ correction can be fused.
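
As a rough sanity check on the bandwidth-bound claim, the sketch below compares the ideal speedup implied by the size ratio with the measured throughput. It uses only the figures from the tables above; the variable and helper names are illustrative and not part of mlx-node.

```typescript
// Figures taken from the size and performance tables above.
const bf16 = { sizeGB: 18, tokPerSec: [20.5, 21.0] as const };
const quantized = { sizeGB: 6.4, tokPerSec: [54.1, 54.6] as const };

// Midpoint of a reported throughput range.
const midpoint = (range: readonly [number, number]): number => (range[0] + range[1]) / 2;

// If decode were limited purely by bytes read per token, throughput would
// scale inversely with model size.
const idealSpeedup = bf16.sizeGB / quantized.sizeGB;
const measuredSpeedup = midpoint(quantized.tokPerSec) / midpoint(bf16.tokPerSec);

// Fraction of the ideal bandwidth-bound speedup that is actually realized.
const efficiency = measuredSpeedup / idealSpeedup;

console.log(
  `ideal ${idealSpeedup.toFixed(2)}x, measured ${measuredSpeedup.toFixed(2)}x, ` +
    `efficiency ${(efficiency * 100).toFixed(0)}%`,
);
```

The measured speedup lands within roughly 10% of the size ratio, which is consistent with decode being dominated by weight reads rather than compute.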

## Quantization Strategy

This model's quantization recipe is based on the [Unsloth](https://unsloth.ai) team's extensive per-layer KL divergence benchmarking of the Qwen3.5 architecture. Their work, published as [Unsloth Dynamic 2.0](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks), is to our knowledge the most thorough public analysis of how Qwen3.5's hybrid GatedDeltaNet + full attention design responds to quantization, and it is the foundation for every decision in this model.

### Why Qwen3.5 Needs Special Treatment

Qwen3.5 is not a standard transformer. It uses a **hybrid architecture**: 24 GatedDeltaNet linear attention layers + 8 standard full attention layers (`full_attention_interval=4`). Standard uniform quantization treats all layers equally, but Unsloth's benchmarks revealed that this architecture has **fundamentally different sensitivity profiles** across layer types:

- Quantizing `ssm_out` (`linear_attn.out_proj`) at Q2_K **"does dramatically worse"** — KLD spikes far beyond other components
- Attention tensors (`self_attn.*`) are **"especially sensitive for hybrid architectures"** — more so than in pure-attention models like LLaMA
- Attention gates (`linear_attn.in_proj_z`) — MXFP4 **"performs poorly"** on these
- FFN gate/up projections are **"generally ok to quantize to 3-bit"** — the only layers that tolerate aggressive compression well
- FFN `down_proj` is **"slightly more sensitive"** than gate/up — benefits from an extra bit

The key insight from Unsloth's work: **it's better to quantize sensitive layers at higher bits (5-bit with imatrix) and aggressively quantize the rest (3-bit), than to uniformly quantize everything at a middling bit-width.** Their Dynamic models consistently sit on the **Pareto frontier** for 99.9% KL divergence vs model size, outperforming uniform quantization at every size point.

### Unsloth imatrix: The Calibration Foundation

The second pillar of Unsloth's approach is their **importance matrix (imatrix)** — per-channel calibration data that tells the quantizer which channels within each tensor carry the most information.

Standard imatrix calibrations (used by most GGUF quantizers) run the model on **Wikipedia-512** — short encyclopedia passages. Unsloth instead calibrates on **long-context chat, coding, and tool-calling data**, which better represents how these models are actually used. From Unsloth's findings:

- *"Imatrix definitely helps reduce KLD & PPL"* across all bit-widths
- *"Imatrix generally helps on lower bits, and works on all quants and bit widths"*
- SSM output at 2-bits was *"really bad"* without imatrix, but imatrix *"reduces the 99.9% KLD by a lot"*
- Trade-off: I-quants make *"inference 5-10% slower"*, but the quality gain is substantial

When an imatrix is provided to mlx-node's conversion pipeline, it applies **AWQ-style channel pre-scaling** before quantization: important input channels (high activation magnitude) are amplified to make them more quantization-resistant, while less important channels are shrunk. The inverse scales are fused into preceding layer norms, so there is **zero inference overhead** — the quality improvement is free at runtime.
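
The "zero inference overhead" property follows from simple algebra: scaling weight column `j` by `s_j` and dividing the preceding norm weight by `s_j` leaves the layer output unchanged. Below is a minimal sketch of that identity on plain arrays. The scale rule (`importance^alpha` with `alpha = 0.5`, normalized to a geometric mean of 1) and all function names are assumptions for illustration; mlx-node's actual rule may differ, and Qwen3.5-specific details such as the norm +1.0 shift are ignored here.

```typescript
type Matrix = number[][]; // [outFeatures][inFeatures]

// Hypothetical scale rule: s_j = importance_j^alpha, normalized so the
// geometric mean of the scales is 1. Important channels get s_j > 1.
function awqScales(importance: number[], alpha = 0.5): number[] {
  const raw = importance.map((v) => Math.pow(Math.max(v, 1e-8), alpha));
  const logMean = raw.reduce((acc, v) => acc + Math.log(v), 0) / raw.length;
  const norm = Math.exp(logMean);
  return raw.map((v) => v / norm);
}

// Scale each input channel (column) of the weight and fold the inverse
// scale into the preceding norm weight.
function applyAwq(weight: Matrix, normWeight: number[], scales: number[]) {
  return {
    weight: weight.map((row) => row.map((w, j) => w * scales[j])),
    normWeight: normWeight.map((g, j) => g / scales[j]),
  };
}

// y = W . (normWeight * xHat), where xHat is the already-normalized input.
function forward(weight: Matrix, normWeight: number[], xHat: number[]): number[] {
  return weight.map((row) => row.reduce((acc, w, j) => acc + w * normWeight[j] * xHat[j], 0));
}

const weight: Matrix = [
  [0.2, -0.5, 0.1],
  [0.7, 0.3, -0.4],
];
const normWeight = [1.0, 0.9, 1.1];
const xHat = [0.5, -1.2, 2.0];
const importance = [4.0, 1.0, 0.25]; // stand-in for per-channel imatrix values

const scales = awqScales(importance);
const scaled = applyAwq(weight, normWeight, scales);
const before = forward(weight, normWeight, xHat);
const after = forward(scaled.weight, scaled.normWeight, xHat);

// Largest output difference; expected to be at floating-point noise level.
const maxDiff = Math.max(...before.map((v, i) => Math.abs(v - after[i])));
console.log({ scales, maxDiff });
```

Before quantization the two forward passes agree up to rounding. The benefit only appears once the scaled weight is quantized, because the amplified channels then occupy more of each group's quantization range.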

### Per-Layer Decisions

Based on Unsloth's per-tensor 99.9% KLD analysis, grouped by component:

| Component | Precision | Count | Unsloth Finding (99.9% KLD) |
|---|---|---|---|
| `self_attn.{q,k,v}_proj` | **5-bit** affine (gs=64) + AWQ | 24 tensors | KLD ~1.5–2.9 — "Especially sensitive for hybrid architectures"; AWQ-corrected via input_layernorm |
| `self_attn.o_proj` | **BF16** (skip) | 8 tensors | KLD ~1.5 — no preceding norm for AWQ correction |
| `linear_attn.in_proj_qkv` | **5-bit** affine (gs=64) + AWQ | 24 tensors | KLD ~2.9 — SSM input projection; AWQ-corrected via input_layernorm |
| `linear_attn.in_proj_z` | **5-bit** affine (gs=64) + AWQ | 24 tensors | KLD ~1.5 — "Performs poorly with MXFP4"; AWQ-corrected via input_layernorm |
| `linear_attn.out_proj` | **BF16** (skip) | 24 tensors | KLD ~6.0 at q2_k — worst tensor; no preceding norm for AWQ correction |
| `linear_attn.A_log` | **BF16** (skip) | 24 tensors | State-space dynamics — not quantizable |
| `linear_attn.conv1d` | **BF16** (skip) | 24 tensors | KLD ~0.05 — too small to quantize meaningfully |
| `linear_attn.in_proj_{a,b}` | **BF16** (skip) | 48 tensors | Low-rank projections — too small |
| `mlp.down_proj` | **4-bit** affine (gs=64) | 32 tensors | "Slightly more sensitive" than gate/up |
| `mlp.gate_proj` | **3-bit** affine (gs=64) | 32 tensors | "Generally ok to quantize to 3-bit" |
| `mlp.up_proj` | **3-bit** affine (gs=64) | 32 tensors | "Generally ok to quantize to 3-bit" |
| `embed_tokens` | **5-bit** affine (gs=64) | 1 tensor | KLD ~0.15 at q5_k — among least sensitive |
| `lm_head` | **6-bit** affine (gs=64) | 1 tensor | KLD ~0.05 at q5_k — safest tensor to quantize |
| Norms | **BF16** (skip) | ~130 tensors | Never quantized (standard practice) |

AWQ-correctable projections (q/k/v, in_proj_qkv/z) are quantized at 5-bit with imatrix AWQ pre-scaling via `input_layernorm`. Non-AWQ-correctable projections (o_proj, out_proj) are kept at BF16: their inputs come from attention/GDN computation, not from a norm layer, so the inverse scales have nowhere to be fused and AWQ cannot be applied. An imatrix is **required** for the Unsloth recipe.
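
The Per-Layer Decisions table can be read as a predicate from tensor name to quantization decision. The sketch below is a hypothetical rendering of that table, not mlx-node's actual implementation; the `Decision` type, the `classifyTensor` name, and the `model.layers.N...` tensor naming are assumptions.

```typescript
type Bits = 3 | 4 | 5 | 6;

type Decision =
  | { kind: 'skip' } // keep the tensor in BF16
  | { kind: 'quantize'; bits: Bits; groupSize: 64; awq: boolean };

const quant = (bits: Bits, awq = false): Decision => ({ kind: 'quantize', bits, groupSize: 64, awq });

// Hypothetical predicate mirroring the Per-Layer Decisions table.
function classifyTensor(name: string): Decision {
  if (/self_attn\.(q|k|v)_proj/.test(name)) return quant(5, true);
  if (/linear_attn\.in_proj_(qkv|z)/.test(name)) return quant(5, true);
  if (/mlp\.down_proj/.test(name)) return quant(4);
  if (/mlp\.(gate|up)_proj/.test(name)) return quant(3);
  if (/embed_tokens/.test(name)) return quant(5);
  if (/lm_head/.test(name)) return quant(6);
  // o_proj, out_proj, A_log, conv1d, in_proj_{a,b}, and norms stay BF16.
  return { kind: 'skip' };
}

for (const name of [
  'model.layers.3.self_attn.q_proj.weight',
  'model.layers.0.linear_attn.out_proj.weight',
  'model.layers.0.mlp.gate_proj.weight',
]) {
  console.log(name, classifyTensor(name));
}
```

Defaulting to `skip` means any tensor the rules do not recognize stays at full precision, which is the conservative failure mode for a mixed-precision recipe.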

### Comparison with Unsloth GGUF (UD-Q3_K_XL)

| Tensor | Unsloth UD-Q3_K_XL | Ours | Gap |
|---|---|---|---|
| attn q/k/v | Q5_K + imatrix | 5-bit affine + AWQ | Small (AWQ compensates) |
| in_proj_qkv/z | Q5_K + imatrix | 5-bit affine + AWQ | Small |
| o_proj | Q5_K + imatrix | bf16 | We're larger but lossless |
| out_proj | Q5_K + imatrix | bf16 | We're larger but lossless |
| FFN gate/up | Q3_K + imatrix | 3-bit affine + AWQ | Moderate (K-quant > affine at 3-bit) |
| FFN down | Q4_K + imatrix | 4-bit affine + AWQ | Small |
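
The "affine (gs=64)" entries in the tables above refer to group-wise affine quantization: each group of 64 weights shares one scale and one bias. The sketch below is a textbook min/max version of that idea for illustration only; MLX's kernels may differ in detail, and the storage estimate assumes one 16-bit scale and one 16-bit bias per group.

```typescript
interface QuantizedGroup {
  codes: number[]; // integers in [0, 2^bits - 1]
  scale: number;
  bias: number;
}

// Map a group of weights onto 2^bits evenly spaced levels between min and max.
function quantizeGroup(values: number[], bits: number): QuantizedGroup {
  const levels = 2 ** bits - 1;
  const min = Math.min(...values);
  const max = Math.max(...values);
  const scale = max > min ? (max - min) / levels : 1; // constant group: any scale works
  const codes = values.map((v) => Math.round((v - min) / scale));
  return { codes, scale, bias: min };
}

// Reconstruct approximate weights: w ~ code * scale + bias.
function dequantizeGroup(group: QuantizedGroup): number[] {
  return group.codes.map((c) => c * group.scale + group.bias);
}

// Quantize and dequantize a flat weight row in groups of `groupSize`.
function roundTrip(row: number[], bits: number, groupSize = 64): number[] {
  const out: number[] = [];
  for (let i = 0; i < row.length; i += groupSize) {
    out.push(...dequantizeGroup(quantizeGroup(row.slice(i, i + groupSize), bits)));
  }
  return out;
}

// Storage cost per weight, assuming 32 bits of scale + bias metadata per group.
const effectiveBits = (bits: number, groupSize = 64, metaBits = 32): number => bits + metaBits / groupSize;

// Deterministic stand-in weights in [-0.1, 0.1].
const row = Array.from({ length: 128 }, (_, i) => Math.sin(i * 0.37) * 0.1);
const maxError = (bits: number): number =>
  Math.max(...roundTrip(row, bits).map((v, i) => Math.abs(v - row[i])));

for (const bits of [3, 4, 5]) {
  console.log(`${bits}-bit: max error ${maxError(bits).toExponential(2)}, ~${effectiveBits(bits)} bits/weight`);
}
```

The worst-case rounding error per group is half a quantization step, so each extra bit roughly halves it. That is the trade the recipe makes when it spends 5 bits on sensitive projections and 3 bits on FFN gate/up.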

## Architecture

Qwen3.5-9B is a decoder-only transformer with a hybrid attention design:

| Parameter | Value |
|---|---|
| Hidden size | 4,096 |
| Layers | 32 (24 linear + 8 full attention) |
| Attention heads | 16 (4 KV heads, GQA 4:1) |
| Head dimension | 256 |
| Intermediate size | 12,288 |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |
| RoPE | M-RoPE with `mrope_section=[11, 11, 10]`, theta=10M |
| Activation | SiLU |

**Layer pattern** (repeating): `[linear, linear, linear, full, linear, linear, linear, full, ...]`

- **Linear attention layers** use GatedDeltaNet: depthwise Conv1d + gated delta recurrence (state-space model)
- **Full attention layers** use standard grouped-query attention with KV caching
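
The layer pattern and its memory consequence can be worked out from the table above. This is a small sketch with hypothetical helper names; the KV cache estimate assumes a BF16 (2-byte) cache and counts only the full attention layers.

```typescript
type LayerType = 'linear' | 'full';

// Every `interval`-th layer (1-based) is full attention; the rest are linear.
function layerTypes(numLayers: number, interval: number): LayerType[] {
  return Array.from({ length: numLayers }, (_, i) => ((i + 1) % interval === 0 ? 'full' : 'linear'));
}

// Bytes of K and V stored per token across all full attention layers.
function kvBytesPerToken(fullLayers: number, kvHeads: number, headDim: number, bytesPerElement = 2): number {
  return 2 * fullLayers * kvHeads * headDim * bytesPerElement;
}

const types = layerTypes(32, 4); // 32 layers, full_attention_interval = 4
const fullLayers = types.filter((t) => t === 'full').length;
const linearLayers = types.length - fullLayers;

const perToken = kvBytesPerToken(fullLayers, 4, 256); // 4 KV heads, head dimension 256
const atMaxContext = perToken * 262_144;

console.log(types.slice(0, 8).join(', '));
// linear, linear, linear, full, linear, linear, linear, full
console.log(`${linearLayers} linear + ${fullLayers} full`);
// 24 linear + 8 full
console.log(`${perToken / 1024} KiB per token, ${atMaxContext / 2 ** 30} GiB at 262,144 tokens`);
// 32 KiB per token, 8 GiB at 262,144 tokens
```

Under these assumptions, only 8 of the 32 layers contribute to KV cache growth. The GatedDeltaNet layers carry a recurrent state whose size does not depend on sequence length.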

## Usage

### With mlx-node (TypeScript/JavaScript)

```typescript
import { createToolDefinition, loadModel } from '@mlx-node/lm';

const model = await loadModel('./qwen3.5-9B-unsloth');

// Chat (single-shot)
const result = await model.chat(
  [{ role: 'user', content: 'Explain the hybrid attention mechanism in Qwen3.5.' }],
  { maxNewTokens: 2048, temperature: 0.6, enableThinking: false },
);
console.log(result.text);

// Streaming (AsyncGenerator)
for await (const event of model.chatStream(
  [{ role: 'user', content: 'Write a haiku about coding.' }],
  { maxNewTokens: 512, temperature: 0.7 },
)) {
  if (!event.done) {
    process.stdout.write(event.text);
  } else {
    console.log('\nTokens:', event.numTokens);
  }
}

// Tool calling
const tools = [
  createToolDefinition(
    'get_weather',
    'Get weather for a location',
    { location: { type: 'string', description: 'City name' } },
    ['location'],
  ),
];

// Named `toolResult` so it does not redeclare `result` in the same scope
const toolResult = await model.chat(
  [{ role: 'user', content: 'What is the weather in Tokyo?' }],
  { tools, maxNewTokens: 2048 },
);
for (const call of toolResult.toolCalls) {
  console.log(call.name, call.arguments);
}
```

## How It Was Made

Converted from [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) official SafeTensors using mlx-node's conversion pipeline:

```bash
mlx convert \
  -i .cache/models/qwen3.5-9B \
  -o .cache/models/qwen3.5-9B-unsloth \
  -q --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf
```

The `--q-recipe unsloth` flag applies the differential quantization strategy described above. The recipe defaults to 3-bit base (override with `--q-bits`). The `--imatrix-path` is **required** for the unsloth recipe — it applies AWQ-style channel pre-scaling before quantization using Unsloth's importance matrix. The conversion pipeline:

1. Loads BF16 SafeTensors/GGUF weights via mmap (near-instant)
2. Applies Qwen3.5-specific weight sanitization (norm +1.0 shift, dtype handling)
3. Applies imatrix AWQ pre-scaling: important input channels are amplified (more quantization-resistant) while less important channels are shrunk, with inverse scales fused into preceding layer norms
4. Runs the Unsloth recipe predicate to classify each tensor
5. Quantizes attn q/k/v + SSM in_proj to 5-bit (AWQ-corrected), MLP gate/up to 3-bit, down to 4-bit, embed to 5-bit, lm_head to 6-bit
6. Skips o_proj, out_proj, norms, A_log, conv1d, and low-rank projections (kept BF16)
7. Writes single-file SafeTensors with per-layer quantization metadata in `config.json`

Unsloth's imatrix uses long-context chat, coding, and tool-calling calibration data rather than standard Wikipedia-512 contexts. From Unsloth's findings: imatrix *"definitely helps reduce KLD & PPL"* across all bit-widths, and is especially impactful at lower bits (3-bit and below).

## Files

| File | Size | Description |
|---|---|---|
| `model.safetensors` | 6.4 GB | Mixed-precision model weights |
| `config.json` | 30 KB | Model config + per-layer quantization overrides |
| `tokenizer.json` | 12 MB | HuggingFace tokenizer (248K vocab) |
| `tokenizer_config.json` | 16 KB | Tokenizer settings + Jinja2 chat template |
| `vocab.json` | 6.4 MB | Vocabulary mapping |
| `merges.txt` | 3.2 MB | BPE merges |

## Chat Template

The official Qwen3.5 chat template is preserved unmodified, supporting:
- Multi-turn conversation
- System messages
- Tool calling (`<tool_call>` / `</tool_call>` tags)
- Chain-of-thought reasoning (`<think>` / `</think>` tags)
- Image/video content placeholders (for VLM variants)

**Template compatibility fix:** The official Qwen3.5 template uses `raise_exception()` for input validation (8 call sites), which is not a built-in function in most Jinja2-compatible renderers. Unsloth [identified and fixed](https://unsloth.ai/docs/models/qwen3.5) chat template issues affecting tool-calling across all Qwen3.5 variants. mlx-node takes a complementary approach — rather than patching the template, we register `raise_exception` as a native function in our minijinja renderer, so the official template works as-is without modification.

## Acknowledgments

- **[Unsloth](https://unsloth.ai)** ([GitHub](https://github.com/unslothai/unsloth)) — The quantization strategy in this model is directly based on Unsloth's [per-layer KL divergence benchmarks](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) and their Dynamic 2.0 quantization methodology. Their work on imatrix calibration with long-context chat and tool-calling data, and their systematic analysis of layer sensitivity in hybrid GatedDeltaNet architectures, made this recipe possible. We also use their published imatrix GGUF files for AWQ pre-scaling when converting from GGUF sources.
- **[Qwen Team](https://huggingface.co/Qwen)** — For the Qwen3.5 model family and the hybrid attention architecture
- **[Apple MLX](https://github.com/ml-explore/mlx)** — For the Metal-accelerated ML framework powering inference

## License

This model inherits the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0) from the base Qwen3.5-9B model.