---
license: apache-2.0
language:
- en
- zh
base_model: Qwen/Qwen3.5-9B
tags:
- mlx
- mlx-node
- quantized
- awq
- 3-bit
- qwen3.5
- hybrid-attention
- gated-delta-net
- apple-silicon
- unsloth-dynamic
library_name: mlx-node
quantized_by: mlx-node
pipeline_tag: text-generation
model_type: qwen3_5
---

# Qwen3.5-9B: Unsloth Dynamic AWQ (3-bit, mlx-node)

Mixed-precision 3/4/5/6-bit quantization of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) for Apple Silicon, using the [**Unsloth Dynamic** quantization strategy](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) via [mlx-node](https://github.com/mlx-node/mlx-node).

| | Original (BF16) | This Model |
|---|---|---|
| **Parameters** | 9,653,104,368 | 9,653,104,368 |
| **Size** | 18 GB | **6.4 GB** |
| **Format** | SafeTensors (4 shards) | SafeTensors (single file, 1100 tensors) |
| **Precision** | BF16 uniform | Mixed 3/4/5/6-bit + BF16 |
| **Reduction** | n/a | **64%** |

## Performance

Tested on Apple Silicon M3 Max 128GB with [mlx-node](https://github.com/mlx-node/mlx-node):

| Model | Size | Decode (tok/s) | Speedup |
|---|---|---|---|
| BF16 (unquantized) | 18 GB | 20.5–21.0 | baseline |
| **This model (Unsloth, 3-bit base)** | **6.4 GB** | **54.1–54.6** | **~2.6x faster** |

Decode is memory-bandwidth bound on Apple Silicon, so fewer bytes to transfer per token directly translates to higher throughput.

Embeddings and lm_head stay quantized in memory (5/6-bit) and use `quantized_matmul` on forward, with no dequantize-at-load overhead. Attention q/k/v and SSM input projections are quantized at 5-bit with imatrix AWQ pre-scaling for near-lossless quality. Attention o_proj and SSM out_proj are kept at BF16 (no preceding norm for AWQ correction).

## Quantization Strategy

This model's quantization recipe is based on the [Unsloth](https://unsloth.ai) team's extensive per-layer KL divergence benchmarking of the Qwen3.5 architecture.
Their work, published as [Unsloth Dynamic 2.0](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks), is the most thorough public analysis of how Qwen3.5's hybrid GatedDeltaNet + full attention design responds to quantization, and the foundation for every decision in this model.

### Why Qwen3.5 Needs Special Treatment

Qwen3.5 is not a standard transformer. It uses a **hybrid architecture**: 24 GatedDeltaNet linear attention layers + 8 standard full attention layers (`full_attention_interval=4`). Standard uniform quantization treats all layers equally, but Unsloth's benchmarks revealed that this architecture has **fundamentally different sensitivity profiles** across layer types:

- Quantizing `ssm_out` (`linear_attn.out_proj`) at Q2_K **"does dramatically worse"**: KLD spikes far beyond other components
- Attention tensors (`self_attn.*`) are **"especially sensitive for hybrid architectures"**, more so than in pure-attention models like LLaMA
- Attention gates (`linear_attn.in_proj_z`): MXFP4 **"performs poorly"** on these
- FFN gate/up projections are **"generally ok to quantize to 3-bit"**, the only layers that tolerate aggressive compression well
- FFN `down_proj` is **"slightly more sensitive"** than gate/up and benefits from an extra bit

The key insight from Unsloth's work: **it's better to quantize sensitive layers at higher bits (5-bit with imatrix) and aggressively quantize the rest (3-bit), than to uniformly quantize everything at a middling bit-width.** Their Dynamic models consistently sit on the **Pareto frontier** for 99.9% KL divergence vs model size, outperforming uniform quantization at every size point.

### Unsloth imatrix: The Calibration Foundation

The second pillar of Unsloth's approach is their **importance matrix (imatrix)**: per-channel calibration data that tells the quantizer which channels within each tensor carry the most information.
Standard imatrix calibrations (used by most GGUF quantizers) run the model on **Wikipedia-512**, a set of short encyclopedia passages. Unsloth instead calibrates on **long-context chat, coding, and tool-calling data**, which better represents how these models are actually used.

From Unsloth's findings:

- *"Imatrix definitely helps reduce KLD & PPL"* across all bit-widths
- *"Imatrix generally helps on lower bits, and works on all quants and bit widths"*
- SSM output at 2 bits was *"really bad"* without imatrix, but imatrix *"reduces the 99.9% KLD by a lot"*
- Trade-off: GGUF I-quants make *"inference 5-10% slower"*, but the quality gain is substantial

When an imatrix is provided to mlx-node's conversion pipeline, it applies **AWQ-style channel pre-scaling** before quantization: important input channels (high activation magnitude) are amplified to make them more quantization-resistant, while less important channels are shrunk. The inverse scales are fused into preceding layer norms, so there is **zero inference overhead** and the quality improvement is free at runtime.
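
The pre-scaling identity described above can be checked numerically. The following is a minimal sketch, not mlx-node's implementation: plain arrays stand in for tensors, the norm gain is treated as a simple multiplicative vector, the quantization step itself is omitted, and `channelScales`, its `alpha` exponent, and the geometric-mean normalization are hypothetical choices made only for illustration.

```typescript
// Sketch only: function names, `alpha`, and the normalization are assumptions.
type Matrix = number[][]; // shape: [outFeatures][inFeatures]

/** Per-input-channel scales derived from importance values (hypothetical formula). */
function channelScales(importance: number[], alpha = 0.5): number[] {
  const raw = importance.map((v) => Math.pow(Math.max(v, 1e-8), alpha));
  // Divide by the geometric mean so scales are centred on 1:
  // important channels end up above 1, unimportant ones below 1.
  const logMean = raw.reduce((acc, v) => acc + Math.log(v), 0) / raw.length;
  const geoMean = Math.exp(logMean);
  return raw.map((v) => v / geoMean);
}

/** Multiply input channel (column) j of the weight by scales[j]. */
function scaleWeightColumns(weight: Matrix, scales: number[]): Matrix {
  return weight.map((row) => row.map((w, j) => w * scales[j]));
}

/** Fold the inverse scales into the preceding norm's gain vector. */
function fuseIntoNormGain(gain: number[], scales: number[]): number[] {
  return gain.map((g, j) => g / scales[j]);
}

/** y = W * (normalized * gain), with the gain applied element-wise. */
function project(weight: Matrix, normalized: number[], gain: number[]): number[] {
  return weight.map((row) =>
    row.reduce((acc, w, j) => acc + w * normalized[j] * gain[j], 0),
  );
}

// Made-up example values: channel 0 is the most important, channel 2 the least.
const importance = [16, 1, 0.25];
const weight: Matrix = [
  [0.5, -1.0, 2.0],
  [1.5, 0.25, -0.75],
];
const gain = [1.0, 0.8, 1.2];
const normalized = [0.3, -1.1, 0.7];

const scales = channelScales(importance);
const before = project(weight, normalized, gain);
const after = project(
  scaleWeightColumns(weight, scales),
  normalized,
  fuseIntoNormGain(gain, scales),
);

console.log('scales:', scales);
console.log('before:', before);
console.log('after: ', after);
```

In full precision the two outputs agree up to floating-point rounding, which is why the scaling adds no runtime cost. The benefit only appears once the scaled weight is quantized, since the amplified channels then occupy more of each quantization group's range.
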
### Per-Layer Decisions

Based on Unsloth's per-tensor 99.9% KLD analysis (sorted by sensitivity, worst → best):

| Component | Precision | Count | Unsloth Finding (99.9% KLD) |
|---|---|---|---|
| `self_attn.{q,k,v}_proj` | **5-bit** affine (gs=64) + AWQ | 24 tensors | KLD ~1.5–2.9; "Especially sensitive for hybrid architectures"; AWQ-corrected via input_layernorm |
| `self_attn.o_proj` | **BF16** (skip) | 8 tensors | KLD ~1.5; no preceding norm for AWQ correction |
| `linear_attn.in_proj_qkv` | **5-bit** affine (gs=64) + AWQ | 24 tensors | KLD ~2.9; SSM input projection; AWQ-corrected via input_layernorm |
| `linear_attn.in_proj_z` | **5-bit** affine (gs=64) + AWQ | 24 tensors | KLD ~1.5; "Performs poorly with MXFP4"; AWQ-corrected via input_layernorm |
| `linear_attn.out_proj` | **BF16** (skip) | 24 tensors | KLD ~6.0 at q2_k; worst tensor; no preceding norm for AWQ correction |
| `linear_attn.A_log` | **BF16** (skip) | 24 tensors | State-space dynamics; not quantizable |
| `linear_attn.conv1d` | **BF16** (skip) | 24 tensors | KLD ~0.05; too small to quantize meaningfully |
| `linear_attn.in_proj_{a,b}` | **BF16** (skip) | 48 tensors | Low-rank projections; too small |
| `mlp.down_proj` | **4-bit** affine (gs=64) | 32 tensors | "Slightly more sensitive" than gate/up |
| `mlp.gate_proj` | **3-bit** affine (gs=64) | 32 tensors | "Generally ok to quantize to 3-bit" |
| `mlp.up_proj` | **3-bit** affine (gs=64) | 32 tensors | "Generally ok to quantize to 3-bit" |
| `embed_tokens` | **5-bit** affine (gs=64) | 1 tensor | KLD ~0.15 at q5_k; among least sensitive |
| `lm_head` | **6-bit** affine (gs=64) | 1 tensor | KLD ~0.05 at q5_k; safest tensor to quantize |
| Norms | **BF16** (skip) | ~130 tensors | Never quantized (standard practice) |

AWQ-correctable projections (q/k/v, in_proj_qkv/z) are quantized at 5-bit with imatrix AWQ pre-scaling via `input_layernorm`.
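
The table above can be read as a name-based rule applied to each tensor. Here is a minimal sketch of such a rule, assuming Hugging Face-style tensor names like `model.layers.3.self_attn.q_proj.weight`. The `Decision` type and `classifyTensor` function are hypothetical and only mirror the table; they are not mlx-node's actual recipe predicate.

```typescript
// Sketch only: names, rule order, and return shape are illustrative assumptions.
type Bits = 3 | 4 | 5 | 6;

type Decision =
  | { kind: 'quantize'; bits: Bits; groupSize: 64; awq: boolean }
  | { kind: 'skip' }; // tensor is kept in BF16

const quantize = (bits: Bits, awq = false): Decision => ({
  kind: 'quantize',
  bits,
  groupSize: 64,
  awq,
});

const SKIP: Decision = { kind: 'skip' };

function classifyTensor(name: string): Decision {
  // Sensitive projections fed directly by input_layernorm: 5-bit with AWQ.
  if (/self_attn\.(q|k|v)_proj/.test(name)) return quantize(5, true);
  if (/linear_attn\.in_proj_(qkv|z)/.test(name)) return quantize(5, true);

  // MLP: down_proj gets one extra bit compared with gate/up.
  if (/mlp\.down_proj/.test(name)) return quantize(4);
  if (/mlp\.(gate|up)_proj/.test(name)) return quantize(3);

  // Embedding and output head.
  if (name.includes('embed_tokens')) return quantize(5);
  if (name.includes('lm_head')) return quantize(6);

  // Everything else stays BF16: o_proj, out_proj, A_log, conv1d,
  // in_proj_a / in_proj_b, and all norms.
  return SKIP;
}

// Example tensor names (assumed layout).
for (const name of [
  'model.layers.3.self_attn.q_proj.weight',
  'model.layers.0.linear_attn.in_proj_z.weight',
  'model.layers.0.linear_attn.out_proj.weight',
  'model.layers.0.mlp.gate_proj.weight',
  'lm_head.weight',
]) {
  console.log(name, classifyTensor(name));
}
```

Falling through to BF16 by default is a conservative choice for a sketch like this: a tensor that no rule recognizes is left unquantized rather than compressed by accident.
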
Non-AWQ-correctable projections (o_proj, out_proj) are kept at BF16: their inputs come from attention/GDN computation, not from a norm layer, so AWQ cannot be applied. An imatrix is **required** for the `unsloth` recipe.

### Comparison with Unsloth GGUF (UD-Q3_K_XL)

| Tensor | Unsloth UD-Q3_K_XL | Ours | Gap |
|---|---|---|---|
| attn q/k/v | Q5_K + imatrix | 5-bit affine + AWQ | Small (AWQ compensates) |
| in_proj_qkv/z | Q5_K + imatrix | 5-bit affine + AWQ | Small |
| o_proj | Q5_K + imatrix | BF16 | We're larger but lossless |
| out_proj | Q5_K + imatrix | BF16 | We're larger but lossless |
| FFN gate/up | Q3_K + imatrix | 3-bit affine + AWQ | Moderate (K-quant > affine at 3-bit) |
| FFN down | Q4_K + imatrix | 4-bit affine + AWQ | Small |

## Architecture

Qwen3.5-9B is a decoder-only transformer with a hybrid attention design:

| Parameter | Value |
|---|---|
| Hidden size | 4,096 |
| Layers | 32 (24 linear + 8 full attention) |
| Attention heads | 16 (4 KV heads, GQA 4:1) |
| Head dimension | 256 |
| Intermediate size | 12,288 |
| Vocab size | 248,320 |
| Max context | 262,144 tokens |
| RoPE | M-RoPE with `mrope_section=[11, 11, 10]`, theta=10M |
| Activation | SiLU |

**Layer pattern** (repeating): `[linear, linear, linear, full, linear, linear, linear, full, ...]`

- **Linear attention layers** use GatedDeltaNet: depthwise Conv1d + gated delta recurrence (state-space model)
- **Full attention layers** use standard grouped-query attention with KV caching

## Usage

### With mlx-node (TypeScript/JavaScript)

```typescript
import { loadModel, createToolDefinition } from '@mlx-node/lm';

const model = await loadModel('./qwen3.5-9B-unsloth');

// Chat (single-shot)
const result = await model.chat(
  [{ role: 'user', content: 'Explain the hybrid attention mechanism in Qwen3.5.' }],
  { maxNewTokens: 2048, temperature: 0.6, enableThinking: false },
);
console.log(result.text);

// Streaming (AsyncGenerator)
for await (const event of model.chatStream(
  [{ role: 'user', content: 'Write a haiku about coding.' }],
  { maxNewTokens: 512, temperature: 0.7 },
)) {
  if (!event.done) {
    process.stdout.write(event.text);
  } else {
    console.log('\nTokens:', event.numTokens);
  }
}

// Tool calling
const tools = [
  createToolDefinition(
    'get_weather',
    'Get weather for a location',
    { location: { type: 'string', description: 'City name' } },
    ['location'],
  ),
];

// Named separately from `result` above so the snippet runs as a single module.
const toolResult = await model.chat(
  [{ role: 'user', content: 'What is the weather in Tokyo?' }],
  { tools, maxNewTokens: 2048 },
);
for (const call of toolResult.toolCalls) {
  console.log(call.name, call.arguments);
}
```

## How It Was Made

Converted from [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) official SafeTensors using mlx-node's conversion pipeline:

```bash
mlx convert \
  -i .cache/models/qwen3.5-9B \
  -o .cache/models/qwen3.5-9B-unsloth \
  -q --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf
```

The `--q-recipe unsloth` flag applies the differential quantization strategy described above. The recipe defaults to 3-bit base (override with `--q-bits`). The `--imatrix-path` is **required** for the unsloth recipe: it applies AWQ-style channel pre-scaling before quantization using Unsloth's importance matrix.

The conversion pipeline:

1. Loads BF16 SafeTensors/GGUF weights via mmap (near-instant)
2. Applies Qwen3.5-specific weight sanitization (norm +1.0 shift, dtype handling)
3. Applies imatrix AWQ pre-scaling: important input channels are amplified (more quantization-resistant) while less important channels are shrunk, with inverse scales fused into preceding layer norms
4. Runs the Unsloth recipe predicate to classify each tensor
5. Quantizes attn q/k/v + SSM in_proj to 5-bit (AWQ-corrected), MLP gate/up to 3-bit, down to 4-bit, embed to 5-bit, lm_head to 6-bit
6. Skips o_proj, out_proj, norms, A_log, conv1d, and low-rank projections (kept BF16)
7. Writes single-file SafeTensors with per-layer quantization metadata in `config.json`

Unsloth's imatrix uses long-context chat, coding, and tool-calling calibration data rather than standard Wikipedia-512 contexts. From Unsloth's findings: imatrix *"definitely helps reduce KLD & PPL"* across all bit-widths, and is especially impactful at lower bits (3-bit and below).

## Files

| File | Size | Description |
|---|---|---|
| `model.safetensors` | 6.4 GB | Mixed-precision model weights |
| `config.json` | 30 KB | Model config + per-layer quantization overrides |
| `tokenizer.json` | 12 MB | HuggingFace tokenizer (248K vocab) |
| `tokenizer_config.json` | 16 KB | Tokenizer settings + Jinja2 chat template |
| `vocab.json` | 6.4 MB | Vocabulary mapping |
| `merges.txt` | 3.2 MB | BPE merges |

## Chat Template

The official Qwen3.5 chat template is preserved unmodified, supporting:

- Multi-turn conversation
- System messages
- Tool calling (`<tool_call>` / `</tool_call>` tags)
- Chain-of-thought reasoning (`<think>` / `</think>` tags)
- Image/video content placeholders (for VLM variants)

**Template compatibility fix:** The official Qwen3.5 template uses `raise_exception()` for input validation (8 call sites), which is not a built-in function in most Jinja2-compatible renderers. Unsloth [identified and fixed](https://unsloth.ai/docs/models/qwen3.5) chat template issues affecting tool-calling across all Qwen3.5 variants. mlx-node takes a complementary approach: rather than patching the template, we register `raise_exception` as a native function in our minijinja renderer, so the official template works as-is without modification.
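
To illustrate why registering the helper matters, here is a toy sketch of a renderer's table of callable globals. It is not minijinja's API and not mlx-node's code: `TemplateGlobals`, `tryCall`, and the validation message are hypothetical names and values used only for illustration.

```typescript
// Toy stand-in for a template engine's registry of callable globals.
// All names here are hypothetical; this is not minijinja's API.
type TemplateFunction = (...args: unknown[]) => unknown;

class TemplateGlobals {
  private readonly functions = new Map<string, TemplateFunction>();

  register(name: string, fn: TemplateFunction): void {
    this.functions.set(name, fn);
  }

  call(name: string, ...args: unknown[]): unknown {
    const fn = this.functions.get(name);
    if (!fn) {
      // Roughly what a renderer without the helper reports.
      throw new Error(`unknown function: ${name}`);
    }
    return fn(...args);
  }
}

// With the helper registered, the template's own message becomes the render error.
const globals = new TemplateGlobals();
globals.register('raise_exception', (message) => {
  throw new Error(String(message));
});

/** Calls a template function and returns either 'ok' or the error message. */
function tryCall(registry: TemplateGlobals, name: string, arg: string): string {
  try {
    registry.call(name, arg);
    return 'ok';
  } catch (err) {
    return (err as Error).message;
  }
}

const withoutHelper = tryCall(new TemplateGlobals(), 'raise_exception', 'example validation message');
const withHelper = tryCall(globals, 'raise_exception', 'example validation message');

console.log(withoutHelper); // unknown function: raise_exception
console.log(withHelper); // example validation message
```

The difference is in what the caller sees: without the helper, every validation branch fails with the same lookup error, while with it the template's intended message reaches the user.
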
## Acknowledgments

- **[Unsloth](https://unsloth.ai)** ([GitHub](https://github.com/unslothai/unsloth)): The quantization strategy in this model is directly based on Unsloth's [per-layer KL divergence benchmarks](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) and their Dynamic 2.0 quantization methodology. Their work on imatrix calibration with long-context chat and tool-calling data, and their systematic analysis of layer sensitivity in hybrid GatedDeltaNet architectures, made this recipe possible. We also use their published imatrix GGUF files for AWQ pre-scaling when converting from GGUF sources.
- **[Qwen Team](https://huggingface.co/Qwen)**: For the Qwen3.5 model family and the hybrid attention architecture
- **[Apple MLX](https://github.com/ml-explore/mlx)**: For the Metal-accelerated ML framework powering inference

## License

This model inherits the [Apache 2.0 license](https://www.apache.org/licenses/LICENSE-2.0) from the base Qwen3.5-9B model.