---
license: apache-2.0
language:
  - en
  - zh
base_model: Qwen/Qwen3.5-9B
tags:
  - mlx
  - mlx-node
  - quantized
  - awq
  - 2-bit
  - qwen3.5
  - hybrid-attention
  - gated-delta-net
  - apple-silicon
  - unsloth-dynamic
library_name: mlx-node
quantized_by: mlx-node
pipeline_tag: text-generation
model_type: qwen3_5
---

# Qwen3.5-9B: UD-Q2_K_XL (mlx-node)

2-bit base mixed-precision quantization of [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) for Apple Silicon, using the [**Unsloth Dynamic** quantization strategy](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) via [mlx-node](https://github.com/mlx-node/mlx-node).

| | Original (BF16) | This Model |
|---|---|---|
| **Size** | ~18 GB | **5 GB** |
| **Format** | SafeTensors (sharded) | SafeTensors (single file) |
| **Precision** | BF16 uniform | Mixed 2/3/4/5/8-bit + BF16 |

## All Variants

| Repo | GGUF Equivalent | Size | Decode (tok/s) | Speedup vs BF16 |
|---|---|---|---|---|
| [Brooooooklyn/Qwen3.5-9B-UD-Q2_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.5-9B-UD-Q2_K_XL-mlx) | UD-Q2_K_XL | 5 GB | TBD | TBD |
| [Brooooooklyn/Qwen3.5-9B-UD-Q3_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.5-9B-UD-Q3_K_XL-mlx) | UD-Q3_K_XL | 6 GB | TBD | TBD |
| [Brooooooklyn/Qwen3.5-9B-UD-Q4_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.5-9B-UD-Q4_K_XL-mlx) | UD-Q4_K_XL | 8 GB | TBD | TBD |
| [Brooooooklyn/Qwen3.5-9B-UD-Q5_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.5-9B-UD-Q5_K_XL-mlx) | UD-Q5_K_XL | 9 GB | TBD | TBD |
| [Brooooooklyn/Qwen3.5-9B-UD-Q6_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.5-9B-UD-Q6_K_XL-mlx) | UD-Q6_K_XL | 9 GB | TBD | TBD |
| [Brooooooklyn/Qwen3.5-9B-UD-Q8_K_XL-mlx](https://huggingface.co/Brooooooklyn/Qwen3.5-9B-UD-Q8_K_XL-mlx) | UD-Q8_K_XL | 10 GB | TBD | TBD |

Benchmarked on Apple M3 Max 128 GB, multi-turn chat (Turn 4 decode, steady-state).
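
The sizes in the table are larger than a uniform 2-bit (or 3-bit, and so on) file would be, because each variant mixes several bit widths. As a rough sanity check, the average bits per weight implied by each file size can be estimated as below. This is only a sketch: the parameter count of 9e9 is an assumption read off the model name (it is consistent with the ~18 GB BF16 size), sizes are treated as decimal gigabytes, and `avgBitsPerWeight` is a hypothetical helper, not part of mlx-node.

```typescript
// Rough average bits per weight implied by a checkpoint size.
// Assumption: sizes are decimal gigabytes (1 GB = 1e9 bytes).
function avgBitsPerWeight(sizeGB: number, paramCount: number): number {
  const totalBits = sizeGB * 1e9 * 8;
  return totalBits / paramCount;
}

// Assumption: round figure taken from the model name, not an exact count.
const PARAMS = 9e9;

// Sizes copied from the All Variants table (rounded to whole GB there).
const variants: Array<[string, number]> = [
  ['UD-Q2_K_XL', 5],
  ['UD-Q3_K_XL', 6],
  ['UD-Q4_K_XL', 8],
  ['UD-Q5_K_XL', 9],
  ['UD-Q6_K_XL', 9],
  ['UD-Q8_K_XL', 10],
];

for (const [name, sizeGB] of variants) {
  const bits = avgBitsPerWeight(sizeGB, PARAMS).toFixed(1);
  console.log(`${name}: ~${bits} bits/weight`);
}
```

Under these assumptions the 5 GB file works out to roughly 4.4 bits per weight on average. Because the table sizes are rounded to whole gigabytes and the file also stores quantization scales and BF16 tensors, the result is an upper-level estimate rather than an exact figure.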

## Per-Tensor Bit Assignments (N=2)

| Weight | Bits | Rationale |
|---|---|---|
| `embed_tokens` | 4-bit | KLD ~0.15; very low sensitivity |
| `lm_head` | 5-bit | KLD ~0.05; safest tensor |
| `self_attn.q/k/v_proj` | 4-bit + AWQ | KLD ~1.5–2.9, AWQ via layernorm |
| `linear_attn.in_proj_qkv/z` | 4-bit + AWQ | KLD ~2.9, AWQ via layernorm |
| `self_attn.o_proj` | **bf16** | NOT AWQ-correctable |
| `linear_attn.out_proj` | **bf16** | KLD ~6.0; worst tensor |
| `down_proj` | 3-bit | "Slightly more sensitive" |
| `gate_proj`, `up_proj` | 2-bit | "Generally ok" at low bits |

## Quantization Strategy

The recipe is based on [Unsloth Dynamic 2.0](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) per-tensor KLD analysis. Sensitive layers get higher bits with AWQ correction, while FFN weights are aggressively quantized.

imatrix-based AWQ pre-scaling amplifies important weight channels and fuses the inverse scales into the preceding layer norms, so it adds no inference overhead.

- **AWQ-correctable** projections (q/k/v, in_proj_qkv/z) are quantized at 4-bit, with the inverse scales fused into `input_layernorm`.
- **Non-AWQ-correctable** projections (o_proj, out_proj) are kept at bf16.

## Usage

```typescript
import { loadModel } from '@mlx-node/lm';

// Load the quantized model from a local directory
const model = await loadModel('./Qwen3.5-9B-UD-Q2_K_XL-mlx');

const result = await model.chat(
  [{ role: 'user', content: 'Explain the hybrid attention mechanism in Qwen3.5.' }],
  { maxNewTokens: 2048, temperature: 0.6, enableThinking: false },
);

console.log(result.text);
```

## How It Was Made

```bash
mlx convert \
  -i Qwen3.5-9B \
  -o Qwen3.5-9B-UD-Q2_K_XL-mlx \
  -q --q-bits 2 --q-recipe unsloth \
  --imatrix-path imatrix_unsloth.gguf
```

## Acknowledgments

- **[Unsloth](https://unsloth.ai)**: Quantization strategy based on their [per-layer KLD benchmarks](https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks) and Dynamic 2.0 methodology
- **[Qwen Team](https://huggingface.co/Qwen)**: For the Qwen3.5 model family
- **[Apple MLX](https://github.com/ml-explore/mlx)**: For the Metal-accelerated ML framework

## License

[Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0) (inherited from base model).