---
license: apache-2.0
base_model: Qwen/Qwen3-30B-A3B-Instruct-2507
base_model_relation: quantized
tags:
- gguf
- qwen3moe
- imatrix
- asymmetric-quantization
- 2-bit
- moe
pipeline_tag: text-generation
---

# Qwen3-30B-A3B-Instruct-2507: Asymmetric 2-bit-Expert GGUF (imatrix)

An **asymmetric, expert-aware** quantization of [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507) (arch `qwen3moe`, 128 routed experts, 8 active per token, ~3B active params, 48 layers).

The idea (the "antirez" insight): in a routed-MoE model the bulk of the weights live in the expert FFNs, and any given expert is only sparsely active. Push the routed experts to **2-bit** where the model is most redundant, keep the **attention and the embedding/output weights at higher precision** where error is most damaging, and use an **importance matrix (imatrix)** to guide which weights within each tensor are quantized most carefully. The result fits comfortably in **16 GB** with only a modest perplexity cost versus the standard 4-bit baseline.

## Asymmetric quantization scheme

| Tensor group | Type | Tensors |
|---|---|---|
| Routed expert **gate** (`ffn_gate_exps`) | `IQ2_S` | 48 |
| Routed expert **up** (`ffn_up_exps`) | `IQ2_S` | 48 |
| Routed expert **down** (`ffn_down_exps`) | `IQ3_S` | 48 |
| Attention `q/k/v/output` | `Q4_K` | 192 |
| `token_embd` | `Q6_K` | 1 |
| `output` (lm_head) | `Q6_K` | 1 |

Notes:

- **`down` experts get an extra bit (`IQ3_S`)**: they are more error-sensitive than `gate`/`up`, so they are protected.
- This architecture has **no shared expert**: all FFN experts are routed, so there is no always-on expert to hold separately at high precision.
- Quantization was guided by an **imatrix** computed over `bartowski/calibration_datav3.txt` (128 chunks, ctx 512). `imatrix.dat` is included in this repo.

**Effective rate: 2.99 BPW**, on-disk **11.4 GB** (10.6 GiB).

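As a rough consistency check, the effective rate ties file size to total parameter count: bits per weight equals file bits divided by parameters. The sketch below inverts that relation in integer shell arithmetic. It assumes the rounded 11.4 GB and 2.99 BPW figures quoted in this card, so the result is approximate, and the variable names are illustrative only.

```bash
#!/usr/bin/env bash
# Sanity check: implied parameter count from file size and effective BPW.
# Both inputs are rounded figures from this card, so the output is approximate.
file_bytes=11400000000   # 11.4 GB on disk (decimal GB)
bpw_milli=2990           # 2.99 bits per weight, in thousandths to stay in integer math

# params = file_bits / bits_per_weight
implied_params=$(( file_bytes * 8 * 1000 / bpw_milli ))

# Print in billions with one decimal place.
echo "implied parameters: $(( implied_params / 1000000000 )).$(( implied_params % 1000000000 / 100000000 ))B"
```

The implied count comes out at roughly 30.5B, in line with a 30B-class base model.
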
## Quality (perplexity, wikitext-2 raw test, 200 chunks @ ctx 512)

| Model | PPL | Δ vs Q4_K_M |
|---|---|---|
| This asym 2-bit-expert (2.99 BPW, 11.4 GB) | **7.62** | +0.31 (+4.2%) |
| Standard `Q4_K_M` (~4.8 BPW, 18.6 GB) | **7.32** | baseline |

PPL was measured with the same harness and chunk count for both models. Lower is better. The asym build trades a small PPL increase for a **~38% smaller** file that clears the 16 GB bar.

## 16 GB fit

- Weights on disk / in VRAM: **11.4 GB**.
- KV cache (48 layers, GQA `n_head_kv = 4`, `head_dim = 128`) at f16 is 96 KiB/token (~0.094 MiB), so a **16K-token** context adds ~**1.5 GiB** (~1.6 GB).
- 11.4 GB weights + ~1.6 GB KV (16K ctx) + runtime overhead ≈ **14 GB < 16 GB**. ✅

Use a quantized KV cache (`-ctk q8_0 -ctv q8_0`) to push context further.

## Usage (llama.cpp)

```bash
# Instruct (non-thinking) variant: no <think> blocks.
llama-server -m Qwen3-30B-A3B-Instruct-2507-asym-2bitexp.gguf -ngl 99 -c 16384
```

## Provenance / reproducibility

- **Source:** the `Q8_0` (near-lossless) file from `unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF`, requantized with `--allow-requantize`.
- **imatrix corpus:** `bartowski/calibration_datav3.txt`, 128 chunks @ ctx 512.
- **Tooling:** `llama-quantize` with repeatable `--tensor-type REGEX=TYPE` overrides plus `--token-embedding-type Q6_K --output-tensor-type Q6_K`, base type `IQ3_S`, imatrix-guided.

```bash
llama-quantize --allow-requantize --imatrix imatrix.dat \
  --tensor-type "ffn_gate_exps=IQ2_S" --tensor-type "ffn_up_exps=IQ2_S" \
  --tensor-type "ffn_down_exps=IQ3_S" \
  --tensor-type "attn_q=Q4_K" --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_v=Q4_K" --tensor-type "attn_output=Q4_K" \
  --token-embedding-type Q6_K --output-tensor-type Q6_K \
  src-Q8_0.gguf Qwen3-30B-A3B-Instruct-2507-asym-2bitexp.gguf IQ3_S
```

Coherence was verified on a coding task (`merge_intervals`) and a chickens/rabbits reasoning problem (35 heads / 94 legs → 23 chickens, 12 rabbits).
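
The KV-cache figure in the 16 GB fit section can be reproduced from the quoted shape (48 layers, `n_head_kv = 4`, `head_dim = 128`). The following is a minimal sketch, assuming every layer keeps a full K and V entry per token and ignoring runtime buffers. The q8_0 line assumes the standard block layout of 32 int8 values plus one f16 scale (34 bytes per 32 values). Variable names are illustrative.

```bash
#!/usr/bin/env bash
# KV-cache size estimate from the model shape quoted in this card.
n_layers=48
n_head_kv=4
head_dim=128
bytes_per_elem=2     # f16
n_ctx=16384          # 16K-token context

# K and V each store n_head_kv * head_dim elements per layer per token.
kv_bytes_per_token=$(( 2 * n_layers * n_head_kv * head_dim * bytes_per_elem ))
kv_f16_mib=$(( kv_bytes_per_token * n_ctx / 1024 / 1024 ))

# q8_0 packs 32 values into 34 bytes, versus 64 bytes for the same values at f16.
kv_q8_mib=$(( kv_f16_mib * 34 / 64 ))

echo "f16 KV:  ${kv_bytes_per_token} bytes/token, ${kv_f16_mib} MiB at ${n_ctx} tokens"
echo "q8_0 KV: ${kv_q8_mib} MiB at ${n_ctx} tokens"
```

Under these assumptions the f16 cache costs 98304 bytes (96 KiB) per token, or 1536 MiB at 16K context, and a q8_0 cache roughly halves that, which is where the extra context headroom comes from.
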