Qwen3-Coder-30B-A3B โ€” Asymmetric 2-bit-Expert GGUF (imatrix)

An asymmetric, expert-aware quantization of Qwen/Qwen3-Coder-30B-A3B-Instruct (arch qwen3moe, 128 routed experts, 8 active per token, ~3B active params).

The idea (the "antirez" insight): in a routed-MoE model the bulk of the weights live in the expert FFNs, and most experts are only sparsely active. Push the routed experts to 2-bit where the model is most redundant, keep the attention and the embedding/output tied weights at higher precision where error is most damaging, and steer the per-tensor bit-allocation with an importance matrix (imatrix). The result fits comfortably in 16 GB with only a modest perplexity cost versus the standard 4-bit baseline.

Asymmetric quantization scheme

Tensor group Type Bits Count
Routed expert gate (ffn_gate_exps) IQ2_S ~2.5 48
Routed expert up (ffn_up_exps) IQ2_S ~2.5 48
Routed expert down (ffn_down_exps) IQ3_S ~3.44 48
Attention q/k/v/output Q4_K ~4.5 192
token_embd Q6_K ~6.6 1
output (lm_head) Q6_K ~6.6 1

Notes:

  • down experts get an extra bit (IQ3_S) โ€” they are more error-sensitive than gate/up, so they are protected.
  • This model architecture has no shared expert (expert_shared_feed_forward_length = 0), so there is no always-on expert to hold at high precision โ€” all FFN experts are routed.
  • Quantization was guided by an imatrix computed over bartowski/calibration_datav3.txt (128 chunks, ctx 512). imatrix.dat is included in this repo.

Effective rate: 2.99 BPW, on-disk 11.4 GB (10.64 GiB).

Quality (perplexity, wikitext-2 raw test, 200 chunks @ ctx 512)

Model PPL ฮ” vs Q4_K_M
This asym 2-bit-expert (2.99 BPW, 11.4 GB) 9.93 +0.33 (+3.4%)
Standard Q4_K_M (~4.8 BPW, 18.6 GB) 9.60 โ€”

PPL measured with the same harness and chunk count for both. Lower is better. The asym build trades a small PPL increase for a ~39% smaller file that clears the 16 GB bar.

16 GB fit

  • Weights on disk / in VRAM: 11.4 GB.
  • KV cache (this arch: 48 layers, GQA, n_head_kv small) at f16 is on the order of ~0.13 GB per 1K tokens, so a 16K-token context adds roughly ~2 GB.
  • 11.4 GB weights + 2 GB KV (16K ctx) + runtime overhead โ‰ˆ **14 GB < 16 GB**. โœ…

Use a quantized KV cache (-ctk q8_0 -ctv q8_0) to push context further.

Usage (llama.cpp)

# This is the Instruct (non-thinking) variant โ€” no <think> blocks.
llama-server -m Qwen3-Coder-30B-A3B-asym-2bitexp.gguf -ngl 99 -c 16384

Provenance / reproducibility

  • Source: unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF at Q8_0 (near-lossless) as the requantization source (--allow-requantize).
  • imatrix corpus: bartowski/calibration_datav3.txt, 128 chunks @ ctx 512.
  • Tooling: llama-quantize with repeatable --tensor-type REGEX=TYPE overrides plus --token-embedding-type Q6_K --output-tensor-type Q6_K, base type IQ3_S, imatrix-guided.

Coherence verified on a coding task (correct merge_intervals) and a multi-step word problem (35 heads / 94 legs โ†’ 23 chickens, 12 rabbits, with a correct check).

Downloads last month
238
GGUF
Model size
31B params
Architecture
qwen3moe
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for hyperspaceai/Qwen3-Coder-30B-A3B-asym-2bitexp-GGUF

Quantized
(148)
this model