Qwen3-30B-A3B-Instruct-2507 โ€” Asymmetric 2-bit-Expert GGUF (imatrix)

An asymmetric, expert-aware quantization of Qwen/Qwen3-30B-A3B-Instruct-2507 (arch qwen3moe, 128 routed experts, 8 active per token, ~3B active params, 48 layers).

The idea (the "antirez" insight): in a routed-MoE model the bulk of the weights live in the expert FFNs, and most experts are only sparsely active. Push the routed experts to 2-bit where the model is most redundant, keep the attention and the embedding/output weights at higher precision where error is most damaging, and steer the per-tensor bit-allocation with an importance matrix (imatrix). The result fits comfortably in 16 GB with only a modest perplexity cost versus the standard 4-bit baseline.

Asymmetric quantization scheme

Tensor group Type Count
Routed expert gate (ffn_gate_exps) IQ2_S 48
Routed expert up (ffn_up_exps) IQ2_S 48
Routed expert down (ffn_down_exps) IQ3_S 48
Attention q/k/v/output Q4_K 192
token_embd Q6_K 1
output (lm_head) Q6_K 1

Notes:

  • down experts get an extra bit (IQ3_S) โ€” they are more error-sensitive than gate/up, so they are protected.
  • This architecture has no shared expert โ€” all FFN experts are routed, so there is no always-on expert to hold separately at high precision.
  • Quantization was guided by an imatrix computed over bartowski/calibration_datav3.txt (128 chunks, ctx 512). imatrix.dat is included in this repo.

Effective rate: 2.99 BPW, on-disk 11.4 GB (10.14 GiB).

Quality (perplexity, wikitext-2 raw test, 200 chunks @ ctx 512)

Model PPL ฮ” vs Q4_K_M
This asym 2-bit-expert (2.99 BPW, 11.4 GB) 7.62 +0.31 (+4.2%)
Standard Q4_K_M (~4.8 BPW, 18.6 GB) 7.32 โ€”

PPL measured with the same harness and chunk count for both. Lower is better. The asym build trades a small PPL increase for a ~38% smaller file that clears the 16 GB bar.

16 GB fit

  • Weights on disk / in VRAM: 11.4 GB.
  • KV cache (48 layers, GQA n_head_kv = 4, head_dim = 128) at f16 is ~0.092 MB/token, so a 16K-token context adds ~1.5 GB.
  • 11.4 GB weights + 1.5 GB KV (16K ctx) + runtime overhead โ‰ˆ **14 GB < 16 GB**. โœ…

Use a quantized KV cache (-ctk q8_0 -ctv q8_0) to push context further.

Usage (llama.cpp)

# Instruct (non-thinking) variant โ€” no <think> blocks.
llama-server -m Qwen3-30B-A3B-Instruct-2507-asym-2bitexp.gguf -ngl 99 -c 16384

Provenance / reproducibility

  • Source: unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF at Q8_0 (near-lossless) as the requantization source (--allow-requantize).
  • imatrix corpus: bartowski/calibration_datav3.txt, 128 chunks @ ctx 512.
  • Tooling: llama-quantize with repeatable --tensor-type REGEX=TYPE overrides plus --token-embedding-type Q6_K --output-tensor-type Q6_K, base type IQ3_S, imatrix-guided.
llama-quantize --allow-requantize --imatrix imatrix.dat \
  --tensor-type "ffn_gate_exps=IQ2_S" --tensor-type "ffn_up_exps=IQ2_S" \
  --tensor-type "ffn_down_exps=IQ3_S" \
  --tensor-type "attn_q=Q4_K" --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_v=Q4_K" --tensor-type "attn_output=Q4_K" \
  --token-embedding-type Q6_K --output-tensor-type Q6_K \
  src-Q8_0.gguf Qwen3-30B-A3B-Instruct-2507-asym-2bitexp.gguf IQ3_S

Coherence verified on a coding task (merge_intervals) and a chickens/rabbits reasoning problem (35 heads / 94 legs โ†’ 23 chickens, 12 rabbits).

Downloads last month
173
GGUF
Model size
31B params
Architecture
qwen3moe
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for hyperspaceai/Qwen3-30B-A3B-Instruct-2507-asym-2bitexp-GGUF

Quantized
(127)
this model