---
license: apache-2.0
base_model: Qwen/Qwen3-30B-A3B-Instruct-2507
base_model_relation: quantized
tags:
- gguf
- qwen3moe
- imatrix
- asymmetric-quantization
- 2-bit
- moe
pipeline_tag: text-generation
---

# Qwen3-30B-A3B-Instruct-2507 — Asymmetric 2-bit-Expert GGUF (imatrix)

An **asymmetric, expert-aware** quantization of
[Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
(arch `qwen3moe`, 128 routed experts, 8 active per token, ~3B active params,
48 layers).

The idea (the "antirez" insight): in a routed-MoE model the bulk of the weights
live in the expert FFNs, and most experts are only sparsely active. Push the
routed experts to **2-bit** where the model is most redundant, keep the
**attention and the embedding/output weights at higher precision** where error
is most damaging, and steer the per-tensor bit-allocation with an **importance
matrix (imatrix)**. The result fits comfortably in **16 GB** with only a modest
perplexity cost versus the standard 4-bit baseline.

## Asymmetric quantization scheme

| Tensor group | Type | Count |
|---|---|---|
| Routed expert **gate** (`ffn_gate_exps`) | `IQ2_S` | 48 |
| Routed expert **up** (`ffn_up_exps`)     | `IQ2_S` | 48 |
| Routed expert **down** (`ffn_down_exps`) | `IQ3_S` | 48 |
| Attention `q/k/v/output`                 | `Q4_K`  | 192 |
| `token_embd`                             | `Q6_K`  | 1 |
| `output` (lm_head)                       | `Q6_K`  | 1 |

Notes:
- **`down` experts get an extra bit (`IQ3_S`)** — they are more error-sensitive
  than `gate`/`up`, so they are protected.
- This architecture has **no shared expert** — all FFN experts are routed, so
  there is no always-on expert to hold separately at high precision.
- Quantization was guided by an **imatrix** computed over
  `bartowski/calibration_datav3.txt` (128 chunks, ctx 512). `imatrix.dat` is
  included in this repo.

**Effective rate: 2.99 BPW**, on-disk **11.4 GB** (10.14 GiB).

## Quality (perplexity, wikitext-2 raw test, 200 chunks @ ctx 512)

| Model | PPL | Δ vs Q4_K_M |
|---|---|---|
| This asym 2-bit-expert (2.99 BPW, 11.4 GB) | **7.62** | +0.31 (+4.2%) |
| Standard `Q4_K_M` (~4.8 BPW, 18.6 GB) | **7.32** | — |

PPL measured with the same harness and chunk count for both. Lower is better.
The asym build trades a small PPL increase for a **~38% smaller** file that
clears the 16 GB bar.

## 16 GB fit

- Weights on disk / in VRAM: **11.4 GB**.
- KV cache (48 layers, GQA `n_head_kv = 4`, `head_dim = 128`) at f16 is
  ~0.092 MB/token, so a **16K-token** context adds ~**1.5 GB**.
- 11.4 GB weights + ~1.5 GB KV (16K ctx) + runtime overhead ≈ **~14 GB < 16 GB**. ✅

Use a quantized KV cache (`-ctk q8_0 -ctv q8_0`) to push context further.

## Usage (llama.cpp)

```bash
# Instruct (non-thinking) variant — no <think> blocks.
llama-server -m Qwen3-30B-A3B-Instruct-2507-asym-2bitexp.gguf -ngl 99 -c 16384
```

## Provenance / reproducibility

- **Source:** `unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF` at `Q8_0`
  (near-lossless) as the requantization source (`--allow-requantize`).
- **imatrix corpus:** `bartowski/calibration_datav3.txt`, 128 chunks @ ctx 512.
- **Tooling:** `llama-quantize` with repeatable `--tensor-type REGEX=TYPE`
  overrides plus `--token-embedding-type Q6_K --output-tensor-type Q6_K`,
  base type `IQ3_S`, imatrix-guided.

```bash
llama-quantize --allow-requantize --imatrix imatrix.dat \
  --tensor-type "ffn_gate_exps=IQ2_S" --tensor-type "ffn_up_exps=IQ2_S" \
  --tensor-type "ffn_down_exps=IQ3_S" \
  --tensor-type "attn_q=Q4_K" --tensor-type "attn_k=Q4_K" \
  --tensor-type "attn_v=Q4_K" --tensor-type "attn_output=Q4_K" \
  --token-embedding-type Q6_K --output-tensor-type Q6_K \
  src-Q8_0.gguf Qwen3-30B-A3B-Instruct-2507-asym-2bitexp.gguf IQ3_S
```

Coherence verified on a coding task (`merge_intervals`) and a chickens/rabbits
reasoning problem (35 heads / 94 legs → 23 chickens, 12 rabbits).