varunmathur commited on
Commit
b5ad90d
·
verified ·
1 Parent(s): 6c2eb2e

Upload README.md with huggingface_hub

Browse files
Files changed (1) hide show
  1. README.md +99 -0
README.md ADDED
@@ -0,0 +1,99 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ base_model: Qwen/Qwen3-30B-A3B-Instruct-2507
4
+ base_model_relation: quantized
5
+ tags:
6
+ - gguf
7
+ - qwen3moe
8
+ - imatrix
9
+ - asymmetric-quantization
10
+ - 2-bit
11
+ - moe
12
+ pipeline_tag: text-generation
13
+ ---
14
+
15
+ # Qwen3-30B-A3B-Instruct-2507 — Asymmetric 2-bit-Expert GGUF (imatrix)
16
+
17
+ An **asymmetric, expert-aware** quantization of
18
+ [Qwen/Qwen3-30B-A3B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507)
19
+ (arch `qwen3moe`, 128 routed experts, 8 active per token, ~3B active params,
20
+ 48 layers).
21
+
22
+ The idea (the "antirez" insight): in a routed-MoE model the bulk of the weights
23
+ live in the expert FFNs, and most experts are only sparsely active. Push the
24
+ routed experts to **2-bit** where the model is most redundant, keep the
25
+ **attention and the embedding/output weights at higher precision** where error
26
+ is most damaging, and steer the per-tensor bit-allocation with an **importance
27
+ matrix (imatrix)**. The result fits comfortably in **16 GB** with only a modest
28
+ perplexity cost versus the standard 4-bit baseline.
29
+
30
+ ## Asymmetric quantization scheme
31
+
32
+ | Tensor group | Type | Count |
33
+ |---|---|---|
34
+ | Routed expert **gate** (`ffn_gate_exps`) | `IQ2_S` | 48 |
35
+ | Routed expert **up** (`ffn_up_exps`) | `IQ2_S` | 48 |
36
+ | Routed expert **down** (`ffn_down_exps`) | `IQ3_S` | 48 |
37
+ | Attention `q/k/v/output` | `Q4_K` | 192 |
38
+ | `token_embd` | `Q6_K` | 1 |
39
+ | `output` (lm_head) | `Q6_K` | 1 |
40
+
41
+ Notes:
42
+ - **`down` experts get an extra bit (`IQ3_S`)** — they are more error-sensitive
43
+ than `gate`/`up`, so they are protected.
44
+ - This architecture has **no shared expert** — all FFN experts are routed, so
45
+ there is no always-on expert to hold separately at high precision.
46
+ - Quantization was guided by an **imatrix** computed over
47
+ `bartowski/calibration_datav3.txt` (128 chunks, ctx 512). `imatrix.dat` is
48
+ included in this repo.
49
+
50
+ **Effective rate: 2.99 BPW**, on-disk **11.4 GB** (10.14 GiB).
51
+
52
+ ## Quality (perplexity, wikitext-2 raw test, 200 chunks @ ctx 512)
53
+
54
+ | Model | PPL | Δ vs Q4_K_M |
55
+ |---|---|---|
56
+ | This asym 2-bit-expert (2.99 BPW, 11.4 GB) | **7.62** | +0.31 (+4.2%) |
57
+ | Standard `Q4_K_M` (~4.8 BPW, 18.6 GB) | **7.32** | — |
58
+
59
+ PPL measured with the same harness and chunk count for both. Lower is better.
60
+ The asym build trades a small PPL increase for a **~38% smaller** file that
61
+ clears the 16 GB bar.
62
+
63
+ ## 16 GB fit
64
+
65
+ - Weights on disk / in VRAM: **11.4 GB**.
66
+ - KV cache (48 layers, GQA `n_head_kv = 4`, `head_dim = 128`) at f16 is
67
+ ~0.092 MB/token, so a **16K-token** context adds ~**1.5 GB**.
68
+ - 11.4 GB weights + ~1.5 GB KV (16K ctx) + runtime overhead ≈ **~14 GB < 16 GB**. ✅
69
+
70
+ Use a quantized KV cache (`-ctk q8_0 -ctv q8_0`) to push context further.
71
+
72
+ ## Usage (llama.cpp)
73
+
74
+ ```bash
75
+ # Instruct (non-thinking) variant — no <think> blocks.
76
+ llama-server -m Qwen3-30B-A3B-Instruct-2507-asym-2bitexp.gguf -ngl 99 -c 16384
77
+ ```
78
+
79
+ ## Provenance / reproducibility
80
+
81
+ - **Source:** `unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF` at `Q8_0`
82
+ (near-lossless) as the requantization source (`--allow-requantize`).
83
+ - **imatrix corpus:** `bartowski/calibration_datav3.txt`, 128 chunks @ ctx 512.
84
+ - **Tooling:** `llama-quantize` with repeatable `--tensor-type REGEX=TYPE`
85
+ overrides plus `--token-embedding-type Q6_K --output-tensor-type Q6_K`,
86
+ base type `IQ3_S`, imatrix-guided.
87
+
88
+ ```bash
89
+ llama-quantize --allow-requantize --imatrix imatrix.dat \
90
+ --tensor-type "ffn_gate_exps=IQ2_S" --tensor-type "ffn_up_exps=IQ2_S" \
91
+ --tensor-type "ffn_down_exps=IQ3_S" \
92
+ --tensor-type "attn_q=Q4_K" --tensor-type "attn_k=Q4_K" \
93
+ --tensor-type "attn_v=Q4_K" --tensor-type "attn_output=Q4_K" \
94
+ --token-embedding-type Q6_K --output-tensor-type Q6_K \
95
+ src-Q8_0.gguf Qwen3-30B-A3B-Instruct-2507-asym-2bitexp.gguf IQ3_S
96
+ ```
97
+
98
+ Coherence verified on a coding task (`merge_intervals`) and a chickens/rabbits
99
+ reasoning problem (35 heads / 94 legs → 23 chickens, 12 rabbits).