Qwen3-Coder-30B-A3B-Instruct — 4.06 bpw EXL3
EXL3 (trellis / MCG lattice) quantization of Qwen/Qwen3-Coder-30B-A3B-Instruct for use with exllamav3.
Quantization details
| Property | Value |
|---|---|
| Format | EXL3 (trellis/MCG lattice coding) |
| Bits per weight | 4.06 bpw |
| Head bits | 6 |
| Calibration | 250 rows × 2048 cols |
| exllamav3 version | 0.0.34 |
| Shards | 2 × safetensors (~7.8 GB + ~7.4 GB = ~15.2 GB total) |
Hardware requirements
| Config | VRAM needed |
|---|---|
| Weights only (no KV cache) | ~15.2 GB |
| 8 k context — Q4 KV cache | ~15.6 GB |
| 32 k context — Q4 KV cache | ~16.4 GB |
| 256 k context — Q4 KV cache | ~22.4 GB |
Full 256k native context fits on a single RTX 3090 (24 GB) with Q4 KV cache.
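The context-dependent part of these figures can be sanity-checked with simple arithmetic. The sketch below estimates the raw KV cache payload from the layer and KV-head counts given for the base model (48 layers, 4 KV heads). The head dimension of 128 is an assumption not stated in this card, and `kv_cache_bytes` is a hypothetical helper, not part of exllamav3. It ignores quantization scales and runtime buffers, so measured usage will be somewhat higher.

```python
def kv_cache_bytes(tokens, layers=48, kv_heads=4, head_dim=128, bits=4):
    """Raw K+V payload in bytes, ignoring quantization scales and runtime buffers.

    head_dim=128 is an assumption; layers and kv_heads come from the model summary.
    """
    elements_per_token = 2 * layers * kv_heads * head_dim  # 2 = one K and one V entry
    return tokens * elements_per_token * bits // 8


for ctx in (8192, 32768, 262144):
    gib = kv_cache_bytes(ctx) / 1024**3
    print(f"{ctx:>7} tokens: {gib:.2f} GiB")
```

Under these assumptions the 256k cache comes to 6 GiB (about 6.4 GB) of raw payload, in the same range as the roughly 7.2 GB gap between the weights-only and 256k rows of the table.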
Performance (RTX 3090, single GPU)
| Metric | Value |
|---|---|
| Throughput | 106 t/s |
| Time to first token | 0.52 s |
Measured with exllamav3 + a ~250-token prompt, 1800 new tokens, temperature 0.6.
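As a rough guide to what these numbers mean for a single request, the sketch below combines them into an end-to-end estimate. It assumes the 106 t/s figure is the steady decode rate and that time to first token is counted separately; the card does not state this, and `estimated_latency_s` is a hypothetical helper.

```python
def estimated_latency_s(new_tokens, tokens_per_s=106.0, ttft_s=0.52):
    """Approximate wall-clock time for one response.

    Assumes tokens_per_s is the decode rate and ttft_s is added on top.
    """
    return ttft_s + new_tokens / tokens_per_s


# The benchmark run itself: 1800 new tokens
print(f"{estimated_latency_s(1800):.1f} s")  # prints "17.5 s"
```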
Usage
Install exllamav3:
pip install exllamav3
Minimal inference example:
from exllamav3 import Config, Model, Tokenizer
from exllamav3.generator import Generator, Job
from exllamav3.cache import Cache, CacheLayer_q4
MODEL_DIR = "/path/to/Qwen3-Coder-30B-A3B-4.0bpw-exl3"
config = Config.from_directory(MODEL_DIR)
config.max_seq_len = 8192 # adjust to your context needs (up to 262144)
model = Model.from_config(config)
cache = Cache(model, max_num_tokens=8192, layer_type=CacheLayer_q4) # cache BEFORE load
model.load() # autosplit across available GPUs
tokenizer = Tokenizer.from_config(config)
prompt = "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
ids = tokenizer.encode(prompt, add_bos=True)
job = Job(input_ids=ids, max_new_tokens=512)
gen = Generator(model=model, cache=cache, tokenizer=tokenizer)
gen.enqueue(job)
output_ids = []
while gen.num_remaining_jobs():
    for result in gen.iterate():
        if result.get("token_ids"):
            output_ids.extend(result["token_ids"])
print(tokenizer.decode(output_ids))
Important: create Cache(...) before calling model.load(). The cache allocates its KV tensors during the model load pass; reversing the order leaves the KV tensors as None and causes a runtime assertion error.
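The hand-written prompt string in the example follows the ChatML layout (`<|im_start|>role ... <|im_end|>`). For multi-turn conversations, a small helper can assemble the same format from a message list. This is a minimal sketch: `build_chatml_prompt` is a hypothetical helper, not an exllamav3 function, and it does not handle tool definitions. The chat template shipped with the model remains the authoritative format.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Join role/content messages into a ChatML prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open the assistant turn so the model continues from here
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "Hello!"}])
```

The resulting `prompt` is identical to the literal string used in the inference example and can be passed to `tokenizer.encode` in the same way.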
Single-GPU with display apps (autosplit workaround)
If autosplit fails with "Insufficient VRAM" on a machine running a desktop environment (GNOME, KDE, etc.), display apps are holding GPU memory that the autosplit budget calculator counts against the available pool. To work around it, load onto a single device explicitly:
model.load(device="cuda:0") # bypass autosplit budget check
With Q4 KV cache for 256k context
from exllamav3.cache import Cache, CacheLayer_q4
cache = Cache(model, max_num_tokens=262144, layer_type=CacheLayer_q4)
Via TabbyAPI (OpenAI-compatible server)
TabbyAPI supports the exllamav3 backend natively.
Example config.yml for full 256k context on a single 3090:
model:
  model_dir: /path/to/models
  model_name: Qwen3-Coder-30B-A3B-4.0bpw-exl3
  backend: exllamav3
  max_seq_len: 262144
  cache_size: 262144
  cache_mode: Q4
  tool_format: qwen3_coder # enables XML → OpenAI tool_calls parsing
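Once the server is running, any OpenAI-style client can talk to it. The sketch below uses only the Python standard library. The base URL (`http://127.0.0.1:5000`), the bearer-token header, and the placeholder API key are assumptions about a typical local TabbyAPI setup, so adjust them to match your own configuration. `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:5000"  # assumption: local server address and port
API_KEY = "your-tabbyapi-key"       # placeholder: replace with your own key


def build_chat_request(user_text, max_tokens=512):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": "Qwen3-Coder-30B-A3B-4.0bpw-exl3",
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": 0.6,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


def chat(user_text):
    """Send one user message and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_text)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(chat("Write a Python function that reverses a string."))
```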
About EXL3
EXL3 uses trellis / MCG lattice coding instead of the codebook quantization used in EXL2. At matching bitrates, EXL3 achieves lower per-weight quantization error than EXL2; its quality is roughly that of an EXL2 quant ~0.1–0.2 bpw larger.
About the base model
Qwen3-Coder-30B-A3B-Instruct is a Mixture-of-Experts coding model from the Qwen Team:
- 30.5B total parameters, 3.3B activated per token — MoE efficiency
- 48 layers, 32 Q heads / 4 KV heads (GQA), 128 experts / 8 active
- 262,144 token native context (256k), extendable to 1M with YaRN
- No <think> blocks: direct output, no reasoning preamble overhead
- Trained for agentic coding with native function-calling support
License: Apache 2.0
See the original model card and the Qwen3 technical report for full benchmark results.
Model tree for galvani78/Qwen3-Coder-30B-A3B-4.0bpw-EXL3
Base model
Qwen/Qwen3-Coder-30B-A3B-Instruct