Qwen3-Coder-30B-A3B-Instruct — 4.06 bpw EXL3

EXL3 (trellis / MCG lattice) quantization of Qwen/Qwen3-Coder-30B-A3B-Instruct for use with exllamav3.

Quantization details

| Property | Value |
|---|---|
| Format | EXL3 (trellis/MCG lattice coding) |
| Bits per weight | 4.06 bpw |
| Head bits | 6 |
| Calibration | 250 rows × 2048 cols |
| exllamav3 version | 0.0.34 |
| Shards | 2 × safetensors (~7.8 GB + ~7.4 GB = ~15.2 GB total) |

Hardware requirements

| Config | VRAM needed |
|---|---|
| Weights only (no KV cache) | ~15.2 GB |
| 8k context, Q4 KV cache | ~15.6 GB |
| 32k context, Q4 KV cache | ~16.4 GB |
| 256k context, Q4 KV cache | ~22.4 GB |

The full 256k native context fits on a single RTX 3090 (24 GB) with a Q4 KV cache, leaving about 1.6 GB of headroom.
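The context-dependent part of these figures can be sanity-checked with back-of-envelope arithmetic. The sketch below is an estimate only: it uses the 48 layers and 4 KV heads listed for the base model, assumes a head dimension of 128 (not stated in this card), and treats Q4 as exactly 4 bits per element, ignoring quantization scales and allocator overhead. The function name is illustrative and not part of exllamav3.

```python
def kv_cache_bytes(num_tokens, layers=48, kv_heads=4, head_dim=128, bits=4):
    """Rough KV cache size in bytes: keys + values for every layer and KV head.

    head_dim=128 and bits=4 are assumptions; real Q4 caches also store scales.
    """
    elements_per_token = 2 * layers * kv_heads * head_dim  # 2 = one K and one V
    return num_tokens * elements_per_token * bits // 8


for ctx in (8192, 32768, 262144):
    gib = kv_cache_bytes(ctx) / 1024**3
    print(f"{ctx:>7} tokens: ~{gib:.2f} GiB of KV cache")
```

The figures in the table are somewhat higher than this estimate, which would be consistent with scale factors and runtime buffers that the sketch does not count.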

Performance (RTX 3090, single GPU)

| Metric | Value |
|---|---|
| Throughput | 106 t/s |
| Time to first token | 0.52 s |

Measured with exllamav3 using a ~250-token prompt, 1800 generated tokens, and temperature 0.6.
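The card does not say how these two numbers were computed. One common convention, sketched below, is to time the gap until the first token arrives and then compute throughput over the remaining tokens. The helper takes any iterable of tokens, so it is independent of the inference library; `measure_stream` and `fake_stream` are hypothetical names used only for illustration.

```python
import time


def measure_stream(token_stream):
    """Return (time_to_first_token_s, tokens_per_second, token_count)."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in token_stream:
        if first is None:
            first = time.perf_counter()  # arrival time of the first token
        count += 1
    end = time.perf_counter()
    if count == 0:
        return None, 0.0, 0
    ttft = first - start
    decode_time = end - first
    # Rate over the decode phase only: the first token is excluded.
    tps = (count - 1) / decode_time if count > 1 and decode_time > 0 else 0.0
    return ttft, tps, count


def fake_stream(n, delay=0.001):
    """Stand-in for a real generator: yields n tokens with a fixed delay."""
    for i in range(n):
        time.sleep(delay)
        yield i


ttft, tps, n = measure_stream(fake_stream(50))
print(f"TTFT {ttft:.3f} s, {tps:.0f} t/s over {n} tokens")
```

To measure a real run, pass a generator that yields each token as the inference loop produces it.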

Usage

Install exllamav3:

pip install exllamav3

Minimal inference example:

from exllamav3 import Config, Model, Tokenizer
from exllamav3.generator import Generator, Job
from exllamav3.cache import Cache, CacheLayer_q4

MODEL_DIR = "/path/to/Qwen3-Coder-30B-A3B-4.0bpw-exl3"

config = Config.from_directory(MODEL_DIR)
config.max_seq_len = 8192          # adjust to your context needs (up to 262144)

model = Model.from_config(config)
cache = Cache(model, max_num_tokens=8192, layer_type=CacheLayer_q4)  # cache BEFORE load

model.load()                       # autosplit across available GPUs
tokenizer = Tokenizer.from_config(config)

prompt = "<|im_start|>user\nHello!<|im_end|>\n<|im_start|>assistant\n"
ids = tokenizer.encode(prompt, add_bos=True)

job = Job(input_ids=ids, max_new_tokens=512)
gen = Generator(model=model, cache=cache, tokenizer=tokenizer)
gen.enqueue(job)

output_ids = []
while gen.num_remaining_jobs():
    for result in gen.iterate():
        if result.get("token_ids"):
            output_ids.extend(result["token_ids"])

print(tokenizer.decode(output_ids))

Important: create Cache(...) before calling model.load(). The cache allocates its KV tensors during the model load pass — reversing the order leaves the KV tensors as None and causes a runtime assertion error.
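The hand-written prompt in the example above uses the ChatML turn layout. For multi-turn conversations, a small helper can build the same string from a message list. This is a simplified sketch: `build_chatml_prompt` is a hypothetical helper, and it does not reproduce the tool-calling or default-system-prompt behavior of the model's official chat template.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Format a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = build_chatml_prompt([{"role": "user", "content": "Hello!"}])
print(repr(prompt))
```

For a single user message this yields the same string as the literal prompt used in the inference example.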

Single-GPU with display apps (autosplit workaround)

On a machine running a desktop environment (GNOME, KDE, etc.), display apps consume GPU memory that the autosplit budget calculator counts against the available pool, so autosplit can fail with "Insufficient VRAM". To work around this, force a single device:

model.load(device="cuda:0")   # bypass autosplit budget check

With Q4 KV cache for 256k context

from exllamav3.cache import Cache, CacheLayer_q4

cache = Cache(model, max_num_tokens=262144, layer_type=CacheLayer_q4)

Via TabbyAPI (OpenAI-compatible server)

TabbyAPI supports the exllamav3 backend natively. Example config.yml for full 256k context on a single 3090:

model:
  model_dir: /path/to/models
  model_name: Qwen3-Coder-30B-A3B-4.0bpw-exl3
  backend: exllamav3
  max_seq_len: 262144
  cache_size: 262144
  cache_mode: Q4
  tool_format: qwen3_coder   # enables XML → OpenAI tool_calls parsing
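
Once the server is running, any OpenAI-style client can talk to it. The sketch below uses only the Python standard library. The base URL (`http://localhost:5000`), the placeholder API key, and the Bearer-style header are assumptions about a default local setup, so adjust them to match your own TabbyAPI configuration. The helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(prompt,
                       base_url="http://localhost:5000",   # assumed local address
                       api_key="YOUR_API_KEY",             # placeholder, not a real key
                       model="Qwen3-Coder-30B-A3B-4.0bpw-exl3"):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 512,
        "temperature": 0.6,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Write a function that reverses a string."))` would print the model's reply.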

About EXL3

EXL3 uses trellis / MCG lattice coding instead of the codebook quantization used in EXL2. At matching bitrates, EXL3 achieves lower per-weight quantization error than EXL2, so an EXL3 quant performs roughly like an EXL2 quant that is ~0.1–0.2 bpw larger.

About the base model

Qwen3-Coder-30B-A3B-Instruct is a Mixture-of-Experts coding model from the Qwen Team:

  • 30.5B total parameters, 3.3B activated per token — MoE efficiency
  • 48 layers, 32 Q heads / 4 KV heads (GQA), 128 experts / 8 active
  • 262,144 token native context (256k), extendable to 1M with YaRN
  • No <think> blocks — direct output, no reasoning preamble overhead
  • Trained for agentic coding with native function-calling support

License: Apache 2.0

See the original model card and the Qwen3 technical report for full benchmark results.
