Qwen-AgentWorld-35B-A3B GGUF

GGUF quantizations of Qwen/Qwen-AgentWorld-35B-A3B generated with llama.cpp.

  • Architecture: qwen35moe (35B params, 256 experts / 8 active, A3B)
  • Context length: 262 144
  • Source: BF16 GGUF converted via convert_hf_to_gguf.py
  • Importance matrix (coding.imatrix.gguf) generated from a curated coding calibration text (1 840 samples), 128 chunks, ctx 4 096.

Files

File Size (GiB) BPW Use
Qwen-AgentWorld-35B-A3B-BF16.gguf 64.61 16.01 reference / highest fidelity
Qwen-AgentWorld-35B-A3B-Q8_0.gguf 34.37 8.52 near-lossless, needs >24 GB VRAM with all layers offloaded
Qwen-AgentWorld-35B-A3B-Q6_K.gguf 26.56 6.58 very high quality, ~20 GB VRAM
Qwen-AgentWorld-35B-A3B-Q4_K_M.gguf 19.71 4.88 balanced quality / size, ~14 GB VRAM
Qwen-AgentWorld-35B-A3B-Q2_K.gguf 12.05 2.99 smallest, ~9 GB VRAM, quality trade-off
Qwen-AgentWorld-35B-A3B-coding.imatrix.gguf 0.18 importance matrix for finer quant recipes (Tensor-type overrides).

Plain quantization passes (no recipe overrides) only — these avoid the std::bad_alloc triggered by --tensor-type-file + imatrix on this specific MoE architecture in the current llama.cpp build (8194 / 1179bfc82). The imatrix file is still provided for users who want to mix types via --tensor-type-file.

Recommended inference

llama.cpp/build/bin/llama-server \
    -m Qwen-AgentWorld-35B-A3B-Q4_K_M.gguf \
    -ngl 999 --ctx-size 8192 -b 2048 -ub 512 -np 1 \
    --temp 0.6 --top-p 0.95 --top-k 20 \
    -fa --host 0.0.0.0 --port 8080
Downloads last month
12,186
GGUF
Model size
35B params
Architecture
qwen35moe
Hardware compatibility
Log In to add your hardware

2-bit

4-bit

6-bit

8-bit

16-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for groxaxo/Qwen-AgentWorld-35B-A3B-GGUF

Quantized
(31)
this model