Qwen-AgentWorld-35B-A3B-NVFP4

NVFP4 (mixed-precision, compressed-tensors) quantization of Qwen/Qwen-AgentWorld-35B-A3B, produced with llm-compressor for fast inference with vLLM on NVIDIA Blackwell (sm120 / RTX PRO 6000).

Benchmark vs official BF16

NVFP4 vs BF16 benchmark

Measured on the same hardware (1× RTX PRO 6000 Blackwell) and identical vLLM 0.23 config (--max-model-len 8192 --max-num-seqs 64 --gpu-memory-utilization 0.90, temperature=0):

Metric Official BF16 NVFP4 (this model) Δ
Disk size 66 GB 24.96 GB −62%
First-token latency (TTFT) 35 ms 32 ms −8.6%
Single-stream decode 157.9 tok/s 184.1 tok/s +16.6%
Concurrent throughput (N=16) 1351.6 tok/s 1430.5 tok/s +5.8%

Quality (temperature=0): 17×23, a factorial function, "why is the sky blue", and echo $((6*7)) — all correct and equivalent to the BF16 reference. The NVFP4 build is faster and ~1/3 the size with matching quality.

Quantization

  • Tool: llm-compressor 0.12, compressed-tensors (format mixed-precision).
  • Scheme:
    • Attention (self_attn.{q,k,v,o}_proj, GDN linear_attn.{in_proj_qkv,in_proj_z,out_proj}) → FP8 (block [128,128]).
    • MoE experts (gate_proj/up_proj/down_proj) → NVFP4 (group size 16, fp8_e4m3 scales).
    • Left in BF16 (ignored): lm_head, embed_tokens, router mlp.gate, shared expert, GDN state (linear_attn.{A_log,conv1d,in_proj_a,in_proj_b}), first/last MoE layer experts, and the vision tower.
  • MoE calibration: all 256 experts calibrated (moe_calibrate_all_experts=True) on HuggingFaceH4/ultrachat_200k (256 samples, seq len 2048).
  • Size: ~25 GB (down from ~70 GB BF16). No MTP / speculative module.

Serving (vLLM)

Text-only deployment (the source defines a vision tower but sets language_model_only=true). vLLM auto-detects compressed-tensors — no --quantization flag needed.

vllm serve kyaky/Qwen-AgentWorld-35B-A3B-NVFP4 \
  --max-model-len 262144 --max-num-batched-tokens 2096 --max-num-seqs 256 \
  --enable-prefix-caching --disable-custom-all-reduce --trust-remote-code \
  --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_xml

Notes: hybrid Gated-DeltaNet attention requires --max-num-batched-tokens 2096 (cache alignment); head_dim=256 auto-selects the Triton attention backend.

License

Apache-2.0 (inherited from the base model).

Downloads last month
100
Safetensors
Model size
22B params
Tensor type
F32
·
BF16
·
F8_E4M3
·
U8
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for kyaky/Qwen-AgentWorld-35B-A3B-NVFP4

Quantized
(34)
this model