--- license: apache-2.0 base_model: Qwen/Qwen3.6-35B-A3B base_model_relation: quantized tags: - nvfp4 - fp4 - llm-compressor - compressed-tensors - vllm - moe - qwen3_5_moe language: - en pipeline_tag: text-generation --- # Qwen3.6-35B-A3B-NVFP4 (self-quantized, llm-compressor) NVFP4 (4-bit) quantization of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B), the hybrid **Gated-DeltaNet + 256-expert MoE** model (35B total / 3B active, multimodal, thinking-by-default). Produced in-house with **llm-compressor / compressed-tensors**, tuned for **NVIDIA Blackwell (sm120, RTX PRO 6000)**. **22.5 GB** (≈3× smaller than the 67 GB BF16 base) — the **smallest** of the public NVFP4 builds, while matching or beating them on quality and beating two of three on speed. ## Recipe (mixed-precision) | Component | Precision | |---|---| | Routed experts (256/layer, fused) | **NVFP4 weight-only (W4A16)** group-16 | | Self-attention q/k/v/o + GDN `in_proj_*`/`out_proj` + shared-expert | **FP8** (W8A8, block-128 weight / dynamic group-128 act) | | Routers, `lm_head`, embeddings, conv1d/SSM, vision tower, MTP | **BF16** | Why weight-only NVFP4 on experts: on sm120 the native FP4 MoE GEMM is unavailable, so all NVFP4 experts serve via the **Marlin W4A16** path regardless — W4A16 therefore gives the same speed as W4A4 with less quantization error. Calibrated with `moe_calibrate_all_experts=True` (every one of the 256 experts receives stats). ## Benchmarks (measured on RTX PRO 6000 / sm120, vLLM 0.23) lm-eval (thinking-on, `max_gen_toks=8192`, flexible-extract); speed from engine `/metrics`, TP1 solo. | Build | MMLU-Pro | GSM8K | single-stream tok/s | N16 tok/s | size | |---|---|---|---|---|---| | **this model (our self-quant)** | **0.825** | **0.920** | 200.6 | 1581 | **22.5 GB** | | unsloth/…-NVFP4 | 0.825 | 0.890 | 175.3 | 1493 | 24.7 GB | | RedHatAI/…-NVFP4 | 0.819 | 0.910 | 170.0 | 1422 | 24.0 GB | | nvidia/…-NVFP4 | 0.817 | 0.910 | **223.6** | **1646** | 23.4 GB | ![benchmark](benchmark.png) - **Pareto-dominates** RedHatAI & unsloth on quality, speed, *and* size. - **Tied-best quality** (top GSM8K, tied-top MMLU-Pro); **smallest** build. - nvidia keeps the single-stream/concurrent speed crown (it sits on the sm120 hardware optimum — FP8-attn + W4A16-experts via Marlin); this build matches its scheme and trails only on raw decode throughput. ## Serving (vLLM ≥ 0.23) ```bash vllm serve --served-model-name qwen3.6-35b-a3b-nvfp4 \ --max-model-len 262144 --gpu-memory-utilization 0.90 \ --trust-remote-code --reasoning-parser qwen3 \ --enable-auto-tool-choice --tool-call-parser qwen3_xml ``` Quantized by `kyaky` with llm-compressor. Base model © Qwen, Apache-2.0.