---
license: apache-2.0
base_model: Qwen/Qwen3.6-35B-A3B
base_model_relation: quantized
tags:
- nvfp4
- fp4
- llm-compressor
- compressed-tensors
- vllm
- moe
- qwen3_5_moe
language:
- en
pipeline_tag: text-generation
---

# Qwen3.6-35B-A3B-NVFP4 (self-quantized, llm-compressor)

NVFP4 (4-bit) quantization of [Qwen/Qwen3.6-35B-A3B](https://huggingface.co/Qwen/Qwen3.6-35B-A3B), the hybrid
**Gated-DeltaNet + 256-expert MoE** model (35B total / 3B active, multimodal, thinking-by-default).
Produced in-house with **llm-compressor / compressed-tensors**, tuned for **NVIDIA Blackwell (sm120, RTX PRO 6000)**.

**22.5 GB** (≈3× smaller than the 67 GB BF16 base) — the **smallest** of the public NVFP4 builds, while matching or
beating them on quality and beating two of three on speed.

## Recipe (mixed-precision)

| Component | Precision |
|---|---|
| Routed experts (256/layer, fused) | **NVFP4 weight-only (W4A16)** group-16 |
| Self-attention q/k/v/o + GDN `in_proj_*`/`out_proj` + shared-expert | **FP8** (W8A8, block-128 weight / dynamic group-128 act) |
| Routers, `lm_head`, embeddings, conv1d/SSM, vision tower, MTP | **BF16** |

Why weight-only NVFP4 on experts: on sm120 the native FP4 MoE GEMM is unavailable, so all NVFP4 experts serve via the
**Marlin W4A16** path regardless — W4A16 therefore gives the same speed as W4A4 with less quantization error.
Calibrated with `moe_calibrate_all_experts=True` (every one of the 256 experts receives stats).

## Benchmarks (measured on RTX PRO 6000 / sm120, vLLM 0.23)

lm-eval (thinking-on, `max_gen_toks=8192`, flexible-extract); speed from engine `/metrics`, TP1 solo.

| Build | MMLU-Pro | GSM8K | single-stream tok/s | N16 tok/s | size |
|---|---|---|---|---|---|
| **this model (our self-quant)** | **0.825** | **0.920** | 200.6 | 1581 | **22.5 GB** |
| unsloth/…-NVFP4 | 0.825 | 0.890 | 175.3 | 1493 | 24.7 GB |
| RedHatAI/…-NVFP4 | 0.819 | 0.910 | 170.0 | 1422 | 24.0 GB |
| nvidia/…-NVFP4 | 0.817 | 0.910 | **223.6** | **1646** | 23.4 GB |

![benchmark](benchmark.png)

- **Pareto-dominates** RedHatAI & unsloth on quality, speed, *and* size.
- **Tied-best quality** (top GSM8K, tied-top MMLU-Pro); **smallest** build.
- nvidia keeps the single-stream/concurrent speed crown (it sits on the sm120 hardware optimum — FP8-attn + W4A16-experts via Marlin); this build matches its scheme and trails only on raw decode throughput.

## Serving (vLLM ≥ 0.23)

```bash
vllm serve <this-repo> --served-model-name qwen3.6-35b-a3b-nvfp4 \
  --max-model-len 262144 --gpu-memory-utilization 0.90 \
  --trust-remote-code --reasoning-parser qwen3 \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml
```

Quantized by `kyaky` with llm-compressor. Base model © Qwen, Apache-2.0.