--- license: apache-2.0 base_model: Qwen/Qwen3-Omni-30B-A3B-Instruct tags: - vllm - vllm-omni - quantization - nvfp4 - w4a16 - modelopt - qwen3-omni - multimodal --- # Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A16 (AWQ-clip) ModelOpt NVFP4 W4A16 quantization of [Qwen/Qwen3-Omni-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct), calibrated with NVIDIA TensorRT-Model-Optimizer's `W4A16_NVFP4_CFG` + `awq_clip` algorithm. This checkpoint keeps multimodal capability intact while shrinking the thinker LM by ~57% on disk; multimodal encoders, talker, code2wav remain BF16. ## Accuracy vs BF16 baseline Benchmarked on RTX PRO 6000 Blackwell WS (sm_120), 0-shot greedy, `seed=42`. Daily-Omni accuracy uses the canonical [vllm-omni in-tree harness](https://github.com/vllm-project/vllm-omni/tree/main/tests/e2e/accuracy/qwen3_omni). ### Headline | Benchmark | BF16 | **NVFP4 W4A16 awq** | Δ | |---|---|---|---| | MMLU (n=100) | 0.72 | **0.72** | 0.00 | | GSM8K (n=100) | 0.95 | **0.94** | −0.01 | | Daily-Omni overall (n=50) | 0.72 | **0.72** | 0.00 | Within statistical noise on all three. Passes Daily-Omni's default 0.67 accept gate. ### Daily-Omni by question type | Task type | BF16 | **NVFP4 W4A16 awq** | n | |---|---|---|---| | Reasoning | 1.000 | **1.000** | 11 | | Inference | 1.000 | **0.750** | 4 | | AV Event Alignment | 1.000 | **1.000** | 3 | | Comparative | 0.667 | **0.667** | 9 | | Context understanding | 0.500 | **0.600** | 10 | | Event Sequence | 0.538 | **0.538** | 13 | ### Daily-Omni by video duration | Duration | BF16 | **NVFP4 W4A16 awq** | n | |---|---|---|---| | 30s clips | 0.727 | **0.697** | 33 | | 60s clips | 0.706 | **0.765** | 17 | ## Quantization scope | Layer | State | |---|---| | `thinker.model.*` attention QKV/O | NVFP4 W4A16 | | `thinker.model.*.mlp.experts.*` (gate_proj, up_proj, down_proj) | NVFP4 W4A16 | | `thinker.model.*.mlp.gate` (MoE router) | BF16 (kept full-precision) | | `thinker.audio_tower.*`, `thinker.visual.*`, `thinker.lm_head` | BF16 | | `talker`, `code2wav` | BF16 | ## On-disk format | Tensor | dtype | Notes | |---|---|---| | `*.weight` | uint8 | Packed FP4 E2M1, 2 values per byte | | `*.weight_scale` | float8_e4m3fn | Per-block scale (16-element groups along input dim) | | `*.weight_scale_2` | float32 | Per-tensor scalar scale | Disk size: ~26 GB (vs ~62 GB BF16, ~57% reduction). ## Inference via vllm-omni ```python from vllm_omni import Omni omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq") ``` Or via OpenAI-compatible server: ```bash vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq \ --omni --port 8000 ``` Compute requirement: **sm_80+ (Ampere or newer)** for the FP4 Marlin GEMM path. On sm_120 (RTX 5090, Pro 6000) and sm_90 (H100/H200) the dequant-then-BF16 matmul kernel handles W4A16 correctly. > **Note for vLLM ≤ v0.21.x users:** The MoE FP4 Marlin selection that makes this checkpoint work needs vLLM PR #42566. vllm-omni's upcoming PR adds a `patch.py` backport that self-extinguishes once vLLM ≥0.22 ships. ## Calibration recipe - Base model: `Qwen/Qwen3-Omni-30B-A3B-Instruct` in `bfloat16` - ModelOpt: `nvidia-modelopt[torch] >= 0.42` (tested with 0.45.0.dev) - Config: `mtq.W4A16_NVFP4_CFG` + algorithm `awq_clip` - Samples: **1024** (wikitext-103-raw-v1 + openai_humaneval mix; truncated to 512 tokens) - Excluded patterns: `*audio_tower*`, `*visual*`, `*talker*`, `*code2wav*`, `*lm_head*`, `*mlp.gate*` The exported `hf_quant_config.json` enumerates `*mlp.gate*` and `*lm_head*` in `exclude_modules` (added post-export because ModelOpt's exporter silently drops these from the wildcard exclusions when calibration was done correctly — see the vllm-omni PR for the upstream-tracking patch). ## Variant: `-max` A sibling checkpoint calibrated with ModelOpt's simpler `max` algorithm: [`YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max`](https://huggingface.co/YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max). On the same benchmarks: MMLU 0.74, GSM8K 0.92, Daily-Omni 0.74 — slightly different accuracy/cost tradeoff (faster to calibrate, no AWQ pass). ## License Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model). See the [base model page](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct) for additional usage terms. ## W4A4 NVFP4 siblings Same model, broader quantization surface (weights *and* activations to FP4) for **Blackwell-only** deployments: - [`YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip`](https://huggingface.co/YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip) — production W4A4, AWQ-clip text calibration, B200 throughput +33% vs BF16 at conc=32. Requires `vllm-omni` with the load-time NaN clamp from [vllm-project/vllm-omni#4025](https://github.com/vllm-project/vllm-omni/pull/4025) (defensive override, no-op for clean checkpoints).