---
license: apache-2.0
base_model: Qwen/Qwen3-Omni-30B-A3B-Instruct
tags:
  - vllm
  - vllm-omni
  - quantization
  - nvfp4
  - w4a16
  - modelopt
  - qwen3-omni
  - multimodal
---

# Qwen3-Omni-30B-A3B-Instruct — NVFP4 W4A16 (AWQ-clip)

ModelOpt NVFP4 W4A16 quantization of [Qwen/Qwen3-Omni-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct), calibrated with NVIDIA TensorRT-Model-Optimizer's `W4A16_NVFP4_CFG` + `awq_clip` algorithm.

This checkpoint keeps multimodal capability intact while shrinking the thinker LM by ~57% on disk; multimodal encoders, talker, code2wav remain BF16.

## Accuracy vs BF16 baseline

Benchmarked on RTX PRO 6000 Blackwell WS (sm_120), 0-shot greedy, `seed=42`. Daily-Omni accuracy uses the canonical [vllm-omni in-tree harness](https://github.com/vllm-project/vllm-omni/tree/main/tests/e2e/accuracy/qwen3_omni).

### Headline

| Benchmark | BF16 | **NVFP4 W4A16 awq** | Δ |
|---|---|---|---|
| MMLU (n=100) | 0.72 | **0.72** | 0.00 |
| GSM8K (n=100) | 0.95 | **0.94** | −0.01 |
| Daily-Omni overall (n=50) | 0.72 | **0.72** | 0.00 |

Within statistical noise on all three. Passes Daily-Omni's default 0.67 accept gate.

### Daily-Omni by question type

| Task type | BF16 | **NVFP4 W4A16 awq** | n |
|---|---|---|---|
| Reasoning | 1.000 | **1.000** | 11 |
| Inference | 1.000 | **0.750** | 4 |
| AV Event Alignment | 1.000 | **1.000** | 3 |
| Comparative | 0.667 | **0.667** | 9 |
| Context understanding | 0.500 | **0.600** | 10 |
| Event Sequence | 0.538 | **0.538** | 13 |

### Daily-Omni by video duration

| Duration | BF16 | **NVFP4 W4A16 awq** | n |
|---|---|---|---|
| 30s clips | 0.727 | **0.697** | 33 |
| 60s clips | 0.706 | **0.765** | 17 |

## Quantization scope

| Layer | State |
|---|---|
| `thinker.model.*` attention QKV/O | NVFP4 W4A16 |
| `thinker.model.*.mlp.experts.*` (gate_proj, up_proj, down_proj) | NVFP4 W4A16 |
| `thinker.model.*.mlp.gate` (MoE router) | BF16 (kept full-precision) |
| `thinker.audio_tower.*`, `thinker.visual.*`, `thinker.lm_head` | BF16 |
| `talker`, `code2wav` | BF16 |

## On-disk format

| Tensor | dtype | Notes |
|---|---|---|
| `*.weight` | uint8 | Packed FP4 E2M1, 2 values per byte |
| `*.weight_scale` | float8_e4m3fn | Per-block scale (16-element groups along input dim) |
| `*.weight_scale_2` | float32 | Per-tensor scalar scale |

Disk size: ~26 GB (vs ~62 GB BF16, ~57% reduction).

## Inference via vllm-omni

```python
from vllm_omni import Omni
omni = Omni(model="YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq")
```

Or via OpenAI-compatible server:

```bash
vllm serve YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-awq \
    --omni --port 8000
```

Compute requirement: **sm_80+ (Ampere or newer)** for the FP4 Marlin GEMM path. On sm_120 (RTX 5090, Pro 6000) and sm_90 (H100/H200) the dequant-then-BF16 matmul kernel handles W4A16 correctly.

> **Note for vLLM ≤ v0.21.x users:** The MoE FP4 Marlin selection that makes this checkpoint work needs vLLM PR #42566. vllm-omni's upcoming PR adds a `patch.py` backport that self-extinguishes once vLLM ≥0.22 ships.

## Calibration recipe

- Base model: `Qwen/Qwen3-Omni-30B-A3B-Instruct` in `bfloat16`
- ModelOpt: `nvidia-modelopt[torch] >= 0.42` (tested with 0.45.0.dev)
- Config: `mtq.W4A16_NVFP4_CFG` + algorithm `awq_clip`
- Samples: **1024** (wikitext-103-raw-v1 + openai_humaneval mix; truncated to 512 tokens)
- Excluded patterns: `*audio_tower*`, `*visual*`, `*talker*`, `*code2wav*`, `*lm_head*`, `*mlp.gate*`

The exported `hf_quant_config.json` enumerates `*mlp.gate*` and `*lm_head*` in `exclude_modules` (added post-export because ModelOpt's exporter silently drops these from the wildcard exclusions when calibration was done correctly — see the vllm-omni PR for the upstream-tracking patch).

## Variant: `-max`

A sibling checkpoint calibrated with ModelOpt's simpler `max` algorithm:
[`YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max`](https://huggingface.co/YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A16-max).

On the same benchmarks: MMLU 0.74, GSM8K 0.92, Daily-Omni 0.74 — slightly different accuracy/cost tradeoff (faster to calibrate, no AWQ pass).

## License

Apache-2.0 (inherits from the base Qwen3-Omni-30B-A3B-Instruct model). See the [base model page](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct) for additional usage terms.

## W4A4 NVFP4 siblings

Same model, broader quantization surface (weights *and* activations to FP4) for **Blackwell-only** deployments:

- [`YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip`](https://huggingface.co/YihongJin/Qwen3-Omni-30B-A3B-Instruct-NVFP4-W4A4-full-thinker-awqclip) — production W4A4, AWQ-clip text calibration, B200 throughput +33% vs BF16 at conc=32. Requires `vllm-omni` with the load-time NaN clamp from [vllm-project/vllm-omni#4025](https://github.com/vllm-project/vllm-omni/pull/4025) (defensive override, no-op for clean checkpoints).