Qwen FP8 to NVFP4 Quantization Pipeline

Convert Qwen-Rapid-AIO diffusion models from FP8 to NVFP4 format for faster inference on RTX 5090 (Blackwell) with custom calibration to preserve image quality.

Source Model: Qwen-Rapid-AIO-NSFW-v14.1 (27GB, FP8, LoRA-fused)
Inference Stack: ComfyUI on RTX 5090 with native NVFP4 support
Recommended Model: V3_NVFP4_custom_p999 (best cost-performance balance)

Why This Exists

The official comfy-dit-quantizer assumes BF16 source models. Our source is already FP8, which means:

FP8 weights must be upcast to FP32 before re-quantizing to NVFP4 (double quantization)
Text encoder FP8 layers are missing weight_scale tensors
Key names need normalization from model.diffusion_model.* to diffusers format
Base calibration data doesn't match our LoRA-merged model's activation distributions

This pipeline solves all four problems.
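
As an illustration of the key-name problem, here is a minimal sketch of stripping the `model.diffusion_model.` prefix from a state dict. It is not the code in fix_key_names.py: the function name `strip_prefix` and the example keys are hypothetical, and the real script also handles metadata.

```python
PREFIX = "model.diffusion_model."

def strip_prefix(state_dict, prefix=PREFIX):
    """Return a copy of state_dict with `prefix` removed from matching keys."""
    renamed = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in renamed:
            # Two source keys would collapse to one name; refuse to overwrite.
            raise KeyError(f"key collision after renaming: {new_key}")
        renamed[new_key] = value
    return renamed

# Hypothetical keys, with placeholder strings standing in for tensors.
example = {
    "model.diffusion_model.transformer_blocks.0.attn.to_q.weight": "tensor-a",
    "transformer_blocks.0.attn.to_k.weight": "tensor-b",
}
print(sorted(strip_prefix(example)))
```

Keys that do not carry the prefix pass through unchanged, so the function is safe to run on a partly normalized checkpoint.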

Recommended Model: V3_NVFP4_custom_p999

After benchmarking all versions across 100 prompts, V3 offers the best balance of speed and quality: a 1.84x speedup over the FP8 baseline (essentially tied with V2's 1.85x) with the tightest calibration scales.

Model	Avg Time/Image	Speedup	NVFP4 Layers	FP8 Layers	Key Trait
FP8 Baseline	8.38s	1.00x	0	All	Reference quality
V2 (custom calib)	4.52s	1.85x	360	599	First custom calibration
V3 (p99.9 calib)	4.55s	1.84x	360	599	Best cost-performance
V4 (selective)	6.05s	1.39x	240	719	Better quality, slower
V5 (text quality)	8.39s	1.00x	180	779	Best quality, no speedup

V3 uses P99.9 percentile calibration from 53 sample generations with skin-focused prompts. The p99.9 method produces ~74% tighter scales than max aggregation, giving NVFP4 layers better dynamic range utilization without being dominated by rare outlier activations.
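
To make the difference between max and percentile aggregation concrete, here is a small sketch using only the standard library. It assumes the aggregation runs over a list of recorded absolute-maximum values for one layer; the data is synthetic, and the `percentile` helper is illustrative rather than the pipeline's actual implementation.

```python
def percentile(values, q):
    """Linearly interpolated q-th percentile (0 <= q <= 100) of a non-empty sequence."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)

# Synthetic example: 999 typical readings and one rare outlier.
recorded_amax = [1.0] * 999 + [100.0]

max_amax = max(recorded_amax)                # dominated by the single outlier
p999_amax = percentile(recorded_amax, 99.9)  # stays close to the typical range

print(max_amax, p999_amax)
```

In this synthetic case the outlier sets the max-based scale, while the p99.9 value stays near the bulk of the data. The ~74% figure above is the measured result on the real calibration data, not something this toy example reproduces.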

V4 and V5 trade speed for quality by keeping more layers at FP8. V5 in particular loses all speedup benefit, making it impractical for production use.

The Problem: Quality Degradation in NVFP4

Converting FP8 to NVFP4 introduces two compounding quality issues:

1. Double Quantization

The pipeline stacks two aggressive quantizations, losing about 6 of the original BF16's 7 mantissa bits:

Original BF16 (7 mantissa bits)
    -> FP8 E4M3 (3 mantissa bits)               4 mantissa bits lost
    -> Upcast to FP32                            No recovery, same 3 bits of real info
    -> NVFP4 (E2M1: 1 mantissa bit, 8 magnitudes per sign, per-block scales)
                                                 6 mantissa bits lost from original

The upcast to FP32 does NOT recover original precision. It just represents already-degraded FP8 values in a larger format.
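
The chain can be sketched numerically. This is a simplified model, not the pipeline's code: `round_to_mantissa_bits` ignores exponent-range limits and subnormals, `quantize_fp4` uses a single hand-picked scale instead of NVFP4's per-block scales, and both helper names are made up for this example. The grid is the set of magnitudes representable in FP4 E2M1.

```python
import math

# Magnitudes representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit).
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def round_to_mantissa_bits(x, bits):
    """Round x to `bits` explicit mantissa bits (exponent range is not modeled)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)      # x = m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** (bits + 1)   # implicit leading bit plus `bits` explicit bits
    return math.ldexp(round(m * steps) / steps, e)

def quantize_fp4(x, scale):
    """Snap x / scale to the nearest E2M1 magnitude, clipping at 6, then rescale."""
    mag = min(abs(x) / scale, FP4_E2M1_GRID[-1])
    nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest * scale, x)

original = 0.7
as_fp8 = round_to_mantissa_bits(original, 3)   # first quantization
as_fp32 = round_to_mantissa_bits(as_fp8, 23)   # upcast: value is unchanged
as_fp4 = quantize_fp4(as_fp32, scale=0.25)     # second quantization

print(original, as_fp8, as_fp32, as_fp4)
```

The upcast step returns exactly the FP8 value, so the second quantization starts from already-rounded data rather than from the original weight.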

Constraint: We cannot access the original BF16 weights or the original FP8 quantization scales. The Qwen-Rapid-AIO models are only distributed in FP8 format.

2. Calibration Mismatch

The base calibration (calibs/qwen-image-edit-2511.json) was generated from the original qwen-image-edit-2511 model, not our LoRA-merged variant. When LoRAs are merged (W' = W + BA), weight magnitudes and activation ranges shift significantly. Our custom calibration values were ~20x larger than the base calibration, confirming the mismatch.
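
A toy example of how a LoRA merge can shift weight magnitudes. The matrices are invented for illustration and far smaller than real layers; the only point is that the absolute maximum of W' = W + BA can differ substantially from that of W, so scales calibrated on W no longer fit.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(w, b, a):
    """Return W' = W + B @ A."""
    delta = matmul(b, a)
    return [[w[i][j] + delta[i][j] for j in range(len(w[0]))] for i in range(len(w))]

def amax(matrix):
    """Largest absolute value in the matrix."""
    return max(abs(v) for row in matrix for v in row)

W = [[0.5, -0.25], [0.125, 0.5]]   # base weight (2x2)
B = [[2.0], [1.0]]                 # LoRA up-projection (2x1, rank 1)
A = [[1.0, -1.0]]                  # LoRA down-projection (1x2)

W_merged = merge_lora(W, B, A)
print(amax(W), amax(W_merged))
```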

Mismatch Type	Effect on Output
Scale too large	Activations quantized toward zero (underflow)
Scale too small	Activations clipped at max (overflow)
General mismatch	Poor utilization of NVFP4's 16 code points (8 magnitudes per sign)
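
The first two rows of the table can be reproduced with a small fake-quantization sketch. It assumes a single scale per tensor and the FP4 E2M1 magnitude grid; the helper name `fake_quant_fp4`, the activation values and the scales are chosen for illustration only.

```python
import math

# Magnitudes representable in FP4 E2M1.
FP4_E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def fake_quant_fp4(x, scale):
    """Quantize x to the E2M1 grid under `scale` and map it back to real units."""
    mag = min(abs(x) / scale, FP4_E2M1_GRID[-1])   # clip at the largest magnitude
    nearest = min(FP4_E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest * scale, x)

activations = [0.25, 0.5, 1.0, 3.0]

matched = [fake_quant_fp4(x, 0.5) for x in activations]        # scale = amax / 6
too_large = [fake_quant_fp4(x, 8.0) for x in activations]      # 16x too large
too_small = [fake_quant_fp4(x, 0.03125) for x in activations]  # 16x too small

print(matched)
print(too_large)
print(too_small)
```

With the matched scale the values survive; with the oversized scale the small activations collapse to zero; with the undersized scale every activation clips to the same maximum.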

Improvement Journey

We iterated through 5 model versions, each addressing a specific quality or performance issue:

V1: Baseline NVFP4 (External Calibration)

Used third-party calibration data that was not matched to this LoRA-fused model
Noticeable quality degradation vs FP8 original

V2: Custom Calibration

Generated custom calibration with 10 sample generations
Custom values ~20x larger than base calibration, confirming LoRA merge changed activation distributions
Some improvement, but skin quality still worse than FP8

V3: P99.9 Calibration + Skin-Focused Prompts (Recommended)

Increased calibration to 53 samples (the target was 150, but ComfyUI ran out of memory after 53)
Added 45+ portrait/skin-detail calibration prompts
Switched from max to P99.9 aggregation (~74% tighter scales)
Best balance of speed (1.84x) and quality

V4: Selective NVFP4

Moved attn.to_v and attn.to_out.0 from NVFP4 to FP8
Based on SVDQuant research: value projections carry content fidelity
Better quality but speedup dropped to 1.39x

V5: Text Quality Optimized

Moved img_mlp.net.0.proj to FP8 (GELU-adjacent, bimodal activation distribution)
Only 3 layer types remain NVFP4: attn.to_q, attn.to_k, img_mlp.net.2
Best quality but zero speedup (8.39s vs 8.38s baseline)

Research-Based Layer Sensitivity Ranking

From SVDQuant, ViDiT-Q, FP4DiT, and NVIDIA FLUX NVFP4 findings:

Sensitivity	Layers	Recommendation
Critical	Text-stream attention (`add_q/k/v_proj`, `to_add_out`)	Keep FP8
High	Modulation layers (`img_mod.1`, `txt_mod.1`)	Keep FP8
High	GELU-adjacent MLP (`img_mlp.net.0.proj`)	Keep FP8 (V5+)
Medium	Value/Output projections (`attn.to_v`, `attn.to_out.0`)	Keep FP8 (V4+)
Low	Post-GELU MLP (`img_mlp.net.2`)	NVFP4 tolerant
Low	Q/K projections (`attn.to_q`, `attn.to_k`)	NVFP4 tolerant
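
The per-version NVFP4/FP8 split can be written as a suffix-based policy. This is a sketch, not the format of the configs in comfy-dit-quantizer/configs/: the `transformer_blocks.{i}.` key layout and the function name `target_format` are assumptions, and the V3 set is inferred from the V4 and V5 descriptions above.

```python
NUM_BLOCKS = 60

# Layer-name suffixes quantized to NVFP4 in each version; everything else stays FP8.
NVFP4_SUFFIXES = {
    "V3": ("attn.to_q", "attn.to_k", "attn.to_v", "attn.to_out.0",
           "img_mlp.net.0.proj", "img_mlp.net.2"),
    "V4": ("attn.to_q", "attn.to_k", "img_mlp.net.0.proj", "img_mlp.net.2"),
    "V5": ("attn.to_q", "attn.to_k", "img_mlp.net.2"),
}

def target_format(layer_name, version):
    """Return 'nvfp4' if the layer is in the version's NVFP4 set, else 'fp8'."""
    return "nvfp4" if layer_name.endswith(NVFP4_SUFFIXES[version]) else "fp8"

def count_nvfp4_layers(version):
    """Count NVFP4 layers across all blocks, using the assumed key layout."""
    candidates = NVFP4_SUFFIXES["V3"]  # superset of every version's NVFP4 layers
    return sum(
        target_format(f"transformer_blocks.{i}.{suffix}", version) == "nvfp4"
        for i in range(NUM_BLOCKS)
        for suffix in candidates
    )

for version in ("V3", "V4", "V5"):
    print(version, count_nvfp4_layers(version))
```

The resulting counts (60 blocks times 6, 4 and 3 layer types) agree with the NVFP4 Layers column in the comparison table.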

Further Improvement Plan

Three solutions remain available for teams needing even better quality:

Solution	Quality Impact	Effort	Performance Cost
Preserve FP8-to-FP8 layers	Good	Low	None
Custom calibration on actual model	High	Medium	None
Selective NVFP4 scope reduction	Good	Medium	Some slowdown

Preserve FP8-to-FP8 layers: The current pipeline re-quantizes FP8 layers through FP32 even when the target is FP8. Skipping this for FP8-targeted layers eliminates ~40% of unnecessary weight degradation with zero downside.

Full inference calibration: Running 256-512 diverse generations through the actual LoRA-merged model (varied prompts, image sizes, guidance scales, denoising steps) would produce the most accurate activation scales. The hook-based approach instruments the model during inference to capture per-layer amax values.

Selective NVFP4: Profiling which specific layers contribute most to quality degradation allows fine-tuning the NVFP4/FP8 boundary for the best quality-speed tradeoff on your specific use case.

Benchmark Results

100 prompts, RTX 5090, excluding cold-start inference:

Model	Images	Total Time	Avg/Image	Speedup	Cold Start
FP8 Baseline	98	820.8s	8.38s	1.00x	48.16s
V2 (custom)	98	443.4s	4.52s	1.85x	68.52s
V3 (p99.9)	98	446.4s	4.55s	1.84x	60.43s
V4 (selective)	98	592.6s	6.05s	1.39x	50.58s
V5 (text quality)	98	822.0s	8.39s	1.00x	64.24s

V1 produced 0 images due to loading errors (metadata corruption, fixed in later versions).
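
The Speedup column can be recomputed from the totals in the table. The sketch below copies the image counts and total times from the table; the function name `speedups` is illustrative.

```python
# (images, total seconds) copied from the benchmark table.
RESULTS = {
    "FP8 Baseline": (98, 820.8),
    "V2 (custom)": (98, 443.4),
    "V3 (p99.9)": (98, 446.4),
    "V4 (selective)": (98, 592.6),
    "V5 (text quality)": (98, 822.0),
}

def speedups(results, baseline="FP8 Baseline"):
    """Speedup = baseline average time per image / model average time per image."""
    base_images, base_total = results[baseline]
    base_avg = base_total / base_images
    return {
        name: round(base_avg / (total / images), 2)
        for name, (images, total) in results.items()
    }

for name, factor in speedups(RESULTS).items():
    print(f"{name}: {factor:.2f}x")
```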

Pipeline Overview

Phase 1: Calibration (one-time per model)
  setup_calibration.sh        Patch ComfyUI to capture activation stats
  run_calibration.py          Run N generations via ComfyUI API
  convert_calibration.py      CALIB_DATA.json -> calibs/<name>.json
  restore_comfy.sh            Remove patches from ComfyUI

Phase 2: Conversion (4 steps)
  Step 1  quantize_fp8_to_nvfp4.py     FP8 -> NVFP4/FP8 per config policy
  Step 2  fix_text_encoder_scales.py    Add missing weight_scale for text encoder
  Step 3  fix_key_names.py              Strip model.diffusion_model.* prefix
  Step 4  add_input_scale.py            Apply calibration input_scale values

Quick Start

conda activate comfy

# Phase 1: Calibration
./setup_calibration.sh
python run_calibration.py
./restore_comfy.sh

# Phase 2: Conversion
./convert_with_custom_calib.sh NSFW calibs/qwen-rapid-aio-custom-p999.json

# Deploy
cp Qwen-Rapid-AIO-NSFW-v14.1-NVFP4-custom-p999.safetensors \
   /home/administrator/ComfyUI/models/checkpoints/

Architecture

The model uses a joint attention (MMDiT) architecture with 60 transformer blocks, each with parallel IMG and TXT streams:

User Prompt
    |
Separate CLIP (qwen_2.5_vl_7b_fp8_scaled.safetensors)
    |
Text Embeddings
    |
Diffusion Model (60 blocks, joint attention)
    |-- TXT stream: text embeddings guide generation (FP8, preserved)
    |-- IMG stream: image tokens through denoising (NVFP4/FP8 mix)
    |
VAE Decoder (qwen_image_vae.safetensors)
    |
Final Image

Each block has 18 layers. In V3 (recommended): 6 per block are NVFP4, 12 remain FP8.

Project Structure

qwen-fp8-to-nvfp4-quantization/
├── quantize_fp8_to_nvfp4.py        # Step 1: FP8 -> NVFP4 quantization
├── fix_text_encoder_scales.py       # Step 2: Compute text encoder weight_scale
├── fix_key_names.py                 # Step 3: Key name normalization + metadata
├── convert_with_custom_calib.sh     # 4-step conversion orchestrator
├── run_calibration.py               # Automated calibration via ComfyUI API
├── convert_calibration.py           # Convert CALIB_DATA.json format
├── setup_calibration.sh             # Apply calibration patches to ComfyUI
├── restore_comfy.sh                 # Undo calibration patches
├── comfy-dit-quantizer/             # Upstream quantizer (submodule)
│   ├── add_input_scale.py           # Step 4: Add input_scale from calibration
│   └── configs/                     # Quantization policy configs
├── calibs/                          # Custom calibration outputs
├── debug/                           # Debug and analysis tools
├── QUALITY_ANALYSIS.md              # Root cause analysis of quality issues
├── QUANTIZATION_JOURNEY.md          # Full iteration history (V1-V5)
├── CALIBRATION_GUIDE.md             # Detailed calibration reference
└── qwen_model_benchmark_report_0214.md  # Benchmark results (100 prompts)

Prerequisites

NVIDIA GPU with NVFP4 support (RTX 50 series / Blackwell)
~50 GB free disk space
Conda environment with PyTorch >= 2.x, safetensors, requests
ComfyUI installed (for calibration)

Known Issues

Issue	Cause	Fix
V1 model fails to load	`_quantization_metadata` copied from wrong reference	Fixed in V4+ with metadata patching
ComfyUI OOM after ~53 calibration runs	Calibration hooks accumulate memory	Use available data (53 samples sufficient for p99.9)
`convert_with_custom_calib.sh` path bug	Relative paths break after `cd`	Use absolute paths for Step 4
Triton `libcuda.so` linker error	Missing unversioned symlink	`sudo ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so`

References

comfy-dit-quantizer: core quantization tools
SVDQuant (MIT): layer sensitivity research
FP4DiT: GELU-adjacent layer findings
QUALITY_ANALYSIS.md: root cause analysis
QUANTIZATION_JOURNEY.md: full V1-V5 iteration history
CALIBRATION_GUIDE.md: calibration workflow reference

License

Scripts follow the same license as comfy-dit-quantizer.
