Qwen FP8 to NVFP4 Quantization Pipeline

Convert Qwen-Rapid-AIO diffusion models from FP8 to NVFP4 format for faster inference on RTX 5090 (Blackwell) with custom calibration to preserve image quality.

Source Model: Qwen-Rapid-AIO-NSFW-v14.1 (27GB, FP8, LoRA-fused)
Inference Stack: ComfyUI on RTX 5090 with native NVFP4 support
Recommended Model: V3_NVFP4_custom_p999 (best cost-performance balance)


Why This Exists

The official comfy-dit-quantizer assumes BF16 source models. Our source is already FP8, which means:

  1. FP8 weights must be upcast to FP32 before re-quantizing to NVFP4 (double quantization)
  2. Text encoder FP8 layers are missing weight_scale tensors
  3. Key names need normalization from model.diffusion_model.* to diffusers format
  4. Base calibration data doesn't match our LoRA-merged model's activation distributions

This pipeline solves all four problems.


Recommended Model: V3_NVFP4_custom_p999

After benchmarking all versions across 100 prompts, V3 delivers the best cost-performance ratio: 1.84x speedup over FP8 baseline with the tightest calibration scales.

| Model | Avg Time/Image | Speedup | NVFP4 Layers | FP8 Layers | Key Trait |
|---|---|---|---|---|---|
| FP8 Baseline | 8.38s | 1.00x | 0 | All | Reference quality |
| V2 (custom calib) | 4.52s | 1.85x | 360 | 599 | First custom calibration |
| V3 (p99.9 calib) | 4.55s | 1.84x | 360 | 599 | Best cost-performance |
| V4 (selective) | 6.05s | 1.39x | 240 | 719 | Better quality, slower |
| V5 (text quality) | 8.39s | 1.00x | 180 | 779 | Best quality, no speedup |

V3 uses P99.9 percentile calibration from 53 sample generations with skin-focused prompts. The p99.9 method produces ~74% tighter scales than max aggregation, giving NVFP4 layers better dynamic range utilization without being dominated by rare outlier activations.
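The difference between max and P99.9 aggregation can be illustrated with a small, dependency-free sketch. It assumes the calibration step has a pool of observed absolute activation values per layer; the function names and the synthetic data are hypothetical and are not taken from convert_calibration.py.

```python
import math

def percentile(values, q):
    """Return the q-th percentile (0 to 100) with linear interpolation."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = math.floor(rank)
    upper = math.ceil(rank)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def aggregate_amax(abs_values, method="p99.9"):
    """Collapse observed absolute activations into one calibration amax."""
    if method == "max":
        return max(abs_values)
    if method == "p99.9":
        return percentile(abs_values, 99.9)
    raise ValueError(f"unknown method: {method}")

# Synthetic data (not from the real model): 9,999 ordinary values plus one outlier.
observed = [i / 1000 for i in range(1, 10000)] + [40.0]
amax_max = aggregate_amax(observed, "max")
amax_p999 = aggregate_amax(observed, "p99.9")
print(f"max: {amax_max:.3f}  p99.9: {amax_p999:.3f}")
```

With max aggregation the single outlier sets the scale for the whole layer. The percentile ignores it, so the bulk of the activations get more of the available quantization levels, at the cost of clipping the rare outlier.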

V4 and V5 trade speed for quality by keeping more layers at FP8. V5 in particular loses all speedup benefit, making it impractical for production use.


The Problem: Quality Degradation in NVFP4

Converting FP8 to NVFP4 introduces two compounding quality issues:

1. Double Quantization

The pipeline stacks two aggressive quantizations, losing roughly 6 mantissa bits relative to the original BF16:

Original BF16 (7 mantissa bits)
    -> FP8 E4M3 (3 mantissa bits)      ~4 bits lost
    -> Upcast to FP32                   No recovery, same 3 bits of real info
    -> NVFP4 E2M1 (1 mantissa bit)      ~6 bits total lost from original

The upcast to FP32 does NOT recover original precision. It just represents already-degraded FP8 values in a larger format.

Constraint: We cannot access the original BF16 weights or the original FP8 quantization scales. The Qwen-Rapid-AIO models are only distributed in FP8 format.
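A minimal sketch of the double-quantization chain, using only the standard library. It simulates each format by rounding the mantissa and ignores exponent range limits, subnormals, and NVFP4's per-block scaling (a single scale is used here for simplicity). The helper names and sample weights are hypothetical.

```python
import math

# Non-negative magnitudes representable by a 4-bit E2M1 value.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def round_mantissa(x, mantissa_bits):
    """Round x to the nearest value with the given number of explicit mantissa bits.

    Simplification: exponent range limits and subnormals are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** (mantissa_bits + 1)
    return math.ldexp(round(m * steps) / steps, e)

def quantize_e2m1(x, scale):
    """Snap x to the nearest E2M1 magnitude times scale, clipping at 6 * scale."""
    mag = min(abs(x) / scale, E2M1_MAGNITUDES[-1])
    nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - mag))
    return math.copysign(nearest * scale, x)

weights = [0.8731, -0.4412, 0.1279, 0.6054]               # illustrative values only
fp8_values = [round_mantissa(round_mantissa(w, 7), 3) for w in weights]
scale = max(abs(v) for v in fp8_values) / E2M1_MAGNITUDES[-1]

for original, fp8 in zip(weights, fp8_values):
    upcast = float(fp8)                                    # FP32 holds the FP8 value exactly
    nvfp4 = quantize_e2m1(upcast, scale)
    print(f"{original:+.4f} -> fp8 {fp8:+.4f} -> nvfp4 {nvfp4:+.4f}")
```

The upcast step changes nothing: every simulated FP8 value is already exactly representable at higher precision, so the error introduced by the first quantization is carried into the second.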

2. Calibration Mismatch

The base calibration (calibs/qwen-image-edit-2511.json) was generated from the original qwen-image-edit-2511 model, not our LoRA-merged variant. When LoRAs are merged (W' = W + BA), weight magnitudes and activation ranges shift significantly. Our custom calibration values were ~20x larger than the base calibration, confirming the mismatch.

| Mismatch Type | Effect on Output |
|---|---|
| Scale too large | Activations quantized toward zero (underflow) |
| Scale too small | Activations clipped at max (overflow) |
| General mismatch | Poor utilization of NVFP4's limited set of quantization levels |
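The underflow and overflow cases can be reproduced with a toy quantizer. This is a sketch under simplifying assumptions: one scale for all values (real NVFP4 also uses per-block scales), made-up activation values, and a 20x mismatch chosen only to echo the gap observed between the base and custom calibrations.

```python
# Non-negative magnitudes representable by a 4-bit E2M1 value.
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

def fake_quantize(values, scale):
    """Quantize to the E2M1 grid and immediately dequantize."""
    out = []
    for v in values:
        mag = min(abs(v) / scale, E2M1_MAGNITUDES[-1])   # clip at the largest level
        nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - mag))
        out.append(nearest * scale if v >= 0 else -nearest * scale)
    return out

def mean_abs_error(values, scale):
    quantized = fake_quantize(values, scale)
    return sum(abs(a - b) for a, b in zip(values, quantized)) / len(values)

activations = [0.05, 0.2, 0.4, 0.9, 1.3, 2.7]            # illustrative values only
matched = max(activations) / E2M1_MAGNITUDES[-1]

for label, scale in (
    ("matched", matched),
    ("20x too large", matched * 20),
    ("20x too small", matched / 20),
):
    print(f"{label:14s} error={mean_abs_error(activations, scale):.3f} "
          f"values={fake_quantize(activations, scale)}")
```

With the oversized scale most values collapse to zero, and with the undersized scale everything above the top level is clipped to the same value.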

Improvement Journey

We iterated through 5 model versions, each addressing a specific quality or performance issue:

V1: Baseline NVFP4 (External Calibration)

  • Used third-party calibration data that was not matched to this LoRA-fused model
  • Noticeable quality degradation vs FP8 original

V2: Custom Calibration

  • Generated custom calibration with 10 sample generations
  • Custom values ~20x larger than base calibration, confirming LoRA merge changed activation distributions
  • Some improvement, but skin quality still worse than FP8

V3: P99.9 Calibration + Skin-Focused Prompts (Recommended)

  • Increased calibration to 53 samples (the target was 150, but ComfyUI ran out of memory after 53)
  • Added 45+ portrait/skin-detail calibration prompts
  • Switched from max to P99.9 aggregation (~74% tighter scales)
  • Best balance of speed (1.84x) and quality

V4: Selective NVFP4

  • Moved attn.to_v and attn.to_out.0 from NVFP4 to FP8
  • Based on SVDQuant research: value projections carry content fidelity
  • Better quality but speedup dropped to 1.39x

V5: Text Quality Optimized

  • Moved img_mlp.net.0.proj to FP8 (GELU-adjacent, bimodal activation distribution)
  • Only 3 layer types remain NVFP4: attn.to_q, attn.to_k, img_mlp.net.2
  • Best quality but zero speedup (8.39s vs 8.38s baseline)

Research-Based Layer Sensitivity Ranking

From SVDQuant, ViDiT-Q, FP4DiT, and NVIDIA FLUX NVFP4 findings:

| Sensitivity | Layers | Recommendation |
|---|---|---|
| Critical | Text-stream attention (add_q/k/v_proj, to_add_out) | Keep FP8 |
| High | Modulation layers (img_mod.1, txt_mod.1) | Keep FP8 |
| High | GELU-adjacent MLP (img_mlp.net.0.proj) | Keep FP8 (V5+) |
| Medium | Value/Output projections (attn.to_v, attn.to_out.0) | Keep FP8 (V4+) |
| Low | Post-GELU MLP (img_mlp.net.2) | NVFP4 tolerant |
| Low | Q/K projections (attn.to_q, attn.to_k) | NVFP4 tolerant |
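The per-version NVFP4/FP8 split described above can be written down as a small lookup. The V3 set is inferred from the V4 and V5 descriptions (V4 removed attn.to_v and attn.to_out.0, V5 removed img_mlp.net.0.proj). The transformer_blocks.N. key layout, the attn. prefix on the text-stream projections, and the helper names are assumptions, and only the layers named in this document are listed, not all 18 per block.

```python
NUM_BLOCKS = 60

# Layer suffixes quantized to NVFP4 in each version; everything else stays FP8.
NVFP4_SUFFIXES = {
    "V3": ("attn.to_q", "attn.to_k", "attn.to_v", "attn.to_out.0",
           "img_mlp.net.0.proj", "img_mlp.net.2"),
    "V4": ("attn.to_q", "attn.to_k", "img_mlp.net.0.proj", "img_mlp.net.2"),
    "V5": ("attn.to_q", "attn.to_k", "img_mlp.net.2"),
}

# Partial per-block layer list: only the layers named in this document.
KNOWN_BLOCK_LAYERS = (
    "attn.to_q", "attn.to_k", "attn.to_v", "attn.to_out.0",
    "attn.add_q_proj", "attn.add_k_proj", "attn.add_v_proj", "attn.to_add_out",
    "img_mod.1", "txt_mod.1", "img_mlp.net.0.proj", "img_mlp.net.2",
)

def target_format(layer_name, version):
    """Return 'nvfp4' or 'fp8' for a layer under the given version's policy."""
    suffixes = NVFP4_SUFFIXES[version]
    return "nvfp4" if any(layer_name.endswith(s) for s in suffixes) else "fp8"

def count_nvfp4(version):
    """Count NVFP4 layers across all blocks (hypothetical key layout)."""
    names = (f"transformer_blocks.{i}.{layer}"
             for i in range(NUM_BLOCKS) for layer in KNOWN_BLOCK_LAYERS)
    return sum(target_format(name, version) == "nvfp4" for name in names)

for version in NVFP4_SUFFIXES:
    print(version, count_nvfp4(version))
```

The counts this produces line up with the NVFP4 layer totals reported for V3, V4, and V5 (six, four, and three layer types across 60 blocks).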

Further Improvement Plan

Three solutions remain available for teams needing even better quality:

| Solution | Quality Impact | Effort | Performance Cost |
|---|---|---|---|
| Preserve FP8-to-FP8 layers | Good | Low | None |
| Custom calibration on actual model | High | Medium | None |
| Selective NVFP4 scope reduction | Good | Medium | Some slowdown |

Preserve FP8-to-FP8 layers: The current pipeline re-quantizes FP8 layers through FP32 even when the target is FP8. Skipping this for FP8-targeted layers eliminates ~40% of unnecessary weight degradation with zero downside.

Full inference calibration: Running 256-512 diverse generations through the actual LoRA-merged model (varied prompts, image sizes, guidance scales, denoising steps) would produce the most accurate activation scales. The hook-based approach instruments the model during inference to capture per-layer amax values.
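The bookkeeping side of hook-based capture might look like the sketch below. It is framework-free on purpose: in PyTorch, observe would typically be called from a forward pre-hook (registered with torch.nn.Module.register_forward_pre_hook) with the layer's input, and that wiring is omitted here. The class name, layer names, and output layout are hypothetical and do not reflect the actual CALIB_DATA.json schema.

```python
import json
from collections import defaultdict

class AmaxCollector:
    """Accumulates one amax value per layer per observed forward call."""

    def __init__(self):
        self.samples = defaultdict(list)

    def observe(self, layer_name, values):
        # Keep only the largest absolute value, not the activations themselves.
        self.samples[layer_name].append(max(abs(float(v)) for v in values))

    def summary(self):
        return {
            name: {"calls": len(amaxes), "amax": max(amaxes)}
            for name, amaxes in self.samples.items()
        }

collector = AmaxCollector()
# Stand-ins for what a hook would pass in during real generations.
collector.observe("transformer_blocks.0.attn.to_q", [0.1, -2.5, 0.7])
collector.observe("transformer_blocks.0.attn.to_q", [0.3, 1.9, -0.2])
collector.observe("transformer_blocks.0.img_mlp.net.2", [-4.0, 0.5])
print(json.dumps(collector.summary(), indent=2))
```

Keeping the per-call amax values rather than only a running maximum leaves room to apply a percentile instead of max when the samples are aggregated.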

Selective NVFP4: Profiling which specific layers contribute most to quality degradation allows fine-tuning the NVFP4/FP8 boundary for the best quality-speed tradeoff on your specific use case.


Benchmark Results

100 prompts, RTX 5090, excluding cold-start inference:

| Model | Images | Total Time | Avg/Image | Speedup | Cold Start |
|---|---|---|---|---|---|
| FP8 Baseline | 98 | 820.8s | 8.38s | 1.00x | 48.16s |
| V2 (custom) | 98 | 443.4s | 4.52s | 1.85x | 68.52s |
| V3 (p99.9) | 98 | 446.4s | 4.55s | 1.84x | 60.43s |
| V4 (selective) | 98 | 592.6s | 6.05s | 1.39x | 50.58s |
| V5 (text quality) | 98 | 822.0s | 8.39s | 1.00x | 64.24s |

V1 produced 0 images due to loading errors (metadata corruption, fixed in later versions).
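For reference, the Avg/Image and Speedup columns follow from the image counts and total times: average is total time divided by images, and speedup is the baseline average divided by the model's average. A short sketch that recomputes them from the reported totals (the function name is hypothetical):

```python
# (images, total seconds) as reported in the benchmark table.
BENCHMARK = {
    "FP8 Baseline": (98, 820.8),
    "V2 (custom)": (98, 443.4),
    "V3 (p99.9)": (98, 446.4),
    "V4 (selective)": (98, 592.6),
    "V5 (text quality)": (98, 822.0),
}

def summarize(results, baseline="FP8 Baseline"):
    """Return {model: (avg_seconds_per_image, speedup_vs_baseline)}."""
    base_images, base_total = results[baseline]
    baseline_avg = base_total / base_images
    summary = {}
    for name, (images, total) in results.items():
        avg = total / images
        summary[name] = (avg, baseline_avg / avg)
    return summary

for name, (avg, speedup) in summarize(BENCHMARK).items():
    print(f"{name:18s} {avg:6.2f}s  {speedup:.2f}x")
```

The recomputed values agree with the table to within 0.01, which is the precision the totals allow.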


Pipeline Overview

Phase 1: Calibration (one-time per model)
  setup_calibration.sh        Patch ComfyUI to capture activation stats
  run_calibration.py          Run N generations via ComfyUI API
  convert_calibration.py      CALIB_DATA.json -> calibs/<name>.json
  restore_comfy.sh            Remove patches from ComfyUI

Phase 2: Conversion (4 steps)
  Step 1  quantize_fp8_to_nvfp4.py     FP8 -> NVFP4/FP8 per config policy
  Step 2  fix_text_encoder_scales.py    Add missing weight_scale for text encoder
  Step 3  fix_key_names.py              Strip model.diffusion_model.* prefix
  Step 4  add_input_scale.py            Apply calibration input_scale values
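As an illustration of Step 3, stripping the model.diffusion_model. prefix amounts to a key rename over the state dict. The sketch below works on a plain dictionary and is not the code in fix_key_names.py, which also handles metadata. The function name and sample keys are hypothetical.

```python
PREFIX = "model.diffusion_model."

def strip_prefix(state_dict, prefix=PREFIX):
    """Return a new dict with the prefix removed from every key that has it."""
    renamed = {}
    for key, value in state_dict.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in renamed:
            # Two source keys mapping to one target would silently drop a tensor.
            raise KeyError(f"duplicate key after renaming: {new_key}")
        renamed[new_key] = value
    return renamed

# Placeholder values stand in for tensors.
example = {
    "model.diffusion_model.transformer_blocks.0.attn.to_q.weight": "tensor-a",
    "model.diffusion_model.transformer_blocks.0.attn.to_q.weight_scale": "tensor-b",
    "already.normalized.key": "tensor-c",
}
for key in strip_prefix(example):
    print(key)
```

Raising on a collision is a deliberate choice here: a checkpoint that contains both prefixed and unprefixed copies of a key is better caught at conversion time than at load time.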

Quick Start

conda activate comfy

# Phase 1: Calibration
./setup_calibration.sh
python run_calibration.py
./restore_comfy.sh

# Phase 2: Conversion
./convert_with_custom_calib.sh NSFW calibs/qwen-rapid-aio-custom-p999.json

# Deploy
cp Qwen-Rapid-AIO-NSFW-v14.1-NVFP4-custom-p999.safetensors \
   /home/administrator/ComfyUI/models/checkpoints/

Architecture

The model uses a joint attention (MMDiT) architecture with 60 transformer blocks, each with parallel IMG and TXT streams:

User Prompt
    |
Separate CLIP (qwen_2.5_vl_7b_fp8_scaled.safetensors)
    |
Text Embeddings
    |
Diffusion Model (60 blocks, joint attention)
    |-- TXT stream: text embeddings guide generation (FP8, preserved)
    |-- IMG stream: image tokens through denoising (NVFP4/FP8 mix)
    |
VAE Decoder (qwen_image_vae.safetensors)
    |
Final Image

Each block has 18 layers. In V3 (recommended): 6 per block are NVFP4, 12 remain FP8.


Project Structure

qwen-fp8-to-nvfp4-quantization/
├── quantize_fp8_to_nvfp4.py        # Step 1: FP8 -> NVFP4 quantization
├── fix_text_encoder_scales.py       # Step 2: Compute text encoder weight_scale
├── fix_key_names.py                 # Step 3: Key name normalization + metadata
├── convert_with_custom_calib.sh     # 4-step conversion orchestrator
├── run_calibration.py               # Automated calibration via ComfyUI API
├── convert_calibration.py           # Convert CALIB_DATA.json format
├── setup_calibration.sh             # Apply calibration patches to ComfyUI
├── restore_comfy.sh                 # Undo calibration patches
├── comfy-dit-quantizer/             # Upstream quantizer (submodule)
│   ├── add_input_scale.py           # Step 4: Add input_scale from calibration
│   └── configs/                     # Quantization policy configs
├── calibs/                          # Custom calibration outputs
├── debug/                           # Debug and analysis tools
├── QUALITY_ANALYSIS.md              # Root cause analysis of quality issues
├── QUANTIZATION_JOURNEY.md          # Full iteration history (V1-V5)
├── CALIBRATION_GUIDE.md             # Detailed calibration reference
└── qwen_model_benchmark_report_0214.md  # Benchmark results (100 prompts)

Prerequisites

  • NVIDIA GPU with NVFP4 support (RTX 50 series / Blackwell)
  • ~50 GB free disk space
  • Conda environment with PyTorch >= 2.x, safetensors, requests
  • ComfyUI installed (for calibration)

Known Issues

| Issue | Cause | Fix |
|---|---|---|
| V1 model fails to load | _quantization_metadata copied from wrong reference | Fixed in V4+ with metadata patching |
| ComfyUI OOM after ~53 calibration runs | Calibration hooks accumulate memory | Use available data (53 samples sufficient for p99.9) |
| convert_with_custom_calib.sh path bug | Relative paths break after cd | Use absolute paths for Step 4 |
| Triton libcuda.so linker error | Missing unversioned symlink | sudo ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so |
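When debugging load failures like the V1 metadata issue, it can help to inspect a checkpoint's header without loading any tensors. A safetensors file begins with an 8-byte little-endian header length followed by a JSON header, so the standard library is enough. This sketch assumes _quantization_metadata is stored under the header's __metadata__ entry; the function names are hypothetical.

```python
import json
import struct

def read_safetensors_header(path):
    """Return (metadata, tensors) parsed from a safetensors file header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))   # little-endian u64
        header = json.loads(f.read(header_len).decode("utf-8"))
    metadata = header.pop("__metadata__", {})            # optional string-to-string map
    return metadata, header

def dtype_counts(tensors):
    """Count tensors per dtype string as recorded in the header."""
    counts = {}
    for info in tensors.values():
        counts[info["dtype"]] = counts.get(info["dtype"], 0) + 1
    return counts

def report(path):
    """Print a short summary: metadata keys, dtype mix, leftover prefixes."""
    metadata, tensors = read_safetensors_header(path)
    print("metadata keys:", sorted(metadata))
    print("dtype counts:", dtype_counts(tensors))
    leftover = [k for k in tensors if k.startswith("model.diffusion_model.")]
    print("keys still carrying the old prefix:", len(leftover))
```

Calling report on the converted checkpoint before copying it into ComfyUI gives a quick check that the metadata entry is present and that Step 3 removed the old key prefix.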

License

Scripts follow the same license as comfy-dit-quantizer.
