Qwen FP8 to NVFP4 Quantization Pipeline
Convert Qwen-Rapid-AIO diffusion models from FP8 to NVFP4 format for faster inference on RTX 5090 (Blackwell) with custom calibration to preserve image quality.
Source Model: Qwen-Rapid-AIO-NSFW-v14.1 (27GB, FP8, LoRA-fused)
Inference Stack: ComfyUI on RTX 5090 with native NVFP4 support
Recommended Model: V3_NVFP4_custom_p999 (best cost-performance balance)
Why This Exists
The official comfy-dit-quantizer assumes BF16 source models. Our source is already FP8, which means:
- FP8 weights must be upcast to FP32 before re-quantizing to NVFP4 (double quantization)
- Text encoder FP8 layers are missing weight_scale tensors
- Key names need normalization from model.diffusion_model.* to diffusers format
- Base calibration data doesn't match our LoRA-merged model's activation distributions
This pipeline solves all four problems.
Recommended Model: V3_NVFP4_custom_p999
After benchmarking all versions across 100 prompts, V3 delivers the best cost-performance ratio: 1.84x speedup over FP8 baseline with the tightest calibration scales.
| Model | Avg Time/Image | Speedup | NVFP4 Layers | FP8 Layers | Key Trait |
|---|---|---|---|---|---|
| FP8 Baseline | 8.38s | 1.00x | 0 | All | Reference quality |
| V2 (custom calib) | 4.52s | 1.85x | 360 | 599 | First custom calibration |
| V3 (p99.9 calib) | 4.55s | 1.84x | 360 | 599 | Best cost-performance |
| V4 (selective) | 6.05s | 1.39x | 240 | 719 | Better quality, slower |
| V5 (text quality) | 8.39s | 1.00x | 180 | 779 | Best quality, no speedup |
V3 uses P99.9 percentile calibration from 53 sample generations with skin-focused prompts. The p99.9 method produces ~74% tighter scales than max aggregation, giving NVFP4 layers better dynamic range utilization without being dominated by rare outlier activations.
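The difference between max and P99.9 aggregation can be sketched in a few lines. This is a minimal illustration, not the code in run_calibration.py: the `percentile` helper and the `observations` list are hypothetical, and it assumes the percentile is taken over all recorded per-call absolute maxima for a layer.

```python
def percentile(values, q):
    """Linear-interpolated percentile of `values`, with q in [0, 100]."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no observations")
    rank = (q / 100.0) * (len(ordered) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    frac = rank - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


# Hypothetical amax observations for one layer (made-up numbers):
# a stable body around 1.0 plus a handful of rare spikes.
observations = [1.0] * 9995 + [50.0] * 5

max_scale = max(observations)                 # dominated by the rare spikes
p999_scale = percentile(observations, 99.9)   # ignores the top 0.1%

print(max_scale, p999_scale)  # 50.0 1.0
```

With max aggregation, five spikes out of 10,000 observations set the scale for the whole layer. The percentile keeps the scale close to the bulk of the distribution, at the cost of clipping the spikes.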
V4 and V5 trade speed for quality by keeping more layers at FP8. V5 in particular loses all speedup benefit, making it impractical for production use.
The Problem: Quality Degradation in NVFP4
Converting FP8 to NVFP4 introduces two compounding quality issues:
1. Double Quantization
The pipeline stacks two aggressive quantizations, losing ~5+ bits of precision from the original BF16:
Original BF16 (7 mantissa bits)
-> FP8 E4M3 (3 mantissa bits) ~4 bits lost
-> Upcast to FP32 No recovery, same 3 bits of real info
-> NVFP4 E2M1 (1 mantissa bit) ~5+ bits total lost from original
The upcast to FP32 does NOT recover original precision. It just represents already-degraded FP8 values in a larger format.
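A small worked example makes the double-rounding effect concrete. This is a sketch under simplifying assumptions: `round_to_mantissa` is a hypothetical helper that only models mantissa rounding, and it ignores exponent range, saturation, and the block scales that real FP8 and NVFP4 tensors carry. It relies on NVFP4 elements using the E2M1 layout (1 mantissa bit).

```python
import math


def round_to_mantissa(x, mantissa_bits):
    """Round x to the nearest value with `mantissa_bits` explicit mantissa bits.

    Simplification: exponent range, subnormals, and scaling are ignored.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** (mantissa_bits + 1)     # +1 accounts for the implicit leading bit
    return math.ldexp(round(m * steps) / steps, e)


original = 1.3
bf16 = round_to_mantissa(original, 7)    # 1.296875
fp8 = round_to_mantissa(bf16, 3)         # 1.25
upcast = float(fp8)                      # still 1.25: a wider format adds no information
double_q = round_to_mantissa(upcast, 1)  # 1.0 (FP8 -> NVFP4, the path this pipeline takes)
direct_q = round_to_mantissa(bf16, 1)    # 1.5 (BF16 -> NVFP4, not available here)

print(abs(original - double_q), abs(original - direct_q))
```

For this input the two-stage path lands on 1.0 while the direct path lands on 1.5, so the double-quantized value is further from the original 1.3. Not every value behaves this way, but the example shows why the intermediate FP8 rounding can push a weight across an NVFP4 rounding boundary.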
Constraint: We cannot access the original BF16 weights or the original FP8 quantization scales. The Qwen-Rapid-AIO models are only distributed in FP8 format.
2. Calibration Mismatch
The base calibration (calibs/qwen-image-edit-2511.json) was generated from the original qwen-image-edit-2511 model, not our LoRA-merged variant. When LoRAs are merged (W' = W + BA), weight magnitudes and activation ranges shift significantly. Our custom calibration values were ~20x larger than the base calibration, confirming the mismatch.
| Mismatch Type | Effect on Output |
|---|---|
| Scale too large | Activations quantized toward zero (underflow) |
| Scale too small | Activations clipped at max (overflow) |
| General mismatch | Poor utilization of NVFP4's few representable values |
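The first two rows of the table can be reproduced with a toy fake-quantizer. This is a sketch, not the quantizer used by the pipeline: `fake_quant_e2m1` and the sample activations are hypothetical, and it uses a single scalar scale, whereas real NVFP4 kernels also apply per-block scales.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (sign handled separately).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def fake_quant_e2m1(x, scale):
    """Quantize x to the E2M1 grid under a single scale, then dequantize."""
    target = min(abs(x) / scale, E2M1_MAGNITUDES[-1])   # clip at the largest magnitude
    nearest = min(E2M1_MAGNITUDES, key=lambda g: abs(g - target))
    return math.copysign(nearest * scale, x)


activations = [0.02, 0.4, 1.1, 2.9]   # hypothetical values, true amax near 3.0

well_matched = [fake_quant_e2m1(a, 0.5) for a in activations]    # scale = amax / 6
too_large = [fake_quant_e2m1(a, 10.0) for a in activations]      # 20x too large
too_small = [fake_quant_e2m1(a, 0.025) for a in activations]     # 20x too small

print(well_matched)  # [0.0, 0.5, 1.0, 3.0]
print(too_large)     # [0.0, 0.0, 0.0, 5.0]
print(too_small)     # small values survive, everything else clips near 0.15
```

With the oversized scale, three of the four activations collapse to zero. With the undersized scale, three of the four saturate at the same clipped value.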
Improvement Journey
We iterated through 5 model versions, each addressing a specific quality or performance issue:
V1: Baseline NVFP4 (External Calibration)
- Used third-party calibration data that was not matched to this LoRA-fused model
- Noticeable quality degradation vs FP8 original
V2: Custom Calibration
- Generated custom calibration with 10 sample generations
- Custom values ~20x larger than base calibration, confirming LoRA merge changed activation distributions
- Some improvement, but skin quality still worse than FP8
V3: P99.9 Calibration + Skin-Focused Prompts (Recommended)
- Increased calibration to 53 samples (the target was 150, but ComfyUI ran out of memory after 53)
- Added 45+ portrait/skin-detail calibration prompts
- Switched from max to P99.9 aggregation (~74% tighter scales)
- Best balance of speed (1.84x) and quality
V4: Selective NVFP4
- Moved attn.to_v and attn.to_out.0 from NVFP4 to FP8
- Based on SVDQuant research: value projections carry content fidelity
- Better quality but speedup dropped to 1.39x
V5: Text Quality Optimized
- Moved img_mlp.net.0.proj to FP8 (GELU-adjacent, bimodal activation distribution)
- Only 3 layer types remain NVFP4: attn.to_q, attn.to_k, img_mlp.net.2
- Best quality but zero speedup (8.39s vs 8.38s baseline)
Research-Based Layer Sensitivity Ranking
From SVDQuant, ViDiT-Q, FP4DiT, and NVIDIA FLUX NVFP4 findings:
| Sensitivity | Layers | Recommendation |
|---|---|---|
| Critical | Text-stream attention (add_q/k/v_proj, to_add_out) | Keep FP8 |
| High | Modulation layers (img_mod.1, txt_mod.1) | Keep FP8 |
| High | GELU-adjacent MLP (img_mlp.net.0.proj) | Keep FP8 (V5+) |
| Medium | Value/Output projections (attn.to_v, attn.to_out.0) | Keep FP8 (V4+) |
| Low | Post-GELU MLP (img_mlp.net.2) | NVFP4 tolerant |
| Low | Q/K projections (attn.to_q, attn.to_k) | NVFP4 tolerant |
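The per-version NVFP4/FP8 split can be written down as a small policy table. This is a sketch for illustration, not the format of the configs under comfy-dit-quantizer/configs/. The V3 set is inferred by working backwards from the V4 and V5 changes described above, and the `transformer_blocks.<i>.<suffix>` key layout is an assumption.

```python
NUM_BLOCKS = 60  # transformer blocks in the diffusion model

# Layer-type suffixes quantized to NVFP4 in each version.
NVFP4_SUFFIXES = {
    "V3": {
        "attn.to_q", "attn.to_k", "attn.to_v", "attn.to_out.0",
        "img_mlp.net.0.proj", "img_mlp.net.2",
    },
}
# V4 moves the value/output projections back to FP8.
NVFP4_SUFFIXES["V4"] = NVFP4_SUFFIXES["V3"] - {"attn.to_v", "attn.to_out.0"}
# V5 additionally moves the GELU-adjacent MLP projection back to FP8.
NVFP4_SUFFIXES["V5"] = NVFP4_SUFFIXES["V4"] - {"img_mlp.net.0.proj"}


def layer_format(layer_name, version):
    """Return 'nvfp4' or 'fp8' for a key like 'transformer_blocks.7.attn.to_q'."""
    suffix = layer_name.split(".", 2)[2]   # drop the assumed 'transformer_blocks.<i>.' part
    return "nvfp4" if suffix in NVFP4_SUFFIXES[version] else "fp8"


nvfp4_layer_counts = {v: NUM_BLOCKS * len(s) for v, s in NVFP4_SUFFIXES.items()}
print(nvfp4_layer_counts)  # {'V3': 360, 'V4': 240, 'V5': 180}
```

The resulting counts match the NVFP4 Layers column in the benchmark table (360, 240, 180), which is a useful sanity check on the inferred V3 set.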
Further Improvement Plan
Three solutions remain available for teams needing even better quality:
| Solution | Quality Impact | Effort | Performance Cost |
|---|---|---|---|
| Preserve FP8-to-FP8 layers | Good | Low | None |
| Custom calibration on actual model | High | Medium | None |
| Selective NVFP4 scope reduction | Good | Medium | Some slowdown |
Preserve FP8-to-FP8 layers: The current pipeline re-quantizes FP8 layers through FP32 even when the target is FP8. Skipping this for FP8-targeted layers eliminates ~40% of unnecessary weight degradation with zero downside.
Full inference calibration: Running 256-512 diverse generations through the actual LoRA-merged model (varied prompts, image sizes, guidance scales, denoising steps) would produce the most accurate activation scales. The hook-based approach instruments the model during inference to capture per-layer amax values.
Selective NVFP4: Profiling which specific layers contribute most to quality degradation allows fine-tuning the NVFP4/FP8 boundary for the best quality-speed tradeoff on your specific use case.
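The hook-based capture mentioned under full inference calibration can be illustrated without any framework. This is a stand-in sketch: `AmaxRecorder` and the layer name are hypothetical, and the repo's actual ComfyUI patches are not shown here. In PyTorch, the same idea would typically be attached with `torch.nn.Module.register_forward_pre_hook`.

```python
from collections import defaultdict


class AmaxRecorder:
    """Record the absolute maximum of each layer's input, one scalar per call."""

    def __init__(self):
        self.amax = defaultdict(list)

    def wrap(self, name, layer_fn):
        """Return a callable that records amax for `name` before running the layer."""
        def wrapped(inputs):
            self.amax[name].append(max(abs(v) for v in inputs))
            return layer_fn(inputs)
        return wrapped


recorder = AmaxRecorder()
# Hypothetical layer: a plain function standing in for a linear projection.
to_q = recorder.wrap("transformer_blocks.0.attn.to_q", lambda xs: [2.0 * x for x in xs])

to_q([0.5, -3.0, 1.0])
to_q([0.1, 0.2])

print(dict(recorder.amax))  # {'transformer_blocks.0.attn.to_q': [3.0, 0.2]}
```

Storing one scalar per call keeps the recorder's memory footprint small, and the per-layer lists can then be reduced with max or a percentile.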
Benchmark Results
100 prompts, RTX 5090, excluding cold-start inference:
| Model | Images | Total Time | Avg/Image | Speedup | Cold Start |
|---|---|---|---|---|---|
| FP8 Baseline | 98 | 820.8s | 8.38s | 1.00x | 48.16s |
| V2 (custom) | 98 | 443.4s | 4.52s | 1.85x | 68.52s |
| V3 (p99.9) | 98 | 446.4s | 4.55s | 1.84x | 60.43s |
| V4 (selective) | 98 | 592.6s | 6.05s | 1.39x | 50.58s |
| V5 (text quality) | 98 | 822.0s | 8.39s | 1.00x | 64.24s |
V1 produced 0 images due to loading errors (metadata corruption, fixed in later versions).
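The Speedup column can be re-derived from the Total Time column, since every run produced the same number of images. The snippet below is only a consistency check on the numbers in the table above.

```python
# Total generation time in seconds for 98 images, copied from the table above.
total_seconds = {
    "FP8 Baseline": 820.8,
    "V2 (custom)": 443.4,
    "V3 (p99.9)": 446.4,
    "V4 (selective)": 592.6,
    "V5 (text quality)": 822.0,
}

baseline = total_seconds["FP8 Baseline"]
# Speedup = baseline time / model time, rounded to two decimals as in the table.
speedup = {name: round(baseline / t, 2) for name, t in total_seconds.items()}

for name, value in speedup.items():
    print(f"{name}: {value:.2f}x")
```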
Pipeline Overview
Phase 1: Calibration (one-time per model)
setup_calibration.sh Patch ComfyUI to capture activation stats
run_calibration.py Run N generations via ComfyUI API
convert_calibration.py CALIB_DATA.json -> calibs/<name>.json
restore_comfy.sh Remove patches from ComfyUI
Phase 2: Conversion (4 steps)
Step 1 quantize_fp8_to_nvfp4.py FP8 -> NVFP4/FP8 per config policy
Step 2 fix_text_encoder_scales.py Add missing weight_scale for text encoder
Step 3 fix_key_names.py Strip model.diffusion_model.* prefix
Step 4 add_input_scale.py Apply calibration input_scale values
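Step 3's prefix stripping amounts to renaming dictionary keys. The sketch below shows the idea on a plain dict. `strip_prefix` is a hypothetical helper, the example key names are made up, and the real fix_key_names.py also handles metadata, which is not modeled here.

```python
PREFIX = "model.diffusion_model."


def strip_prefix(tensors, prefix=PREFIX):
    """Return a new dict with `prefix` removed from every key that starts with it."""
    renamed = {}
    for key, value in tensors.items():
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        if new_key in renamed:
            # Guard against two source keys collapsing onto the same name.
            raise KeyError(f"duplicate key after renaming: {new_key}")
        renamed[new_key] = value
    return renamed


# Placeholder values stand in for tensors.
example = {
    "model.diffusion_model.transformer_blocks.0.attn.to_q.weight": "tensor-a",
    "other.component.weight": "tensor-b",
}
print(list(strip_prefix(example)))
# ['transformer_blocks.0.attn.to_q.weight', 'other.component.weight']
```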
Quick Start
conda activate comfy
# Phase 1: Calibration
./setup_calibration.sh
python run_calibration.py
./restore_comfy.sh
# Phase 2: Conversion
./convert_with_custom_calib.sh NSFW calibs/qwen-rapid-aio-custom-p999.json
# Deploy
cp Qwen-Rapid-AIO-NSFW-v14.1-NVFP4-custom-p999.safetensors \
/home/administrator/ComfyUI/models/checkpoints/
Architecture
The model uses a joint attention (MMDiT) architecture with 60 transformer blocks, each with parallel IMG and TXT streams:
User Prompt
|
Separate CLIP (qwen_2.5_vl_7b_fp8_scaled.safetensors)
|
Text Embeddings
|
Diffusion Model (60 blocks, joint attention)
|-- TXT stream: text embeddings guide generation (FP8, preserved)
|-- IMG stream: image tokens through denoising (NVFP4/FP8 mix)
|
VAE Decoder (qwen_image_vae.safetensors)
|
Final Image
Each block has 18 layers. In V3 (recommended): 6 per block are NVFP4, 12 remain FP8.
Project Structure
qwen-fp8-to-nvfp4-quantization/
├── quantize_fp8_to_nvfp4.py              # Step 1: FP8 -> NVFP4 quantization
├── fix_text_encoder_scales.py            # Step 2: Compute text encoder weight_scale
├── fix_key_names.py                      # Step 3: Key name normalization + metadata
├── convert_with_custom_calib.sh          # 4-step conversion orchestrator
├── run_calibration.py                    # Automated calibration via ComfyUI API
├── convert_calibration.py                # Convert CALIB_DATA.json format
├── setup_calibration.sh                  # Apply calibration patches to ComfyUI
├── restore_comfy.sh                      # Undo calibration patches
├── comfy-dit-quantizer/                  # Upstream quantizer (submodule)
│   ├── add_input_scale.py                # Step 4: Add input_scale from calibration
│   └── configs/                          # Quantization policy configs
├── calibs/                               # Custom calibration outputs
├── debug/                                # Debug and analysis tools
├── QUALITY_ANALYSIS.md                   # Root cause analysis of quality issues
├── QUANTIZATION_JOURNEY.md               # Full iteration history (V1-V5)
├── CALIBRATION_GUIDE.md                  # Detailed calibration reference
└── qwen_model_benchmark_report_0214.md   # Benchmark results (100 prompts)
Prerequisites
- NVIDIA GPU with NVFP4 support (RTX 50 series / Blackwell)
- ~50 GB free disk space
- Conda environment with PyTorch >= 2.x, safetensors, requests
- ComfyUI installed (for calibration)
Known Issues
| Issue | Cause | Fix |
|---|---|---|
| V1 model fails to load | _quantization_metadata copied from wrong reference | Fixed in V4+ with metadata patching |
| ComfyUI OOM after ~53 calibration runs | Calibration hooks accumulate memory | Use available data (53 samples sufficient for p99.9) |
| convert_with_custom_calib.sh path bug | Relative paths break after cd | Use absolute paths for Step 4 |
| Triton libcuda.so linker error | Missing unversioned symlink | sudo ln -sf /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/lib/x86_64-linux-gnu/libcuda.so |
References
- comfy-dit-quantizer: core quantization tools
- SVDQuant (MIT): layer sensitivity research
- FP4DiT: GELU-adjacent layer findings
- QUALITY_ANALYSIS.md: root cause analysis
- QUANTIZATION_JOURNEY.md: full V1-V5 iteration history
- CALIBRATION_GUIDE.md: calibration workflow reference
License
Scripts follow the same license as comfy-dit-quantizer.