---
license: apache-2.0
language:
- en
- zh
pipeline_tag: image-to-video
tags:
  - video generation
  - diffusion-single-file
  - comfyui
  - distillation
  - LoRA
  - quantization
  - nvfp4
library_name: diffusers
inference:
  parameters:
    num_inference_steps: 4
base_model:
- lightx2v/Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v
base_model_relation: quantized
---
# Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v-NVFP4

<p align="center">
    <img src="assets/img_lightx2v.png" width=33%/>
</p>

## Overview
This is a **partial NVFP4 quantization** of [Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v](https://huggingface.co/lightx2v/Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v) by lightx2v, produced using [convert_to_quant](https://github.com/silveroxides/convert_to_quant) by [silveroxides](https://huggingface.co/silveroxides).

[Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v](https://huggingface.co/lightx2v/Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v) is an image-to-video generation model built on [Wan2.1-I2V-14B-480P](https://huggingface.co/Wan-AI/Wan2.1-I2V-14B-480P). It applies step distillation and classifier-free guidance distillation to reduce inference to **4 steps** without CFG, cutting generation time substantially while preserving output quality.

### IMPORTANT
NVFP4 is only supported on NVIDIA Blackwell architecture GPUs. Running this model therefore requires a Blackwell GPU, a torch build with Blackwell support enabled, a recent version of ComfyUI, and [comfy-kitchen](https://github.com/Comfy-Org/comfy-kitchen) built against CUDA 13.
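
A quick way to check the hardware requirement is to look at the CUDA compute capability that torch reports. The sketch below is only a hedged illustration: it assumes that Blackwell GPUs report a major compute capability of 10 or higher (for example 12.0 on RTX 50-series cards), and both helper names are hypothetical. It does not verify the ComfyUI or comfy-kitchen side of the setup.

```python
def supports_nvfp4(capability):
    """Return True if a (major, minor) compute capability looks like Blackwell.

    Assumption: Blackwell parts report a major version of 10 or higher.
    """
    major, _minor = capability
    return major >= 10


def detect_capability():
    """Query the first CUDA device through torch; None if torch or CUDA is missing."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

For example, `supports_nvfp4(detect_capability() or (0, 0))` evaluates to `False` on a machine without a usable CUDA device.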

<div style="display: flex; align-items: center; gap: 16px;">
  <img src="assets/wan21_input_cat.png" width="45%"/>
  <span style="font-size: 2em;">➡</span>
  <video poster="assets/wan21_input_cat.png" src="https://huggingface.co/InsecureErasure/Wan2.1-I2V-14B-480P-StepDistill-CfgDistill-Lightx2v-NVFP4/resolve/main/assets/wan21_output_cat.mp4" width="45%" controls autoplay loop muted></video>
</div>

## Quantization
The model weights have been partially quantized to **NVFP4** (NVIDIA Floating Point 4-bit) and **MXFP8**, quantization formats supported on NVIDIA Blackwell architecture GPUs.
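
To give a feel for what 4-bit quantization does to a weight tensor, here is a simplified sketch of NVFP4-style block quantization in plain Python. It assumes the E2M1 value grid and one shared scale per 16 values; it deliberately ignores that real NVFP4 stores each block scale in FP8 and adds a per-tensor scale, and it is not the code used by `convert_to_quant`. The function names are hypothetical.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive values


def quantize_block(block):
    """Fake-quantize one block: scale into [-6, 6], snap to the grid, rescale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    scale = amax / E2M1_GRID[-1]  # simplification: kept in full precision here
    out = []
    for x in block:
        # Nearest representable magnitude for the scaled value.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append(mag * scale if x >= 0 else -mag * scale)
    return out


def quantize(values):
    """Apply quantize_block to consecutive 16-element blocks of a flat list."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        out.extend(quantize_block(values[start:start + BLOCK_SIZE]))
    return out
```

A single outlier in a block stretches the shared scale and pushes small neighbours to zero, which is one reason outlier-heavy tensors are candidates for a wider format.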

The quantization format assigned to each layer is based on a sensitivity analysis performed with a custom script, which scores each weight tensor using excess kurtosis, dynamic range, and aspect ratio. Thresholds are derived automatically from the model's own score distribution.
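
The custom script itself is not included here, so the following is only a sketch of how the three statistics could be computed, using their textbook definitions. How the scores are weighted and combined is not stated above and is left out; the nearest-rank `percentile_threshold` helper is an assumption about what "derived from the score distribution" might look like, and all function names are hypothetical.

```python
import math


def excess_kurtosis(values):
    """Fourth standardized moment minus 3; large values indicate heavy tails."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / (m2 ** 2) - 3.0 if m2 > 0 else 0.0


def dynamic_range(values):
    """log10 of the ratio between the largest and smallest nonzero magnitudes."""
    mags = [abs(x) for x in values if x != 0.0]
    return math.log10(max(mags) / min(mags)) if mags else 0.0


def aspect_ratio(shape):
    """Ratio of the longer to the shorter side of a 2-D weight matrix."""
    rows, cols = shape
    return max(rows, cols) / min(rows, cols)


def percentile_threshold(scores, q):
    """Score at quantile q (0..1) of the model's own distribution, nearest rank."""
    ordered = sorted(scores)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]
```

Tensors whose score exceeds such a threshold would then be candidates for MXFP8 or for exclusion from quantization, rather than NVFP4.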

The analysis yields the following `convert_to_quant` parameters. The conversion takes about 4 hours on an RTX 5060 and produces a 9.76 GiB safetensors file.
```bash
# "${1}" is the path to the source .safetensors file
convert_to_quant -i "${1}" \
  --nvfp4 --wan --comfy_quant --save-quant-metadata \
  --custom-type mxfp8 \
  --custom-layers "blocks\.(1|2|3)\.cross_attn\.k\.weight|blocks\.(6|8|9|10)\.cross_attn\.k\.weight|blocks\.(0|1|2|3)\.cross_attn\.v\.weight|blocks\.(6)\.cross_attn\.q\.weight|blocks\.(6|14)\.cross_attn\.o\.weight|blocks\.(0|1|2|3)\.cross_attn\.v_img\.weight|blocks\.(0)\.self_attn\.k\.weight|blocks\.(7|9|10|12|13|14)\.self_attn\.k\.weight|blocks\.(19)\.self_attn\.q\.weight|blocks\.(0|1|2|3)\.ffn\.0\.weight|blocks\.(36|37|38|39)\.ffn\.0\.weight" \
  --exclude-layers "blocks\.(4|5|7)\.cross_attn\.k\.weight|blocks\.(0)\.cross_attn\.q\.weight|blocks\.(5|7|9|10|11|12|19|20)\.cross_attn\.o\.weight|blocks\.(8|11|33)\.self_attn\.k\.weight|blocks\.(38)\.self_attn\.k\.weight|blocks\.(14|16|17)\.self_attn\.q\.weight" \
  --num-iter 6000 \
  --top-p 0.35 \
  --calib-samples 8192 \
  --extract-lora --lora-rank 64 \
  --lora-target "ffn\.(0|2)\.weight|self_attn\.(v|o)\.weight" \
  -o "${1%%.safetensors}-nvfp4.safetensors"
```

The conversion also extracts a rank-64 LoRA, which can be applied at inference time to reduce the quality loss introduced by quantization.

The table below shows, for each layer type, the share of transformer blocks assigned to each format:

| **Layer** | **BF16** | **MXFP8** | **NVFP4** |                                                                                                                    
|:----:|:-------:|:--------:|:--------:|                                                                                                                     
| `cross_attn.k` | 7.5% | 17.5% | 75.0% |                                                                                                                    
| `cross_attn.k_img` | — | — | **100%** |                                                                                                                    
| `cross_attn.norm_k` | **100%** | — | — |                                                                                                                   
| `cross_attn.norm_k_img` | **100%** | — | — |                                                                                                               
| `cross_attn.norm_q` | **100%** | — | — |                                                                                                                   
| `cross_attn.o` | 20.0% | 5.0% | 75.0% |                                                                                                                    
| `cross_attn.q` | 2.5% | 2.5% | 95.0% |                                                                                                                     
| `cross_attn.v` | — | 10.0% | 90.0% |                                                                                                                       
| `cross_attn.v_img` | — | 10.0% | 90.0% |                                                                                                                   
| `ffn.0` | — | 20.0% | 80.0% |                                                                                                                              
| `ffn.2` | — | — | **100%** |                                                                                                                               
| `norm3` | **100%** | — | — |                                                                                                                               
| `self_attn.k` | 10.0% | 17.5% | 72.5% |                                                                                                                    
| `self_attn.norm_k` | **100%** | — | — |                                                                                                                    
| `self_attn.norm_q` | **100%** | — | — |                                                                                                                    
| `self_attn.o` | — | — | **100%** |                                                                                                                         
| `self_attn.q` | 7.5% | 2.5% | 90.0% |                                                                                                                      
| `self_attn.v` | — | — | **100%** |                                                                                                                         
| **Total** | **36.0%** | **4.7%** | **59.3%** |     
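
The per-layer percentages can be cross-checked against the layer patterns passed to `convert_to_quant`. The sketch below does this for `cross_attn.k` only. It assumes 40 transformer blocks (indices 0 to 39, as the patterns suggest), that `--custom-layers` selects MXFP8 while `--exclude-layers` leaves tensors in BF16, and that patterns are applied as a regular-expression search. `shares` is a hypothetical helper, not part of `convert_to_quant`.

```python
import re

NUM_BLOCKS = 40  # assumption: blocks are indexed 0..39

# Sub-patterns copied from --custom-layers (MXFP8) and --exclude-layers (BF16),
# restricted to cross_attn.k for brevity.
MXFP8 = (r"blocks\.(1|2|3)\.cross_attn\.k\.weight"
         r"|blocks\.(6|8|9|10)\.cross_attn\.k\.weight")
BF16 = r"blocks\.(4|5|7)\.cross_attn\.k\.weight"


def shares(layer, mxfp8_pattern, bf16_pattern, num_blocks=NUM_BLOCKS):
    """Return (BF16 %, MXFP8 %, NVFP4 %) for one layer type across all blocks."""
    names = [f"blocks.{i}.{layer}.weight" for i in range(num_blocks)]
    bf16 = sum(1 for name in names if re.search(bf16_pattern, name))
    mxfp8 = sum(1 for name in names if re.search(mxfp8_pattern, name))
    nvfp4 = num_blocks - bf16 - mxfp8  # everything else falls through to NVFP4
    return tuple(100.0 * count / num_blocks for count in (bf16, mxfp8, nvfp4))


print(shares("cross_attn.k", MXFP8, BF16))  # (7.5, 17.5, 75.0)
```

The result matches the `cross_attn.k` row of the table: 3 blocks in BF16, 7 in MXFP8, and the remaining 30 in NVFP4.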

## Inference
The model can be used in ComfyUI with the following parameters, based on the distilled model's own recommendations:

| Parameter | Value |
|-----------|-------|
| Shift | 5.0 |
| Sampler | LCM |
| Scheduler | normal |
| CFG | 1.0 |
| Steps | 4 |

The combinations euler/simple and heun/linear_quadratic (sampler/scheduler) are also known to produce good results.
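
As a rough illustration of what the Shift value does, the sketch below applies the time-shift formula commonly used with flow-matching models to a 4-step schedule. That this exact formula applies here is an assumption, and the evenly spaced starting levels are a simplification: ComfyUI's schedulers compute their noise levels differently, so the numbers are indicative only. Both function names are hypothetical.

```python
def shift_sigma(sigma, shift):
    """Remap a noise level in [0, 1]; shift > 1 pushes levels toward high noise."""
    return shift * sigma / (1.0 + (shift - 1.0) * sigma)


def shifted_schedule(steps, shift):
    """Shift `steps + 1` evenly spaced noise levels running from 1.0 down to 0.0."""
    return [shift_sigma(1.0 - i / steps, shift) for i in range(steps + 1)]


for level in shifted_schedule(steps=4, shift=5.0):
    print(round(level, 4))
```

With a shift of 5.0, the first three steps all stay above a noise level of 0.6, so most of the 4-step budget is spent on the high-noise part of the trajectory.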

The model is designed to generate 81 frames and is compatible with LoRAs. Sampling completes in under 60 seconds on an RTX 5060, so a full 81-frame video can be produced in under two minutes. With RIFE frame interpolation, those 81 frames yield a 10-second video.
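
The frame arithmetic behind the 10-second figure can be sketched as follows. The playback rate of 16 fps and the 2x interpolation factor are assumptions made for illustration, not values stated above, and the helper names are hypothetical.

```python
def interpolated_frames(frames, factor):
    """Frame count after inserting (factor - 1) new frames between each pair."""
    return (frames - 1) * factor + 1


def duration_seconds(frames, fps):
    """Clip length in seconds at a given playback rate."""
    return frames / fps


FPS = 16     # assumed playback rate
FACTOR = 2   # assumed RIFE interpolation factor

raw = 81
smooth = interpolated_frames(raw, FACTOR)
print(raw, duration_seconds(raw, FPS))        # 81 5.0625
print(smooth, duration_seconds(smooth, FPS))  # 161 10.0625
```

Under these assumptions, interpolation roughly doubles the clip length from about 5 seconds to about 10 seconds at the same playback rate.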

Abrupt camera movements or fast subject motion may produce artifacts. This is an inherent limitation of applying aggressive quantization to an already distilled model.

## License Agreement
This model is licensed under the [Apache 2.0 License](LICENSE.txt). You retain full ownership of your generated content, but are solely responsible for its use in compliance with the license terms and applicable laws.

## Acknowledgements
Big kudos to the contributors to the [Wan2.1](https://huggingface.co/Wan-AI/Wan2.1-T2V-14B) and [Self-Forcing](https://huggingface.co/gdhe17/Self-Forcing/tree/main) repositories for their open research, and to [silveroxides](https://huggingface.co/silveroxides) for their quantization tools.