# ERNIE-Image-NF4

`ERNIE-Image-NF4` is a BitsAndBytes 4-bit NF4 quantized version of `ERNIE-Image`. The goal is to preserve the original inference workflow as much as possible while significantly reducing model size and deployment cost.

## Quantization Setup

This project uses the BitsAndBytes quantization path supported by Hugging Face Diffusers and Transformers:

- Quantization type: `4-bit NF4`
- Double quantization: `bnb_4bit_use_double_quant=True`
- Compute dtype: `bfloat16`
- Quantized components: `transformer`, `text_encoder`, `pe`
- Components kept in original precision: `vae`, `scheduler`, `tokenizer`, `pe_tokenizer`

Quantization metadata is stored in:

- `ERNIE-Image-NF4/quantization_metadata.json`

## Comparison Results

### Total Size

| Item | Total Size |
|---|---:|
| Original ERNIE-Image | 31,648,081,051 B |
| ERNIE-Image-NF4 | 9,301,075,597 B |

Summary:

- The quantized directory is about `29.4%` of the original size
- Total size is reduced by about `70.61%`
- The full model directory is reduced to about `1/3.4` of the original

### Key Weight Files

| Component | Before Quantization | After Quantization | Reduction |
|---|---:|---:|---:|
| `transformer` | 16,067,025,160 B | 4,146,045,428 B | 74.20% |
| `text_encoder` | 7,698,241,104 B | 2,585,177,824 B | 66.42% |
| `pe` | 7,663,346,520 B | 2,367,115,496 B | 69.11% |

Notes:

- The original `transformer` is stored in two shards, while the quantized version is saved as a single `transformer/diffusion_pytorch_model.safetensors`
- `vae`, `scheduler`, `tokenizer`, and `pe_tokenizer` keep their original precision and directory structure

## Practical Validation

The quantized model has been validated with end-to-end image generation, so the package is a runnable model and not just a set of converted weights:

- Validation environment: `conda` environment `ernie-image`
- GPU: `NVIDIA GeForce RTX 3060`
- Resolution: `512x512`
- Parameters: `num_inference_steps=50`, `guidance_scale=4.0`, `use_pe=True`
- Output image: `ERNIE-Image-NF4/demo_output.png`

## Performance Experiment

To measure the practical gain after quantization, an additional benchmark was run with the same prompt and inference settings for both models.

Experiment setup:

- GPU: `NVIDIA GeForce RTX 3060`
- Prompt: `一只橙色的小猫坐在木桌上,前面放着一张写有 Hello Ernie-Image NF4 的白纸,柔和自然光,写实风格。` (in English: "An orange kitten sitting on a wooden table, with a sheet of white paper in front of it that reads Hello Ernie-Image NF4, soft natural light, photorealistic style.")
- Resolution: `512x512`
- Inference steps: `50`
- `guidance_scale=4.0`
- `seed=1234`
- Original model execution: `enable_model_cpu_offload()`
- Quantized model execution: quantized `transformer`, `text_encoder`, and `pe` loaded on GPU, with `vae` kept in `bfloat16`

Results:

| Metric | Original ERNIE-Image | ERNIE-Image-NF4 | Observation |
|---|---:|---:|---|
| Load time | 8.15 s | 28.56 s | Quantized model loads slower |
| Inference time | 213.69 s | 29.04 s | Quantized model is about `7.36x` faster |
| Total time | 221.83 s | 57.60 s | Quantized model is about `3.85x` faster overall |
| Peak reserved VRAM | 15.64 GiB | 9.94 GiB | Peak memory drops by about `36.41%` |

Interpretation:

- The original model runs on this machine only with CPU offload, which makes inference much slower
- The quantized model has a more complex cold start, so its load time is longer
- Once sampling starts, the quantized model is much faster in this setup
- Lower peak VRAM is one of the main reasons the quantized version is easier to run on this GPU

## Quick Start

Use the existing `conda` environment:

```bash
conda run -n ernie-image python quantize_ernie_image_nf4.py --overwrite
conda run -n ernie-image python validate_quantized_ernie_image.py
```

The default quantized model root directory is:

```text
ERNIE-Image-NF4/
```

## Inference Demo

Below is a minimal runnable example:

```python
from pathlib import Path

import torch
from diffusers import AutoModel, ErnieImagePipeline
from transformers import AutoModel as TransformersAutoModel
from transformers import AutoModelForCausalLM

model_dir = Path("ERNIE-Image-NF4")

# Load the three quantized components directly onto GPU 0.
transformer = AutoModel.from_pretrained(
    str(model_dir / "transformer"),
    dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)
text_encoder = TransformersAutoModel.from_pretrained(
    str(model_dir / "text_encoder"),
    dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)
pe = AutoModelForCausalLM.from_pretrained(
    str(model_dir / "pe"),
    dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

# Assemble the pipeline around the preloaded quantized components.
pipe = ErnieImagePipeline.from_pretrained(
    str(model_dir),
    transformer=transformer,
    text_encoder=text_encoder,
    pe=pe,
    dtype=torch.bfloat16,
    local_files_only=True,
)

# The VAE is not quantized; move it to the GPU in bfloat16.
pipe.vae.to("cuda", dtype=torch.bfloat16)

image = pipe(
    prompt="一只橙色的小猫坐在木桌上,前面放着一张写有 Hello Ernie-Image NF4 的白纸,柔和自然光,写实风格。",
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    use_pe=True,
    generator=torch.Generator(device="cuda").manual_seed(1234),
).images[0]
image.save(model_dir / "demo_output.png")
```

## References

- [ERNIE-Image](https://huggingface.co/baidu/ERNIE-Image)
- [BitsAndBytes](https://hugging-face.cn/docs/transformers/quantization/bitsandbytes)
- [Diffusers](https://github.com/huggingface/diffusers)
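
## Appendix: Checking the Reported Ratios

The size reductions and speedups quoted in the comparison and benchmark tables follow from simple arithmetic on the raw byte counts and timings. The sketch below recomputes them as a sanity check. It is illustrative only: the helper name `reduction_pct` is hypothetical and is not part of the project scripts, and all inputs are copied from the tables in this README.

```python
# Recompute the size reductions and speedups quoted in this README.
# `reduction_pct` is a hypothetical helper, not part of the project scripts.

def reduction_pct(before: float, after: float) -> float:
    """Return the percentage by which `after` is smaller than `before`."""
    return (1 - after / before) * 100


# (bytes before quantization, bytes after quantization), taken from the tables.
sizes = {
    "total": (31_648_081_051, 9_301_075_597),
    "transformer": (16_067_025_160, 4_146_045_428),
    "text_encoder": (7_698_241_104, 2_585_177_824),
    "pe": (7_663_346_520, 2_367_115_496),
}

size_reductions = {
    name: round(reduction_pct(before, after), 2)
    for name, (before, after) in sizes.items()
}

# Timings in seconds, taken from the benchmark table (original / quantized).
inference_speedup = round(213.69 / 29.04, 2)
total_speedup = round(221.83 / 57.60, 2)

for name, pct in size_reductions.items():
    print(f"{name}: {pct:.2f}% smaller")
print(f"inference speedup: {inference_speedup}x")
print(f"total speedup: {total_speedup}x")
```

Working from the raw byte counts, not from rounded GB figures, keeps the percentages reproducible to two decimal places.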