# ERNIE-Image-NF4

`ERNIE-Image-NF4` is a BitsAndBytes 4-bit NF4 quantized version of `ERNIE-Image`. The goal is to preserve the original inference workflow as much as possible while significantly reducing model size and deployment cost.

## Quantization Setup

This project uses the BitsAndBytes quantization path supported by Hugging Face Diffusers and Transformers:

- Quantization type: `4-bit NF4`
- Double quantization: `bnb_4bit_use_double_quant=True`
- Compute dtype: `bfloat16`
- Quantized components: `transformer`, `text_encoder`, `pe`
- Components kept in original precision: `vae`, `scheduler`, `tokenizer`, `pe_tokenizer`
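
The settings above correspond to fields of the `BitsAndBytesConfig` classes that Diffusers and Transformers both provide. The following is a minimal sketch of how such a configuration could be built. It is not copied from `quantize_ernie_image_nf4.py`, and the helper name `build_bnb_configs` is hypothetical.

```python
# Shared NF4 settings, mirroring the list above.
NF4_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
}
COMPUTE_DTYPE = "bfloat16"


def build_bnb_configs():
    """Return (diffusers_config, transformers_config) for NF4 loading.

    Hypothetical helper. The heavy imports are done lazily so the
    settings above can be inspected without torch installed.
    """
    import torch
    from diffusers import BitsAndBytesConfig as DiffusersBnbConfig
    from transformers import BitsAndBytesConfig as TransformersBnbConfig

    dtype = getattr(torch, COMPUTE_DTYPE)  # torch.bfloat16
    diffusers_cfg = DiffusersBnbConfig(bnb_4bit_compute_dtype=dtype, **NF4_SETTINGS)
    transformers_cfg = TransformersBnbConfig(bnb_4bit_compute_dtype=dtype, **NF4_SETTINGS)
    return diffusers_cfg, transformers_cfg
```

Under this assumption, the Diffusers config would be passed as `quantization_config=` when loading `transformer`, and the Transformers config when loading `text_encoder` and `pe`.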

Quantization metadata is stored in:

- `ERNIE-Image-NF4/quantization_metadata.json`

## Comparison Results

### Total Size

| Item | Total Size |
|---|---:|
| Original ERNIE-Image | 31,648,081,051 B |
| ERNIE-Image-NF4 | 9,301,075,597 B |

Summary:

- The quantized directory is about `29.4%` of the original size
- Total size is reduced by about `70.61%`
- The full model directory is reduced to about `1/3.4` of the original
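
The three summary figures follow directly from the two byte counts in the table. A short sketch of the arithmetic (the function name `size_summary` is only illustrative):

```python
ORIGINAL_BYTES = 31_648_081_051   # Original ERNIE-Image
QUANTIZED_BYTES = 9_301_075_597   # ERNIE-Image-NF4


def size_summary(original: int, quantized: int) -> dict:
    """Remaining share, reduction, and shrink factor for two sizes."""
    ratio = quantized / original
    return {
        "remaining_pct": round(ratio * 100, 2),
        "reduction_pct": round((1 - ratio) * 100, 2),
        "shrink_factor": round(original / quantized, 2),
    }


summary = size_summary(ORIGINAL_BYTES, QUANTIZED_BYTES)
print(summary)
```

The same function can be applied to the per-component byte counts to check the reduction column of a size table.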

### Key Weight Files

| Component | Before Quantization | After Quantization | Reduction |
|---|---:|---:|---:|
| `transformer` | 16,067,025,160 B | 4,146,045,428 B | 74.20% |
| `text_encoder` | 7,698,241,104 B | 2,585,177,824 B | 66.42% |
| `pe` | 7,663,346,520 B | 2,367,115,496 B | 69.11% |

Notes:

- The original `transformer` is stored in two shards, while the quantized version is saved as a single `transformer/diffusion_pytorch_model.safetensors`
- `vae`, `scheduler`, `tokenizer`, and `pe_tokenizer` keep their original precision and directory structure
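
Byte counts like the ones above can be gathered by summing file sizes on disk. The sketch below assumes the model directory is available locally and that each component lives in a sub-directory of the same name. The helper names are hypothetical, and because it counts every file in a directory, its totals may differ slightly from figures that count weight files only.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root, recursively."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def component_sizes(model_dir: Path, components=("transformer", "text_encoder", "pe")) -> dict:
    """Per-component directory sizes; missing directories are reported as 0."""
    return {
        name: dir_size_bytes(model_dir / name) if (model_dir / name).is_dir() else 0
        for name in components
    }


if __name__ == "__main__":
    model_dir = Path("ERNIE-Image-NF4")
    if model_dir.is_dir():
        print("total:", dir_size_bytes(model_dir), "B")
        for name, size in component_sizes(model_dir).items():
            print(f"{name}: {size} B")
```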

## Practical Validation

The quantized model has been validated with end-to-end image generation, so the package is a runnable model and not just a set of converted weights:

- Validation environment: `conda` environment `ernie-image`
- GPU: `NVIDIA GeForce RTX 3060`
- Resolution: `512x512`
- Parameters: `num_inference_steps=50`, `guidance_scale=4.0`, `use_pe=True`
- Output image: `ERNIE-Image-NF4/demo_output.png`

## Performance Experiment

To measure the practical gain from quantization, the original and the quantized model were benchmarked with the same prompt and inference settings.

Experiment setup:

- GPU: `NVIDIA GeForce RTX 3060`
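- Prompt: `一只橙色的小猫坐在木桌上，前面放着一张写有 Hello Ernie-Image NF4 的白纸，柔和自然光，写实风格。` (in English: "An orange kitten sits on a wooden table with a sheet of white paper reading 'Hello Ernie-Image NF4' in front of it; soft natural light, realistic style.")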
- Resolution: `512x512`
- Inference steps: `50`
- `guidance_scale=4.0`
- `seed=1234`
- Original model execution: `enable_model_cpu_offload()`
- Quantized model execution: quantized `transformer`, `text_encoder`, and `pe` loaded on GPU, with `vae` kept in `bfloat16`
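
This README does not say how the timings and memory figures were collected. One plausible way to take such measurements with PyTorch is sketched below. The helper names are hypothetical, and using `torch.cuda.max_memory_reserved()` for the "peak reserved VRAM" metric is an assumption.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def timed_on_gpu(fn, *args, **kwargs):
    """Like timed(), but also returns peak reserved VRAM in GiB.

    Requires PyTorch with CUDA; torch is imported lazily.
    """
    import torch

    def run():
        out = fn(*args, **kwargs)
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
        return out

    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    result, seconds = timed(run)
    peak_gib = torch.cuda.max_memory_reserved() / 1024**3
    return result, seconds, peak_gib
```

With a helper like this, load time would be measured around the `from_pretrained` calls and inference time around the pipeline call, using a fixed seed so both runs sample the same noise.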

Results:

| Metric | Original ERNIE-Image | ERNIE-Image-NF4 | Observation |
|---|---:|---:|---|
| Load time | 8.15 s | 28.56 s | Quantized model loads slower |
| Inference time | 213.69 s | 29.04 s | Quantized model is about `7.36x` faster |
| Total time | 221.83 s | 57.60 s | Quantized model is about `3.85x` faster overall |
| Peak reserved VRAM | 15.64 GiB | 9.94 GiB | Peak memory drops by about `36.4%` |

Interpretation:

- The original model runs on this machine only with `enable_model_cpu_offload()`, which makes inference much slower
- The quantized model has a slower cold start: it takes 28.56 s to load, versus 8.15 s for the original
- Once sampling starts, the quantized model is about `7.36x` faster in this setup
- Lower peak VRAM (9.94 GiB versus 15.64 GiB) is one of the main reasons the quantized version is easier to run on this GPU
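
The ratios in the Observation column can be recomputed from the table values. The sketch below uses the rounded figures as printed, so the last digit may differ slightly from ratios computed on unrounded measurements.

```python
# Values copied from the results table (seconds and GiB).
original = {"load_s": 8.15, "infer_s": 213.69, "total_s": 221.83, "peak_gib": 15.64}
quantized = {"load_s": 28.56, "infer_s": 29.04, "total_s": 57.60, "peak_gib": 9.94}

infer_speedup = original["infer_s"] / quantized["infer_s"]
total_speedup = original["total_s"] / quantized["total_s"]
load_slowdown = quantized["load_s"] / original["load_s"]
vram_drop_pct = (1 - quantized["peak_gib"] / original["peak_gib"]) * 100

print(f"inference speedup: {infer_speedup:.2f}x")
print(f"total speedup:     {total_speedup:.2f}x")
print(f"load slowdown:     {load_slowdown:.1f}x")
print(f"peak VRAM drop:    {vram_drop_pct:.1f}%")
```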

## Quick Start

Run the quantization script, then the validation script, in the existing `ernie-image` `conda` environment:

```bash
conda run -n ernie-image python quantize_ernie_image_nf4.py --overwrite
conda run -n ernie-image python validate_quantized_ernie_image.py
```

The default quantized model root directory is:

```text
ERNIE-Image-NF4/
```
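
Before running inference, it can be useful to confirm that the output directory contains every component listed under Quantization Setup. The check below is a sketch that assumes each component is stored in a sub-directory of the same name; `missing_components` is a hypothetical helper, not part of the project scripts.

```python
from pathlib import Path

# Sub-directories named in the Quantization Setup section.
EXPECTED_COMPONENTS = (
    "transformer", "text_encoder", "pe",                # quantized
    "vae", "scheduler", "tokenizer", "pe_tokenizer",    # original precision
)


def missing_components(model_dir: Path) -> list:
    """Return the expected component directories that are absent."""
    return [name for name in EXPECTED_COMPONENTS if not (model_dir / name).is_dir()]


if __name__ == "__main__":
    missing = missing_components(Path("ERNIE-Image-NF4"))
    print("missing:", missing if missing else "none")
```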

## Inference Demo

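The following minimal example loads the three quantized components onto GPU 0, assembles the pipeline around them, and generates one image with the benchmark settings: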

```python
from pathlib import Path

import torch
from diffusers import AutoModel, ErnieImagePipeline
from transformers import AutoModel as TransformersAutoModel
from transformers import AutoModelForCausalLM

model_dir = Path("ERNIE-Image-NF4")

# Load each pre-quantized component from its sub-directory onto GPU 0.
transformer = AutoModel.from_pretrained(
    str(model_dir / "transformer"),
    dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

text_encoder = TransformersAutoModel.from_pretrained(
    str(model_dir / "text_encoder"),
    dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

pe = AutoModelForCausalLM.from_pretrained(
    str(model_dir / "pe"),
    dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

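# Build the pipeline around the preloaded components; the remaining
# components are loaded from model_dir.
pipe = ErnieImagePipeline.from_pretrained(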
    str(model_dir),
    transformer=transformer,
    text_encoder=text_encoder,
    pe=pe,
    dtype=torch.bfloat16,
    local_files_only=True,
)
# The VAE is not quantized; move it to the GPU in bfloat16.
pipe.vae.to("cuda", dtype=torch.bfloat16)

image = pipe(
    prompt="一只橙色的小猫坐在木桌上，前面放着一张写有 Hello Ernie-Image NF4 的白纸，柔和自然光，写实风格。",
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    use_pe=True,
    generator=torch.Generator(device="cuda").manual_seed(1234),
).images[0]

image.save(model_dir / "demo_output.png")
```

## References

- [ERNIE-Image](https://huggingface.co/baidu/ERNIE-Image)
- [BitsAndBytes](https://hugging-face.cn/docs/transformers/quantization/bitsandbytes)
- [Diffusers](https://github.com/huggingface/diffusers)