---
license: apache-2.0
library_name: diffusers
pipeline_tag: text-to-image
tags:
- text-to-image
- diffusers
- ernie-image
- nf4
- quantized
- bitsandbytes
base_model: baidu/ERNIE-Image
---

# ERNIE-Image-NF4

`ERNIE-Image-NF4` is a 4-bit NF4 quantization of `ERNIE-Image`, produced with BitsAndBytes. The goal is to keep the original inference workflow as close to unchanged as possible while significantly reducing model size and deployment cost.

<div align="center">
  <img src="https://ai-studio-static-online.cdn.bcebos.com/6199dc6e04384c83a45f80884f40698bf76880207f364c8192a5df3bc6c226d9" />
</div>

> ⚠️ **Quality Disclaimer:** This quantized model has **NOT** been evaluated for generation quality degradation. No systematic side-by-side comparison (e.g., FID, CLIP Score, human evaluation) has been conducted between the original `ERNIE-Image` and `ERNIE-Image-NF4`. NF4 quantization may introduce noticeable artifacts, loss of fine detail, reduced prompt adherence, or other quality regressions, especially in challenging scenarios such as complex compositions, small text rendering, or subtle color gradients. Users are strongly advised to run their own quality evaluations before relying on this model for production or quality-sensitive use cases.

## Quantization Setup

This project uses the BitsAndBytes quantization path supported by Hugging Face Diffusers and Transformers:

- Quantization type: `4-bit NF4`
- Double quantization: `bnb_4bit_use_double_quant=True`
- Compute dtype: `bfloat16`
- Quantized components: `transformer`, `text_encoder`, `pe`
- Components kept in original precision: `vae`, `scheduler`, `tokenizer`, `pe_tokenizer`
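
The settings above map onto the `BitsAndBytesConfig` classes shipped with Diffusers and Transformers. The following is a minimal sketch of how such configs could be built, not the exact script used to produce this repository; the helper name `build_bnb_configs` and the `NF4_SETTINGS` dictionary are hypothetical, and the code assumes `torch`, `diffusers`, `transformers`, and `bitsandbytes` are installed.

```python
# Settings mirroring the list above (names of the keys match BitsAndBytesConfig arguments).
NF4_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",  # dtype name, resolved to a torch dtype below
}


def build_bnb_configs():
    """Hypothetical helper: return (diffusers_config, transformers_config)."""
    # Imported lazily so the settings can be inspected without the heavy dependencies.
    import torch
    from diffusers import BitsAndBytesConfig as DiffusersBnbConfig
    from transformers import BitsAndBytesConfig as TransformersBnbConfig

    kwargs = dict(NF4_SETTINGS)
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return DiffusersBnbConfig(**kwargs), TransformersBnbConfig(**kwargs)
```

Under this sketch, the Diffusers config would be passed as `quantization_config=` when loading `transformer`, and the Transformers config when loading `text_encoder` and `pe`.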

Quantization metadata is stored in:

- `quantization_metadata.json`
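
To inspect the metadata file, a short sketch like the one below may help. It assumes `quantization_metadata.json` sits at the root of the downloaded model directory and makes no assumption about its schema; the function names are hypothetical.

```python
import json
from pathlib import Path


def load_quantization_metadata(model_dir):
    """Parse quantization_metadata.json from model_dir (assumed location: repository root)."""
    path = Path(model_dir) / "quantization_metadata.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def top_level_keys(metadata):
    """Return the sorted top-level keys if the file holds a JSON object, else an empty list."""
    return sorted(metadata) if isinstance(metadata, dict) else []
```

For example, `top_level_keys(load_quantization_metadata("ERNIE-Image-NF4"))` lists whatever fields the file actually contains.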

## Performance Experiment

Experiment setup:

- GPU: `NVIDIA GeForce RTX 5090`
- Prompt: `A realistic orange kitten sits on a wooden desk with a white sheet of paper that says Hello ERNIE-Image NF4, soft natural lighting, centered composition.`
- Resolution: `512x512`
- Inference steps: `50`
- `guidance_scale=4.0`
- `seed=1234`
- Original model execution: `enable_model_cpu_offload()`
- Quantized model execution: quantized `transformer`, `text_encoder`, and `pe` on GPU, with `vae` kept in `bfloat16`

Results:

| Metric | Original ERNIE-Image | ERNIE-Image-NF4 | Observation |
|---|---:|---:|---|
| Load time | 3.60 s | 7.70 s | Quantized model loads about `2.14x` slower |
| Inference time | 50.52 s | 23.23 s | Quantized model is about `2.17x` faster |
| Total time | 54.12 s | 30.93 s | Quantized model is about `1.75x` faster overall |
| Peak reserved VRAM | 15.62 GiB | 9.95 GiB | Peak memory drops by about `36.29%` |
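
The figures in the Observation column are simple ratios of the measured values. The sketch below recomputes them from the rounded numbers in the table; the function names are illustrative only.

```python
def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def percent_reduction(baseline, candidate):
    """Relative drop from baseline to candidate, in percent."""
    return (baseline - candidate) / baseline * 100.0


# Values copied from the single-run table above.
inference_speedup = speedup(50.52, 23.23)
total_speedup = speedup(54.12, 30.93)
vram_drop = percent_reduction(15.62, 9.95)

print(f"inference: {inference_speedup:.2f}x")
print(f"total:     {total_speedup:.2f}x")
print(f"VRAM drop: {vram_drop:.2f}%")
```

Recomputing from the rounded table values gives a VRAM drop of 36.30%, marginally different from the 36.29% listed; the listed figure was presumably derived from unrounded measurements.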

### 10 Consecutive Inference Benchmark

To better reflect sustained generation throughput, an additional benchmark was run on the same machine: the model was loaded once, then `10` images were generated consecutively under identical settings.

Experiment setup:

- GPU: `NVIDIA GeForce RTX 5090`
- Prompt: `A realistic orange kitten sits on a wooden desk with a white sheet of paper that says Hello ERNIE-Image NF4, soft natural lighting, centered composition.`
- Resolution: `512x512`
- Inference steps: `50`
- `guidance_scale=4.0`
- Base seed: `1234`, incremented by `1` for each image
- Original model execution: `enable_model_cpu_offload()`
- Quantized model execution: quantized `transformer`, `text_encoder`, and `pe` on GPU, with `vae` kept in `bfloat16`

Results:

| Metric | Original ERNIE-Image | ERNIE-Image-NF4 | Observation |
|---|---:|---:|---|
| Load time | 3.72 s | 7.77 s | Quantized model loads about `2.09x` slower |
| Total time for 10 images | 470.70 s | 241.54 s | Quantized model is about `1.95x` faster |
| Average inference time per image | 47.07 s | 24.15 s | Quantized model is about `1.95x` faster |
| Peak reserved VRAM | 15.66 GiB | 10.01 GiB | Reserved memory drops by about `36.07%` |
| Peak allocated VRAM | 15.48 GiB | 9.25 GiB | Allocated memory drops by about `40.23%` |
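
A benchmark of this shape can be sketched as follows. This is not the script used for the numbers above, only an illustration under stated assumptions: `run_benchmark` and its `generate` callback are hypothetical names, load time is assumed to be measured separately, and peak memory is read from PyTorch's CUDA allocator statistics when a GPU is present.

```python
import time


def run_benchmark(generate, n_images=10, base_seed=1234):
    """Time n_images consecutive calls to generate(seed), incrementing the seed by 1 each time."""
    try:
        import torch
        use_cuda = torch.cuda.is_available()
    except ImportError:
        torch, use_cuda = None, False

    if use_cuda:
        torch.cuda.reset_peak_memory_stats()  # start peak tracking from a clean state

    seeds, per_image_s = [], []
    for i in range(n_images):
        seed = base_seed + i
        start = time.perf_counter()
        generate(seed)
        if use_cuda:
            torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
        per_image_s.append(time.perf_counter() - start)
        seeds.append(seed)

    total_s = sum(per_image_s)
    result = {
        "seeds": seeds,
        "per_image_s": per_image_s,
        "total_s": total_s,
        "avg_s": total_s / n_images if n_images else 0.0,
    }
    if use_cuda:
        gib = 1024 ** 3
        result["peak_reserved_gib"] = torch.cuda.max_memory_reserved() / gib
        result["peak_allocated_gib"] = torch.cuda.max_memory_allocated() / gib
    return result
```

With a loaded pipeline `pipe`, the callback could wrap a call such as `pipe(prompt=..., generator=torch.Generator(device="cuda").manual_seed(seed))`.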

Interpretation:

- The original model needs CPU offload in this setup, which makes its inference noticeably slower.
- The quantized model has a more complex cold start, so its load time is longer.
- Once sampling starts, the quantized model is significantly faster on this machine (about `1.95x` per image over 10 consecutive runs).
- Lower peak VRAM is one of the main reasons the quantized version is easier to run efficiently.

## Inference Demo

The minimal example below loads the quantized components from a local `ERNIE-Image-NF4` directory and generates one image:

```python
from pathlib import Path

import torch
from diffusers import AutoModel, ErnieImagePipeline
from transformers import AutoModel as TransformersAutoModel
from transformers import AutoModelForCausalLM

# Local directory holding the downloaded ERNIE-Image-NF4 repository.
model_dir = Path("ERNIE-Image-NF4")

# Load the three pre-quantized components directly onto GPU 0.
# Loading BitsAndBytes-quantized weights requires the `bitsandbytes` package.
transformer = AutoModel.from_pretrained(
    str(model_dir / "transformer"),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

text_encoder = TransformersAutoModel.from_pretrained(
    str(model_dir / "text_encoder"),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

pe = AutoModelForCausalLM.from_pretrained(
    str(model_dir / "pe"),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

# Assemble the pipeline around the quantized components; the remaining
# parts (VAE, scheduler, tokenizers) are loaded from model_dir as usual.
pipe = ErnieImagePipeline.from_pretrained(
    str(model_dir),
    transformer=transformer,
    text_encoder=text_encoder,
    pe=pe,
    torch_dtype=torch.bfloat16,
    local_files_only=True,
)
# The VAE is not quantized, so move it to the GPU explicitly in bfloat16.
pipe.vae.to("cuda", dtype=torch.bfloat16)

image = pipe(
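    prompt="An orange kitten sits on a wooden table with a white sheet of paper in front of it that reads Hello Ernie-Image NF4, the paper and text centered in the frame, soft natural light, realistic style.",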
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    use_pe=True,
    generator=torch.Generator(device="cuda").manual_seed(1234),
).images[0]

image.save(model_dir / "demo_output.png")
```
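
The demo above depends on several packages and on a CUDA device. A small preflight check along the following lines can surface a missing dependency before any weights are loaded. It is a sketch only: the package list is an assumption based on the imports and `device_map` usage in the demo, and the function names are hypothetical.

```python
import importlib.util

# Assumed requirements: the demo's imports, plus bitsandbytes for the NF4 weights
# and accelerate for device_map placement.
REQUIRED_PACKAGES = ("torch", "diffusers", "transformers", "bitsandbytes", "accelerate")


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the subset of names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def cuda_available():
    """Return True only if torch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


if __name__ == "__main__":
    print("Missing packages:", missing_packages() or "none")
    print("CUDA available:", cuda_available())
```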

## References

- [ERNIE-Image](https://huggingface.co/baidu/ERNIE-Image)
- [BitsAndBytes](https://hugging-face.cn/docs/transformers/quantization/bitsandbytes)
- [Diffusers](https://github.com/huggingface/diffusers)