---
license: apache-2.0
library_name: diffusers
pipeline_tag: text-to-image
tags:
- text-to-image
- diffusers
- ernie-image
- nf4
- quantized
- bitsandbytes
base_model: baidu/ERNIE-Image
---
# ERNIE-Image-NF4
`ERNIE-Image-NF4` is a BitsAndBytes 4-bit NF4 quantized version of `ERNIE-Image`. It aims to keep the original inference workflow largely unchanged while reducing model size and GPU memory requirements; in the benchmarks below, peak reserved VRAM drops by about 36%.
> ⚠️ **Quality Disclaimer:** This quantized model has **NOT** been evaluated for generation quality degradation. No systematic side-by-side comparison (e.g., FID, CLIP Score, human evaluation) has been conducted between the original `ERNIE-Image` and `ERNIE-Image-NF4`. NF4 quantization may introduce noticeable artifacts, loss of fine detail, reduced prompt adherence, or other quality regressions — especially in challenging scenarios such as complex compositions, small text rendering, or subtle color gradients. Users are strongly advised to run their own quality evaluations before relying on this model for production or quality-sensitive use cases.
## Quantization Setup
This project uses the BitsAndBytes quantization path supported by Hugging Face Diffusers and Transformers:
- Quantization type: `4-bit NF4`
- Double quantization: `bnb_4bit_use_double_quant=True`
- Compute dtype: `bfloat16`
- Quantized components: `transformer`, `text_encoder`, `pe`
- Components kept in original precision: `vae`, `scheduler`, `tokenizer`, `pe_tokenizer`
Quantization metadata is stored in:
- `quantization_metadata.json`
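
For reference, the settings listed above map onto `BitsAndBytesConfig` keyword arguments roughly as follows. This is a minimal sketch, not the script used to produce this checkpoint: the `NF4_KWARGS` name and the `build_bnb_configs` helper are illustrative, and the sketch assumes that Diffusers, Transformers, and PyTorch are installed with BitsAndBytes support.

```python
# Settings from the list above, expressed as BitsAndBytesConfig keyword arguments.
NF4_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",  # name of the torch dtype, resolved below
}


def build_bnb_configs():
    """Return (diffusers_config, transformers_config) built from NF4_KWARGS.

    Hypothetical helper: imports are deferred so the settings can be
    inspected on a machine without the heavy dependencies installed.
    """
    import torch
    from diffusers import BitsAndBytesConfig as DiffusersBnbConfig
    from transformers import BitsAndBytesConfig as TransformersBnbConfig

    # Replace the dtype name with the actual torch dtype object.
    kwargs = dict(NF4_KWARGS)
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    return DiffusersBnbConfig(**kwargs), TransformersBnbConfig(**kwargs)
```

In such a setup, the Diffusers config would be passed as `quantization_config` when loading `transformer`, and the Transformers config when loading `text_encoder` and `pe`. The checkpoints in this repository are already quantized, so the demo below loads them without passing a config.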
## Performance Experiment
Experiment setup:
- GPU: `NVIDIA GeForce RTX 5090`
- Prompt: `A realistic orange kitten sits on a wooden desk with a white sheet of paper that says Hello ERNIE-Image NF4, soft natural lighting, centered composition.`
- Resolution: `512x512`
- Inference steps: `50`
- `guidance_scale=4.0`
- `seed=1234`
- Original model execution: `enable_model_cpu_offload()`
- Quantized model execution: quantized `transformer`, `text_encoder`, and `pe` on GPU, with `vae` kept in `bfloat16`
Results:
| Metric | Original ERNIE-Image | ERNIE-Image-NF4 | Observation |
|---|---:|---:|---|
| Load time | 3.60 s | 7.70 s | Quantized model loads about `2.14x` slower |
| Inference time | 50.52 s | 23.23 s | Quantized model is about `2.17x` faster |
| Total time | 54.12 s | 30.93 s | Quantized model is about `1.75x` faster overall |
| Peak reserved VRAM | 15.62 GiB | 9.95 GiB | Peak memory drops by about `36.29%` |
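
The ratios in the Observation column follow directly from the measured values. The short sketch below recomputes them from the rounded numbers in the table; the helper names `speedup` and `reduction_pct` are illustrative. Because the inputs are rounded to two decimals, the last digit can differ slightly from the reported figure (for example, the VRAM reduction comes out as 36.30% here).

```python
def speedup(baseline_s, candidate_s):
    """How many times faster the candidate is than the baseline."""
    return baseline_s / candidate_s


def reduction_pct(baseline, candidate):
    """Relative reduction of the candidate versus the baseline, in percent."""
    return (baseline - candidate) / baseline * 100


# Values copied from the single-run results table above.
print(f"Inference speed-up: {speedup(50.52, 23.23):.2f}x")
print(f"Total speed-up:     {speedup(54.12, 30.93):.2f}x")
print(f"Load slowdown:      {speedup(7.70, 3.60):.2f}x")
print(f"Peak VRAM drop:     {reduction_pct(15.62, 9.95):.2f}%")
```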
### 10 Consecutive Inference Benchmark
To better reflect sustained generation throughput, an additional benchmark was run on the same machine: each model was loaded once and then used for `10` consecutive generations under otherwise identical settings.
Experiment setup:
- GPU: `NVIDIA GeForce RTX 5090`
- Prompt: `A realistic orange kitten sits on a wooden desk with a white sheet of paper that says Hello ERNIE-Image NF4, soft natural lighting, centered composition.`
- Resolution: `512x512`
- Inference steps: `50`
- `guidance_scale=4.0`
- Base seed: `1234`, incremented by `1` for each image
- Original model execution: `enable_model_cpu_offload()`
- Quantized model execution: quantized `transformer`, `text_encoder`, and `pe` on GPU, with `vae` kept in `bfloat16`
Results:
| Metric | Original ERNIE-Image | ERNIE-Image-NF4 | Observation |
|---|---:|---:|---|
| Load time | 3.72 s | 7.77 s | Quantized model loads about `2.09x` slower |
| Total time for 10 images | 470.70 s | 241.54 s | Quantized model is about `1.95x` faster |
| Average inference time per image | 47.07 s | 24.15 s | Quantized model is about `1.95x` faster |
| Peak reserved VRAM | 15.66 GiB | 10.01 GiB | Reserved memory drops by about `36.07%` |
| Peak allocated VRAM | 15.48 GiB | 9.25 GiB | Allocated memory drops by about `40.23%` |

Interpretation:

- The original model needs `enable_model_cpu_offload()` in this setup, so its inference is noticeably slower. The measured speed-up therefore includes the cost of offloading, not only the effect of quantization.
- The quantized model has a more complex cold start, so its load time is roughly twice as long (about 7.7 s versus 3.7 s).
- Once sampling starts, the quantized model is about `1.95x` faster per image on this machine.
- Peak reserved VRAM drops from 15.66 GiB to 10.01 GiB, which is one of the main reasons the quantized version is easier to run efficiently.
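
The benchmark script itself is not part of this card. As a rough illustration of how such numbers could be collected, the sketch below times repeated calls and reads PyTorch's peak memory counters. The `measure` helper is hypothetical, and it is an assumption that the reported figures correspond to `torch.cuda.max_memory_reserved()` and `torch.cuda.max_memory_allocated()`.

```python
import time


def measure(fn, runs=1):
    """Call fn(i) for i in range(runs); return timing and peak CUDA memory stats."""
    if runs < 1:
        raise ValueError("runs must be at least 1")

    try:
        import torch
        use_cuda = torch.cuda.is_available()
    except ImportError:  # timing still works without PyTorch
        torch, use_cuda = None, False

    if use_cuda:
        # Start peak tracking from the current state and flush pending GPU work.
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.synchronize()

    start = time.perf_counter()
    for i in range(runs):
        fn(i)
    if use_cuda:
        torch.cuda.synchronize()  # wait for queued kernels before stopping the clock
    total = time.perf_counter() - start

    gib = 1024 ** 3
    return {
        "total_s": total,
        "avg_s": total / runs,
        "peak_reserved_gib": torch.cuda.max_memory_reserved() / gib if use_cuda else None,
        "peak_allocated_gib": torch.cuda.max_memory_allocated() / gib if use_cuda else None,
    }
```

For the 10-image benchmark, `fn` would wrap a pipeline call that seeds its generator with `1234 + i`, and `measure` would be called with `runs=10`.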
## Inference Demo
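Below is a minimal runnable example. It assumes the repository has been downloaded to a local `ERNIE-Image-NF4` directory (the loaders use `local_files_only=True`) and that a CUDA GPU is available as device `0`: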
```python
from pathlib import Path

import torch
from diffusers import AutoModel, ErnieImagePipeline
from transformers import AutoModel as TransformersAutoModel
from transformers import AutoModelForCausalLM

# Local directory containing the downloaded ERNIE-Image-NF4 repository.
model_dir = Path("ERNIE-Image-NF4")

# Load the three quantized components directly onto GPU 0.
transformer = AutoModel.from_pretrained(
    str(model_dir / "transformer"),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)
text_encoder = TransformersAutoModel.from_pretrained(
    str(model_dir / "text_encoder"),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)
pe = AutoModelForCausalLM.from_pretrained(
    str(model_dir / "pe"),
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    local_files_only=True,
)

# Assemble the pipeline around the pre-loaded quantized components.
pipe = ErnieImagePipeline.from_pretrained(
    str(model_dir),
    transformer=transformer,
    text_encoder=text_encoder,
    pe=pe,
    torch_dtype=torch.bfloat16,
    local_files_only=True,
)

# The VAE is not quantized; move it to the GPU in bfloat16.
pipe.vae.to("cuda", dtype=torch.bfloat16)

image = pipe(
    prompt="An orange kitten sits on a wooden table; in front of it is a sheet of white paper that reads Hello Ernie-Image NF4. The paper and text are centered in the frame, soft natural light, realistic style.",
    height=512,
    width=512,
    num_inference_steps=50,
    guidance_scale=4.0,
    use_pe=True,
    generator=torch.Generator(device="cuda").manual_seed(1234),  # fixed seed for reproducibility
).images[0]
image.save(model_dir / "demo_output.png")
```
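
Before running the demo, it can help to confirm that the environment has what the script relies on. The following is a small pre-flight sketch; the package list is an assumption inferred from the imports and the `device_map` argument above (which relies on `accelerate`), not an official requirements list, and `missing_packages` and `cuda_available` are illustrative helper names.

```python
import importlib.util

# Assumed requirements, inferred from the demo script rather than stated by the authors.
REQUIRED_PACKAGES = ["torch", "diffusers", "transformers", "accelerate", "bitsandbytes"]


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def cuda_available():
    """Report whether PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    elif not cuda_available():
        print("No CUDA device detected; the demo expects GPU 0.")
    else:
        print("Environment looks ready for the demo.")
```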
## References
- [ERNIE-Image](https://huggingface.co/baidu/ERNIE-Image)
- [BitsAndBytes](https://hugging-face.cn/docs/transformers/quantization/bitsandbytes)
- [Diffusers](https://github.com/huggingface/diffusers)