---
license: apache-2.0
base_model: baidu/ERNIE-Image-Turbo
pipeline_tag: text-to-image
library_name: diffusers
tags:
- text-to-image
- diffusers
- safetensors
- ernie-image
- sdnq
- quantized
- uint4
- static
- quantized-matmul
---

# ERNIE-Image-Turbo SDNQ UINT4 Static

This is a 4-bit SDNQ static quantization of [baidu/ERNIE-Image-Turbo](https://huggingface.co/baidu/ERNIE-Image-Turbo).

The published SDNQ configs set `use_quantized_matmul=true` for `pe`, `text_encoder`, `transformer`, and the pipeline-level config. For current SDNQ/Diffusers builds, enable quantized matmul explicitly after loading with `apply_sdnq_options_to_model`; the serialized flag is retained in metadata, but may not be applied automatically by `from_pretrained()`.

## Recipe

- Base model: `baidu/ERNIE-Image-Turbo`
- Quantizer: `sdnq` / SDNQ UINT4 static, `dequantize_fp32=false`
- Quantized components: `pe`, `text_encoder`, `transformer`
- Runtime validation: `use_quantized_matmul=true`
- Validation GPU: NVIDIA RTX 6000 Ada Generation
- Validation settings: 10 fixed prompt/seed pairs, 8 inference steps, guidance scale 1.0, `use_pe=False`
- Runtime note: do not set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32` for this pipeline; it caused allocator over-reservation and much slower denoising in validation.
- Machine-readable runtime recommendations are stored in `runtime_config.json`.

`use_pe=False` is used for the headline validation table to compare the image models directly. Stage-level debugging showed that `use_pe=True` can dominate latency: on the `1200x896` technical-diagram prompt, `pe.forward` accounted for most of the runtime, while the denoising transformer took a much smaller share.

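
The allocator runtime note in the recipe can be turned into a small startup check. The sketch below is illustrative only: the helper name `risky_allocator_options` is hypothetical, and flagging either option on its own is an assumption, since validation only measured the two options combined.

```python
import os

# Option prefixes from the runtime note in this card. Validation only measured
# them in combination; flagging either one alone is an assumption.
RISKY_ALLOC_OPTIONS = ("expandable_segments:true", "max_split_size_mb:")


def risky_allocator_options(env=None):
    """Return PYTORCH_CUDA_ALLOC_CONF entries that match the risky prefixes."""
    env = os.environ if env is None else env
    raw = env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    # The variable is a comma-separated list of option:value entries.
    entries = [part.strip() for part in raw.split(",") if part.strip()]
    return [e for e in entries if e.lower().startswith(RISKY_ALLOC_OPTIONS)]


if __name__ == "__main__":
    found = risky_allocator_options()
    if found:
        print("Consider unsetting these PYTORCH_CUDA_ALLOC_CONF entries:", found)
```

Running the check before the pipeline is loaded keeps the warning close to the point where the environment can still be corrected.
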
## Measured Results

| Model | PE | Load s | Load peak VRAM MiB | Cold inference s | Cold peak VRAM MiB | Hot mean s/img | Hot median s/img | Hot peak VRAM MiB |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Original BF16 | off | 91.84 | 29692 | 7.67 | 34840 | 7.69 | 7.67 | 34932 |
| SDNQ UINT4 static, serialized config path | off | 71.84 | 10172 | 16.10 | 15254 | 11.15 | 12.26 | 15390 |

The row above is preserved for reproducibility of the original validation run. A follow-up profiling pass found that current loaders may leave quantized matmul disabled unless it is applied explicitly after loading.

### Explicit Quantized-Matmul Runtime

With explicit `apply_sdnq_options_to_model(..., use_quantized_matmul=True)`, default PyTorch CUDA allocator settings, and no `torch.cuda.empty_cache()` between hot generations:

| Runtime | PE | Cold s | Hot mean s/img | Hot median s/img | Hot range s/img | Hot peak torch reserved MiB | Hot peak torch allocated MiB |
|---|---:|---:|---:|---:|---:|---:|---:|
| SDNQ UINT4 static + explicit qmm | off | 8.34 | 6.08 | 5.81 | 5.55-6.94 | 19540 | 19391 |

The slow component with PE disabled is the denoising transformer. In the corrected qmm profile, `transformer.forward` accounts for roughly `5.0-5.4s` of a `5.8-7.0s` hot generation on RTX 6000 Ada. `text_encoder.forward` is about `0.55-0.65s` after warmup, and `vae.decode` is usually about `0.15s`.

The allocator pitfall is large: with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:32`, the same explicit-qmm runtime reserved about `48 GiB` and measured `25.88s` hot median with `empty_cache=True`, or `15.86s` without `empty_cache`.

## Visual Comparison

[![Original BF16 vs SDNQ UINT4 static + quantized matmul, PE off](comparison/original_vs_sdnq_uint4_static_qmm_peoff_matrix.webp)](comparison/original_vs_sdnq_uint4_static_qmm_peoff_matrix.webp)

Individual prompt pairs are stored in `comparison/`, and full metrics are stored in `metrics/`.

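
Hot mean, median, and range figures like those in the measured results can be derived from per-image wall-clock timings. The following is a minimal sketch, not the harness used for this card: `time_hot_runs` and `summarize` are hypothetical helpers, the warmup handling is an assumption, and `generate` is assumed to block until the image is ready (if it does not, calling `torch.cuda.synchronize()` inside it is one way to make the timing meaningful).

```python
import statistics
import time


def time_hot_runs(generate, runs=10, warmup=1):
    """Call `generate()` repeatedly and return per-call wall-clock seconds.

    The first `warmup` calls run but are not recorded, so the returned
    timings describe hot (already warmed-up) generations only.
    """
    for _ in range(warmup):
        generate()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Summarize timings as mean, median, and min/max in seconds."""
    return {
        "mean": statistics.fmean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
    }
```

In practice `generate` would wrap a single pipeline call with fixed arguments, for example `lambda: pipe(prompt=..., num_inference_steps=8, guidance_scale=1.0, use_pe=False)`.
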
## Usage

```python
import torch
import sdnq  # registers SDNQ support
from diffusers import ErnieImagePipeline
from sdnq.loader import apply_sdnq_options_to_model

pipe = ErnieImagePipeline.from_pretrained(
    "WaveCut/ERNIE-Image-Turbo-SDNQ-uint4-static",
    torch_dtype=torch.bfloat16,
).to("cuda")

# Enable quantized matmul explicitly; from_pretrained() may not apply the serialized flag.
for name in ("pe", "text_encoder", "transformer"):
    component = getattr(pipe, name, None)
    if component is not None:
        setattr(pipe, name, apply_sdnq_options_to_model(component, use_quantized_matmul=True))

# Same settings as the validation runs: 8 steps, guidance scale 1.0, prompt enhancer off.
image = pipe(
    prompt="A clean modern poster with readable Cyrillic typography",
    width=1024,
    height=1024,
    num_inference_steps=8,
    guidance_scale=1.0,
    use_pe=False,
).images[0]
```

If you need maximum throughput, keep the model resident and avoid calling `torch.cuda.empty_cache()` between requests.

You can confirm the runtime state after loading:

```python
# Prints None for a component that is missing or has no quantization config.
for name in ("pe", "text_encoder", "transformer"):
    qcfg = getattr(getattr(pipe, name, None), "quantization_config", None)
    print(name, getattr(qcfg, "use_quantized_matmul", None))
```

## Prompt Set

| # | Prompt ID | Size | Seed | Focus |
|---:|---|---:|---:|---|
| 00 | `00-cyrillic-poster` | 1024x1024 | 41001 | Cyrillic event poster |
| 01 | `01-long-text-bakery-ad` | 896x1200 | 41002 | Long text product ad |
| 02 | `02-technical-diagram` | 1200x896 | 41003 | Technical diagram |
| 03 | `03-four-panel-comic` | 1024x1024 | 41004 | Four-panel comic |
| 04 | `04-public-domain-painter-fusion` | 1024x1024 | 41005 | Painterly style fusion |
| 05 | `05-dashboard-ui` | 1376x768 | 41006 | Dense UI dashboard |
| 06 | `06-glass-still-life` | 1024x1024 | 41007 | Glass and reflections |
| 07 | `07-botanical-field-guide` | 896x1200 | 41008 | Field guide plate |
| 08 | `08-restaurant-menu-board` | 1024x1024 | 41009 | Menu board text |
| 09 | `09-isometric-city-map` | 1200x896 | 41010 | Isometric map |

## Notes

- The comparison uses the same prompts, dimensions, seeds, 8 inference steps, and guidance scale for both original and quantized runs.
- `use_pe=True` remains supported by the pipeline, but it measures prompt-enhancer behavior in addition to image generation.
- Corrected qmm runtime metrics are stored in `metrics/ernie_uint4_qmm_explicit_default_allocator_8step_metrics.json`; allocator-debug metrics are stored in `metrics/runtime_allocator_debug_metrics.json`.
- This is an independent quantized artifact; see the original Baidu model card for upstream model details, benchmarks, and license terms.
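
To rerun the comparison with matching dimensions and seeds, the Prompt Set rows can be turned into per-call arguments. This is a rough sketch under assumptions: `call_kwargs` is a hypothetical helper, the `Size` column is read as width x height, only the first three rows are copied here, and the prompt texts themselves are not reproduced in this card.

```python
# Prompt ID, size, and seed copied from the Prompt Set table (first three rows).
PROMPT_SET = [
    ("00-cyrillic-poster", "1024x1024", 41001),
    ("01-long-text-bakery-ad", "896x1200", 41002),
    ("02-technical-diagram", "1200x896", 41003),
]


def call_kwargs(size, steps=8, guidance_scale=1.0, use_pe=False):
    """Build pipeline keyword arguments, assuming `size` is WIDTHxHEIGHT."""
    width, height = (int(v) for v in size.split("x"))
    return {
        "width": width,
        "height": height,
        "num_inference_steps": steps,
        "guidance_scale": guidance_scale,
        "use_pe": use_pe,
    }


for prompt_id, size, seed in PROMPT_SET:
    # How the seed was passed in validation is not documented here; a seeded
    # torch.Generator given as `generator=` is the usual Diffusers approach
    # and is an assumption in this sketch.
    print(prompt_id, seed, call_kwargs(size))
```
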