HiDream-O1-Image-Dev SDNQ - Dynamic UINT4 threshold 1e-2, fixed

Fastest balanced fixed variant. Dynamic quantization keeps the known artifact-producing down/output projections unquantized and automatically raises the bit width of difficult layers (here, 31 layers use int5 instead of uint4).

This repository is part of the fixed SDNQ 4-bit HiDream O1 quantization set. The previous broad 4-bit recipes produced a visible tiled/grid artifact. The fix keeps the sensitive decoder projection path in higher precision, especially model.language_model.layers.*.mlp.down_proj.weight.
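As a rough illustration of what "keeping a projection path in higher precision" means in practice, the sketch below filters parameter names against the pattern quoted above. It is not the recipe used to build this repository: the helper `keeps_high_precision` and the sample parameter names are hypothetical, and only the `down_proj` pattern comes from the text.

```python
from fnmatch import fnmatchcase

# Pattern quoted from the paragraph above; everything else here is illustrative.
KEEP_HIGH_PRECISION = ["model.language_model.layers.*.mlp.down_proj.weight"]


def keeps_high_precision(param_name, patterns=KEEP_HIGH_PRECISION):
    """Return True if the parameter name matches a protected pattern."""
    return any(fnmatchcase(param_name, pattern) for pattern in patterns)


# Hypothetical parameter names, used only to exercise the filter.
sample_names = [
    "model.language_model.layers.0.mlp.down_proj.weight",
    "model.language_model.layers.0.mlp.up_proj.weight",
    "model.language_model.layers.35.mlp.down_proj.weight",
]

protected = [name for name in sample_names if keeps_high_precision(name)]
quantizable = [name for name in sample_names if not keeps_high_precision(name)]
print("keep in BF16:", protected)
print("quantize:", quantizable)
```

A name-based filter like this is one common way to exclude modules from quantization; the actual configuration is stored in quantization_config.json.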

Comparison

Benchmarks were run on an NVIDIA RTX PRO 6000 Blackwell Workstation Edition with the HiDream O1 repository inference path, BF16 runtime, 28 flash-scheduler steps, and flash attention disabled for parity. The requested 1024x1024 size is snapped by O1 to 2048x2048.

Model	Best for	Avg gen time (s)	Gen time vs BF16	Peak alloc (GiB)	VRAM saved	Param storage (GiB)	Storage saved	Quantized layers	Quantized params (B)
Original BF16	Baseline quality/reference	8.20	-	17.11	-	16.40	-	-	0.00
Dynamic UINT4 threshold 1e-2, fixed	Fast balanced	9.03	+10%	10.60	+38%	9.87	+40%	int5:31, uint4:265	5.10
Static UINT4 + SVD r32, o/down BF16 guard	Safe default	9.17	+12%	10.46	+39%	9.71	+41%	uint4:296	5.10
Static UINT4 + SVD r32, down_proj BF16	Minimal fix	9.24	+13%	9.66	+44%	8.92	+46%	uint4:332	5.71
Static UINT4 + SVD r32, last 8 o/down BF16	Lowest VRAM	9.45	+15%	7.98	+53%	7.23	+56%	uint4:352	6.98
Static UINT4 + SVD r32, last 16 o/down BF16	Memory/quality	9.35	+14%	8.70	+49%	7.94	+52%	uint4:336	6.44
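
The relative columns in the table can be reproduced from the raw numbers. The sketch below recomputes "Gen time vs BF16", "VRAM saved", and "Storage saved" from the values in the table; the function and variable names are illustrative, and rounding to whole percent is an assumption that happens to match every row.

```python
# Baseline (Original BF16) values taken from the table.
BASELINE = {"gen_s": 8.20, "peak_gib": 17.11, "storage_gib": 16.40}

# (avg gen time s, peak alloc GiB, param storage GiB) per variant, from the table.
ROWS = {
    "Dynamic UINT4 threshold 1e-2, fixed": (9.03, 10.60, 9.87),
    "Static UINT4 + SVD r32, o/down BF16 guard": (9.17, 10.46, 9.71),
    "Static UINT4 + SVD r32, down_proj BF16": (9.24, 9.66, 8.92),
    "Static UINT4 + SVD r32, last 8 o/down BF16": (9.45, 7.98, 7.23),
    "Static UINT4 + SVD r32, last 16 o/down BF16": (9.35, 8.70, 7.94),
}


def relative_columns(gen_s, peak_gib, storage_gib, base=BASELINE):
    """Return (slowdown %, VRAM saved %, storage saved %) versus the baseline."""
    slowdown = round((gen_s / base["gen_s"] - 1) * 100)
    vram_saved = round((1 - peak_gib / base["peak_gib"]) * 100)
    storage_saved = round((1 - storage_gib / base["storage_gib"]) * 100)
    return slowdown, vram_saved, storage_saved


for name, values in ROWS.items():
    slowdown, vram, storage = relative_columns(*values)
    print(f"{name}: +{slowdown}% time, {vram}% VRAM saved, {storage}% storage saved")
```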

Variant Strengths

Dynamic UINT4 threshold 1e-2, fixed: Fastest balanced fixed variant. Dynamic quantization keeps the known artifact-producing down/output projections unquantized and automatically raises the bit width of difficult layers (here, 31 layers use int5 instead of uint4).
Static UINT4 + SVD r32, o/down BF16 guard: Conservative default. Keeps both attention output and MLP down projections in BF16, the visually accepted fix for the tiled-grid artifact.
Static UINT4 + SVD r32, down_proj BF16: Smallest root-cause fix. Only MLP down projections are kept in BF16 beyond the standard output/embed skips; this isolates down_proj as the main grid culprit.
Static UINT4 + SVD r32, last 8 o/down BF16: Most memory-efficient clean-looking compromise from the matrix. It protects only the last 8 decoder layers' o/down projections.
Static UINT4 + SVD r32, last 16 o/down BF16: Safer memory-efficient compromise. It protects the last 16 decoder layers' o/down projections and keeps much lower storage than the full o/down guard.
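
To make the difference between the static guards concrete, the sketch below lists which projections each guard would protect. It rests on assumptions: `protected_modules` is a hypothetical helper, `self_attn.o_proj` is an assumed Qwen-style module name (the source only names `mlp.down_proj`), and the layer count of 36 is inferred from the table's uint4 layer counts rather than stated anywhere in this document.

```python
def protected_modules(num_layers, last_n=None, include_o_proj=True):
    """List projections kept in BF16 for the last `last_n` decoder layers.

    With last_n=None every decoder layer is protected.
    """
    start = 0 if last_n is None else max(0, num_layers - last_n)
    suffixes = ["mlp.down_proj"]
    if include_o_proj:
        # Assumed module name for the attention output projection.
        suffixes.append("self_attn.o_proj")
    return [
        f"model.language_model.layers.{i}.{suffix}"
        for i in range(start, num_layers)
        for suffix in suffixes
    ]


# Assumption: 36 decoder layers, inferred from the table (332 - 296 == 36).
NUM_LAYERS = 36

guards = {
    "o/down BF16 guard": (protected_modules(NUM_LAYERS), 296),
    "down_proj BF16": (protected_modules(NUM_LAYERS, include_o_proj=False), 332),
    "last 8 o/down BF16": (protected_modules(NUM_LAYERS, last_n=8), 352),
    "last 16 o/down BF16": (protected_modules(NUM_LAYERS, last_n=16), 336),
}

# Protected count plus the table's uint4 count should be the same for every guard.
for name, (modules, uint4_layers) in guards.items():
    print(name, len(modules), "protected,", len(modules) + uint4_layers, "total")
```

Under these assumptions every guard sums to the same total of 368 candidate layers, which is consistent with the layer counts reported in the table.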

This Variant

Source model: HiDream-ai/HiDream-O1-Image-Dev
Source snapshot: 833d408a57a7c1e399757c7f2f174670726fd43c
Recipe: pub_dynamic_uint4_th1e2_fixed
SDNQ layer counts: {"int5": 31, "uint4": 265}
Quantized parameter counts: {"int5": 151750656, "uint4": 4949600256}
Benchmark average generation time: 9.03s
Benchmark peak allocated VRAM: 10.60 GiB
Parameter storage as saved: 9.87 GiB (6.53 GiB less than the 16.40 GiB BF16 original)
10-demo average generation time: 9.44s
10-demo peak allocated VRAM: 10.60 GiB
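
The parameter counts above can be cross-checked against the "Quantized params" column of the comparison table. The sketch below sums them and also estimates an idealized packed size; that estimate assumes perfectly packed 4-bit and 5-bit weights and ignores scales, zero points, and any other quantization metadata, so it is only a lower bound and not the reported storage figure.

```python
# Counts copied from the "Quantized parameter counts" line above.
QUANTIZED_PARAMS = {"int5": 151_750_656, "uint4": 4_949_600_256}
BITS_PER_PARAM = {"int5": 5, "uint4": 4}

GIB = 2**30


def ideal_packed_gib(counts, bits):
    """Idealized size of the packed weights in GiB (no scales or metadata)."""
    total_bits = sum(counts[kind] * bits[kind] for kind in counts)
    return total_bits / 8 / GIB


total_params = sum(QUANTIZED_PARAMS.values())
total_billions = total_params / 1e9          # matches the 5.10 B in the table
bf16_gib = total_params * 2 / GIB            # same parameters at 2 bytes each
packed_gib = ideal_packed_gib(QUANTIZED_PARAMS, BITS_PER_PARAM)

print(f"quantized params: {total_billions:.2f} B")
print(f"as BF16: {bf16_gib:.2f} GiB, ideal packed: {packed_gib:.2f} GiB")
```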

Demo Comparisons

Each image in comparison/ shows the original BF16 output side by side with this quantized variant's output, generated with the same prompt, seed, and sampler settings.

Usage

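# Install the runtime dependencies.
pip install sdnq torch transformers diffusers accelerate einops pillow scipy torchvision
# Clone the HiDream O1 repository, which provides the models/ package imported below.
git clone https://github.com/HiDream-ai/HiDream-O1-Image
cd HiDream-O1-Image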

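# Run this from inside the cloned HiDream-O1-Image directory.
import torch
import sdnq  # not called directly; importing it makes the SDNQ quantization format available to the loader
from transformers import AutoProcessor
from models.qwen3_vl_transformers import Qwen3VLForConditionalGeneration  # from the HiDream O1 repository

model_id = "WaveCut/HiDream-O1-Image-Dev-SDNQ-4bit-dynamic-uint4-th1e-2"
processor = AutoProcessor.from_pretrained(model_id)
# Load the quantized checkpoint with a BF16 runtime dtype on the GPU.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="cuda",
).eval()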

Files

quantization_config.json - saved SDNQ config.
quantization_summary.json - quantized layer/parameter/storage summary.
benchmark_summary.json - matrix metrics plus 10-demo generation metrics.
comparison/00.jpg ... comparison/09.jpg - pairwise original vs quantized comparisons.
comparison/contact_sheet.jpg - compact overview of all 10 comparisons.