	---
	language:
	- en
	license: other
	license_name: flux-non-commercial-license
	license_link: LICENSE.md
	base_model:
	- black-forest-labs/FLUX.2-klein-9B
	base_model_relation: quantized
	library_name: diffusers
	pipeline_tag: text-to-image
	tags:
	- image-generation
	- image-editing
	- flux
	- flux2
	- Flux2KleinPipeline
	- sdnq
	- 4-bit
	- uint4
	- quantized
	- diffusers
	---

	# FLUX.2 Klein 9B SDNQ UINT4 Static

	Static UINT4 SDNQ quantization of
	[black-forest-labs/FLUX.2-klein-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-9B).

	This checkpoint is the deployment-oriented variant: it was the fastest
	candidate in the A40 benchmark (3.826 s warm average) and peaked at 14.8 GB of
	GPU memory versus 36.3 GB for the original BF16 pipeline, while visual
	differences in the prompt-following stress comparison were minor.

	Related checkpoint: for a quality-oriented dynamic SVD alternative with a
	modest latency and VRAM tradeoff, see
	[WaveCut/FLUX.2-klein-9B-SDNQ-float4_e4m0fnu-dynamic-th0p01-svd-r128-s32](https://huggingface.co/WaveCut/FLUX.2-klein-9B-SDNQ-float4_e4m0fnu-dynamic-th0p01-svd-r128-s32).

	![Full resolution comparison canvas](assets/flux2-sdnq-uint4-static-comparison.webp)

	The image above is a compressed WebP version of a 1:1 comparison canvas. It
	contains the original FLUX.2 Klein 9B, the previous SDNQ baseline, this
	`uint4-static` checkpoint, and a quality-oriented dynamic SVD candidate across
	text-heavy prompts including an additional Russian-only chalkboard prompt.

	## Why This Variant

	We compared a broad set of SDNQ 4-bit recipes on speed, VRAM, and visual
	quality. This `uint4-static` recipe was chosen because it offered the best
	deployment tradeoff:

	- Lowest latency among the final candidates in the single-process benchmark
	(3.826 s warm average).
	- Low runtime VRAM in a 1024x1024, 4-step image-generation pipeline (14.8 GB
	GPU peak).
	- Much smaller full-pipeline checkpoint footprint than the original BF16
	FLUX.2 Klein 9B checkpoint in the measured setup (12.2 GB versus 52.9 GB).
	- Visual differences versus the baseline and the original model were small in
	the stress set, including long text, signs, labels, small details, and a
	Russian chalkboard prompt.

	## Benchmark Setup

	Measurements below use a single NVIDIA A40 test host and a consistent
	`Flux2KleinPipeline` inference harness.

	- GPU: NVIDIA A40 46 GB
	- Resolution: 1024x1024
	- Steps: 4
	- Guidance scale: 0.0
	- Torch dtype: bfloat16
	- Quantized matmul: enabled for SDNQ inference comparisons
	- Batch/concurrency: single process

	These are deployment-oriented measurements for one hardware/software setup.
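
To re-check the latency figures on your own hardware, a warm-average timer along the following lines may be enough. This is a minimal sketch, not the harness used for this card: the helper names `warm_average` and `peak_cuda_gb`, the warm-up and run counts, and the use of `torch.cuda.max_memory_allocated()` with decimal gigabytes are all assumptions.

```python
import time
from statistics import mean


def warm_average(run_once, warmup_runs=1, timed_runs=3):
    """Return the mean wall-clock seconds of `timed_runs` calls after warm-up."""
    for _ in range(warmup_runs):
        run_once()  # discard cold runs (kernel compilation, cache fills)
    timings = []
    for _ in range(timed_runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return mean(timings)


def peak_cuda_gb():
    """Peak memory allocated by PyTorch on CUDA in GB, or None without CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.max_memory_allocated() / 1e9
```

Pass a zero-argument callable that runs one full generation, for example a `lambda` wrapping the pipeline call. If the callable does not already force the GPU to finish, call `torch.cuda.synchronize()` at its end so the timer covers the whole run.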

	## Candidate Benchmark

	Single-process inference metrics for the final candidate set:

	| Variant | Warm avg | GPU peak | CUDA allocated |
	| --- | ---: | ---: | ---: |
	| `uint4-static` | 3.826 s | 14.8 GB | 14.1 GB |
	| `int4-dynamic-th0p1-svd-r16-s32-g128` | 4.020 s | 14.3 GB | 13.5 GB |
	| `uint4-static-svd-r32-s32` | 4.070 s | 14.7 GB | 13.9 GB |
	| `float4_e4m0fnu-dynamic-th0p1-svd-r16-s32` | 4.116 s | 16.0 GB | 15.3 GB |
	| `float4_e4m0fnu-dynamic-th0p01-svd-r128-s32` | 4.185 s | 17.2 GB | 16.5 GB |

	## Stress Comparison

	This stress set contains 9 prompts with signs, chalkboards, posters, labels,
	timetables, small props, and a Russian-only chalkboard prompt. Each model below
	was run twice; the table reports the warm-run average.

	| Model | Warm avg | GPU peak | CUDA allocated | Prompt count |
	| --- | ---: | ---: | ---: | ---: |
	| Original `FLUX.2-klein-9B` BF16 pipeline | 4.244 s | 36.3 GB | 35.6 GB | 9 |
	| Previous SDNQ baseline | 4.079 s | 15.2 GB | 14.5 GB | 9 |
	| This `uint4-static` checkpoint | 3.866 s | 14.8 GB | 14.1 GB | 9 |
	| Dynamic SVD r128 quality candidate | 4.182 s | 17.2 GB | 16.5 GB | 9 |
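
The stress table can be read as savings relative to the BF16 pipeline. The short sketch below does that arithmetic; the numbers are copied from the table, while the dictionary keys and the helper name `relative_to_baseline` are illustrative only.

```python
# Warm average (s) and GPU peak (GB) copied from the stress table above.
stress_results = {
    "bf16-original": (4.244, 36.3),
    "previous-sdnq": (4.079, 15.2),
    "uint4-static": (3.866, 14.8),
    "dynamic-svd-r128": (4.182, 17.2),
}


def relative_to_baseline(results, baseline="bf16-original"):
    """Express each row as a speedup and a VRAM saving versus the baseline."""
    base_time, base_vram = results[baseline]
    summary = {}
    for name, (seconds, vram_gb) in results.items():
        summary[name] = {
            "speedup": base_time / seconds,
            "vram_saved_pct": 100.0 * (1.0 - vram_gb / base_vram),
        }
    return summary


summary = relative_to_baseline(stress_results)
for name, row in summary.items():
    print(f"{name}: {row['speedup']:.2f}x, {row['vram_saved_pct']:.1f}% less VRAM")
```

For `uint4-static` this works out to roughly a 1.10x speedup and about 59% lower peak GPU memory than the BF16 pipeline in this setup.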

	The model-card image is a WebP copy optimized from the full-resolution
	comparison canvas:

	| WebP quality | Size | RGB PSNR | Luma SSIM-like score |
	| ---: | ---: | ---: | ---: |
	| 85 | 5.72 MB | 46.93 dB | 0.999977 |

	The source JPEG canvas was about 13 MB; this WebP version is smaller while
	remaining visually close to the original artifact.
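
For context on the PSNR figure, the sketch below uses the standard PSNR definition for 8-bit data, `10 * log10(255^2 / MSE)`. The card does not say which tool produced its score, so treat this as a generic reference implementation; the function names are hypothetical.

```python
import math


def psnr_8bit(reference, candidate):
    """PSNR in dB between two equal-length sequences of 8-bit channel values."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(255.0 ** 2 / mse)


def rms_error_from_psnr(psnr_db):
    """Invert the formula: RMS error in 8-bit intensity levels for a given PSNR."""
    return 255.0 / (10.0 ** (psnr_db / 20.0))
```

Under this definition, 46.93 dB corresponds to an RMS error of a little over one intensity level out of 255.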

	## Model Size

	Approximate full-pipeline folder sizes in the measured setup:

	| Checkpoint | Folder size |
	| --- | ---: |
	| Original `black-forest-labs/FLUX.2-klein-9B` | 52.9 GB |
	| Previous SDNQ baseline | 12.6 GB |
	| This `uint4-static` checkpoint | 12.2 GB |
	| Dynamic SVD r128 candidate | 14.7 GB |
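
Folder sizes like these can be reproduced by summing file sizes on disk. This is a small sketch under two assumptions: the helper name `folder_size_gb` is hypothetical, and sizes are reported in decimal gigabytes, which the card does not specify.

```python
import os


def folder_size_gb(path):
    """Sum the sizes of all regular files under `path`, in decimal GB."""
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            # Skip symlinks so linked files are not counted at the link location.
            if not os.path.islink(file_path):
                total_bytes += os.path.getsize(file_path)
    return total_bytes / 1e9
```

Point it at a real checkpoint folder, such as a `save_pretrained` output directory. Hugging Face cache snapshot directories are typically made of symlinks, so this helper would undercount them.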

	## Usage

	Install current Diffusers and SDNQ:

	```bash
	pip install git+https://github.com/huggingface/diffusers.git
	pip install sdnq
	```

	Run with `Flux2KleinPipeline`:

	```python
	import torch
	from diffusers import Flux2KleinPipeline
	from sdnq import SDNQConfig # registers SDNQ support in diffusers/transformers
	from sdnq.common import use_torch_compile as triton_is_available
	from sdnq.loader import apply_sdnq_options_to_model

	repo_id = "WaveCut/FLUX.2-klein-9B-SDNQ-uint4-static"
	device = "cuda"

	pipe = Flux2KleinPipeline.from_pretrained(
	    repo_id,
	    torch_dtype=torch.bfloat16,
	)

	if triton_is_available and torch.cuda.is_available():
	    pipe.transformer = apply_sdnq_options_to_model(
	        pipe.transformer,
	        use_quantized_matmul=True,
	    )
	    pipe.text_encoder = apply_sdnq_options_to_model(
	        pipe.text_encoder,
	        use_quantized_matmul=True,
	    )

	pipe.to(device)

	prompt = "A clean editorial poster with large readable text: OPEN SOURCE IMAGE MODEL"
	image = pipe(
	    prompt=prompt,
	    height=1024,
	    width=1024,
	    num_inference_steps=4,
	    guidance_scale=0.0,
	    generator=torch.Generator(device=device).manual_seed(0),
	).images[0]

	image.save("flux2-klein-sdnq-uint4-static.png")
	```

	The same pipeline also supports image editing:

	```python
	from diffusers.utils import load_image

	input_image = load_image("input.png")
	image = pipe(
	    image=input_image,
	    prompt="Turn the handwritten sign into a clean printed sign while preserving the scene",
	    height=1024,
	    width=1024,
	    num_inference_steps=4,
	    guidance_scale=0.0,
	    generator=torch.Generator(device=device).manual_seed(1),
	).images[0]
	image.save("flux2-klein-sdnq-uint4-static-edit.png")
	```

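	If your GPU has less VRAM than the roughly 15 GB peak measured above, replace
	`pipe.to(device)` with `pipe.enable_model_cpu_offload()`, which keeps pipeline
	components on the CPU and moves each one to the GPU only while it runs.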
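
One way to make that choice automatically is to compare the GPU's total memory against the measured peak. The helper below is a hypothetical sketch: the name `choose_placement`, the 14.8 GB default (taken from the benchmark tables), and the 1 GB headroom are assumptions, and the real peak depends on resolution and software versions.

```python
def choose_placement(total_vram_gb, expected_peak_gb=14.8, headroom_gb=1.0):
    """Return "cuda" if the pipeline should fit on the GPU, else "cpu_offload"."""
    if total_vram_gb >= expected_peak_gb + headroom_gb:
        return "cuda"
    return "cpu_offload"


# Usage with the `pipe` object from the example above (not executed here):
#
#   total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
#   if choose_placement(total_gb) == "cuda":
#       pipe.to("cuda")
#   else:
#       pipe.enable_model_cpu_offload()
```

Offloading trades latency for memory, so the timings in this card would not carry over to the offloaded path.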

	## Quantization Recipe

	This checkpoint was produced with SDNQ post-load quantization over the
	`transformer` and `text_encoder` components of FLUX.2 Klein 9B.

	Recipe:

	```python
	variant = {
	    "weights_dtype": "uint4",
	    "use_dynamic_quantization": False,
	    "dynamic_loss_threshold": None,
	    "use_svd": False,
	    "svd_rank": 32,  # unused because use_svd is False
	    "svd_steps": 8,  # unused because use_svd is False
	    "group_size": 0,
	    "dequantize_fp32": False,
	    "quantized_matmul_dtype": None,
	    "use_quantized_matmul": False,
	    "use_stochastic_rounding": False,
	}
	```
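
To give an intuition for what a static `uint4` weight format stores, here is a textbook min/max asymmetric quantizer in plain Python. It is an illustration only and not SDNQ's implementation: SDNQ's grouping, rounding, and packing may differ, and the function names are hypothetical.

```python
def quantize_uint4(values):
    """Asymmetric 4-bit quantization: map floats to integer codes 0..15."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15.0 or 1.0  # fall back to 1.0 for constant inputs
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_uint4(codes, scale, zero_point):
    """Reconstruct approximate floats from 4-bit codes."""
    return [c * scale + zero_point for c in codes]


weights = [-0.8, -0.1, 0.0, 0.35, 0.7]
codes, scale, zero_point = quantize_uint4(weights)
restored = dequantize_uint4(codes, scale, zero_point)
```

Each value is stored as one of 16 levels plus a shared scale and offset, so the reconstruction error per value is at most half a quantization step.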

	Minimal quantization sketch:

	```python
	import torch
	from diffusers import Flux2KleinPipeline
	from sdnq import sdnq_post_load_quant
	from sdnq.loader import save_sdnq_model

	base_model = "black-forest-labs/FLUX.2-klein-9B"
	pipe = Flux2KleinPipeline.from_pretrained(
	    base_model,
	    torch_dtype=torch.bfloat16,
	)

	common_kwargs = dict(
	    weights_dtype="uint4",
	    torch_dtype=torch.bfloat16,
	    group_size=0,
	    svd_rank=32,
	    svd_steps=8,
	    dynamic_loss_threshold=None,
	    use_svd=False,
	    quant_conv=False,
	    quant_embedding=False,
	    use_quantized_matmul=False,
	    use_quantized_matmul_conv=False,
	    use_dynamic_quantization=False,
	    use_stochastic_rounding=False,
	    dequantize_fp32=False,
	    non_blocking=True,
	    add_skip_keys=True,
	    quantization_device="cuda",
	    return_device="cuda",
	)

	pipe.transformer = sdnq_post_load_quant(pipe.transformer, **common_kwargs)
	pipe.text_encoder = sdnq_post_load_quant(pipe.text_encoder, **common_kwargs)

	save_sdnq_model(
	    pipe,
	    "FLUX.2-klein-9B-SDNQ-uint4-static",
	    max_shard_size="5GB",
	    is_pipeline=True,
	)
	```

	## Limitations

	- This is a quantized derivative of FLUX.2 Klein 9B; it inherits the base
	model's limitations and acceptable-use requirements.
	- Text rendering can still be inaccurate, especially for long strings or small
	background text.
	- The quality comparison here is visual prompt-following evaluation, not a
	large-scale human preference or FID benchmark.
	- Benchmarks were run on an A40 test host and should be validated again for
	your exact serving stack.

	## License

	This model is a quantized derivative of
	[black-forest-labs/FLUX.2-klein-9B](https://huggingface.co/black-forest-labs/FLUX.2-klein-9B)
	and follows the FLUX Non-Commercial License. Please review `LICENSE.md` and the
	Black Forest Labs acceptable-use policy before use.