# PiD — Pixel Diffusion Decoder
> **TL;DR** — PiD is a plug-and-play diffusion decoder that replaces VAE/RAE decoders, turning latent representations directly into super-resolved pixels in a single pass.
https://github.com/user-attachments/assets/a556e2d4-5de5-4bcf-9daa-80f7ea6b2124
PiD reformulates the latent-to-pixel decoder as a conditional pixel-space diffusion
model, unifying decoding and upsampling into a single generative module.
It directly denoises in high-resolution pixel
space and produces a super-resolved image in one pass.
**[Paper](https://arxiv.org/abs/2605.23902), [Project Page](https://research.nvidia.com/labs/sil/projects/pid/), [Model Weights](https://huggingface.co/nvidia/PiD)**
[Yifan Lu](https://yifanlu0227.github.io/),
[Qi Wu](https://wilsoncernwq.github.io/),
[Jay Zhangjie Wu](https://zhangjiewu.github.io/),
[Zian Wang](https://www.cs.toronto.edu/~zianwang/),
[Huan Ling](https://www.cs.toronto.edu/~linghuan/),
[Sanja Fidler](https://www.cs.utoronto.ca/~fidler/),
[Xuanchi Ren](https://xuanchiren.com/)
## Installation
> [!TIP]
> **Quick Start**: if your environment already has PyTorch (with CUDA), `transformers>=4.57`, and `diffusers>=0.37`, you don't need to build a new conda env. Just install the small set of utility dependencies that the inference code imports eagerly, and you're ready to run the diffusers backbones (`flux`/`flux2`/`sd3`/`zimage`):
>
> ```bash
> pip install hydra-core==1.3.2 omegaconf==2.3.0 \
> attrs einops loguru termcolor fvcore iopath pynvml wandb \
> imageio opencv-python-headless pandas \
> safetensors "huggingface-hub>=1.0" sentencepiece boto3 botocore
> pip install -e .
> ```
>
> For the `dinov2` / `siglip` backbones you additionally need the upstream RAE / Scale-RAE repos plus a couple of extra packages — see [docs/dinov2_siglip.md](docs/dinov2_siglip.md).
Full conda-managed install (preferred if you're starting from scratch):
```bash
# 1. Create and activate the conda environment.
conda env create -f environment.yml
conda activate pid
# 2. Install this package in editable mode.
pip install -e .
```
## Checkpoints and assets
Pretrained PiD checkpoints live under `checkpoints/`. Each diffusers backbone ships
two variants: the original `2k` decoder (trained at 2048px) and a `2kto4k` decoder
(trained with multi-resolution data bucketing from 2048px to 3840px plus an SD3-style
dynamic shift, and intended for decoding 1024px LDM outputs to 4K). Pick the variant
at the CLI via `--pid_ckpt_type {2k,2kto4k}` (default: `2k`).
### Downloading
The released decoder weights and the encoder/decoder ("VAE") weights they
depend on are hosted at [`nvidia/PiD`](https://huggingface.co/nvidia/PiD) on
the Hugging Face Hub. Pull just the `checkpoints/` tree into this repo:
```bash
hf download nvidia/PiD --local-dir . --include "checkpoints/*"
```
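If you would rather fetch the weights from Python, the same download can be sketched with `huggingface_hub.snapshot_download`. This is a minimal sketch and not part of the repo: the helper names `download_checkpoints` and `list_checkpoint_files` are hypothetical, and the lazy import is only there so the listing helper works without `huggingface_hub` installed.

```python
from pathlib import Path

REPO_ID = "nvidia/PiD"        # Hub repo named in this README
ALLOW = ["checkpoints/*"]     # same filter as the --include flag above


def download_checkpoints(local_dir="."):
    """Fetch only the checkpoints/ tree (hypothetical helper, needs network)."""
    # Imported lazily so list_checkpoint_files() works without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir, allow_patterns=ALLOW)


def list_checkpoint_files(root="."):
    """Return every file under <root>/checkpoints as sorted POSIX-style paths."""
    root = Path(root)
    ckpt_dir = root / "checkpoints"
    return sorted(p.relative_to(root).as_posix() for p in ckpt_dir.rglob("*") if p.is_file())
```

After a download, `list_checkpoint_files(".")` gives a quick way to confirm that the `checkpoints/` tree landed in the repo root.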
## Running inference
PiD ships two complementary entry points per backbone:
| Backbone | `from_clean_*` (image → encode → PiD) | `from_ldm_*` (text/class → LDM → PiD) |
|----------|---------------------------------------|---------------------------------------|
| flux | `from_clean_flux.py` | `from_ldm_flux.py` |
| flux2 | `from_clean_flux2.py` | `from_ldm_flux2.py` |
| sd3 | `from_clean_sd3.py` | `from_ldm_sd3.py` |
| zimage | reuses `flux` | `from_ldm_zimage.py` |
| dinov2 | `from_clean_dinov2.py` | `from_ldm_dinov2.py` |
| siglip | `from_clean_siglip.py` | `from_ldm_siglip.py` |
All scripts live under `pid/_src/inference/` and decode each captured latent
twice — once with the backbone's native VAE (baseline) and once with PiD.
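The table above can be read as a small lookup from backbone and mode to a `python -m` module path. The sketch below is illustrative only: `entry_module` is a hypothetical helper, not something the repo provides, and it assumes "reuses `flux`" means the `zimage` backbone runs the `from_clean_flux` script.

```python
# Backbones listed in the table above.
BACKBONES = ("flux", "flux2", "sd3", "zimage", "dinov2", "siglip")
# Assumption: "reuses flux" means zimage has no from_clean script of its own.
FROM_CLEAN_ALIAS = {"zimage": "flux"}


def entry_module(backbone, mode):
    """Return the module path for a backbone and mode ("from_ldm" or "from_clean")."""
    if backbone not in BACKBONES:
        raise ValueError(f"unknown backbone: {backbone!r}")
    if mode not in ("from_ldm", "from_clean"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "from_clean":
        backbone = FROM_CLEAN_ALIAS.get(backbone, backbone)
    return f"pid._src.inference.{mode}_{backbone}"


print(entry_module("zimage", "from_clean"))  # pid._src.inference.from_clean_flux
print(entry_module("sd3", "from_ldm"))       # pid._src.inference.from_ldm_sd3
```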
> [!IMPORTANT]
> **Picking the checkpoint variant: `--pid_ckpt_type`**
>
> Every entry point accepts `--pid_ckpt_type {2k,2kto4k}` (default `2k`):
>
> - **`2k`**: the original 2048px-trained decoder.
> - **`2kto4k`**: the up-to-4K-resolution decoder. Available for `flux` / `flux2` / `sd3` / `zimage` only. It is worse than `2k` at 2048px resolution.
>
> For the exact checkpoint path for each backbone, see [docs/checkpoints.md](docs/checkpoints.md).
>
> A quick sanity check that the right variant loaded: when `2kto4k` is active you
> should see `PixelDiT dynamic shift: base_shift=4.0 base_image_size=1024` in the
> init log; for `2k` that line is absent. Both `2k` and `2kto4k` support non-square aspect ratios.
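The two rules above (which backbones offer `2kto4k`, and how to tell from the init log which variant loaded) can be sketched as plain checks. Both helpers are hypothetical and are not the repo's own validation code. The only log text relied on is the `PixelDiT dynamic shift:` prefix quoted above.

```python
# Backbones for which the README says the 2kto4k variant is available.
DIFFUSERS_BACKBONES = {"flux", "flux2", "sd3", "zimage"}
# Prefix of the init-log line that appears only when 2kto4k is active.
DYNAMIC_SHIFT_MARKER = "PixelDiT dynamic shift:"


def check_ckpt_type(backbone, ckpt_type):
    """Raise ValueError if the requested variant is not offered for this backbone."""
    if ckpt_type not in ("2k", "2kto4k"):
        raise ValueError(f"unknown checkpoint type: {ckpt_type!r}")
    if ckpt_type == "2kto4k" and backbone not in DIFFUSERS_BACKBONES:
        raise ValueError(f"2kto4k is not available for backbone {backbone!r}")


def loaded_variant(init_log):
    """Infer the loaded variant from captured init-log text."""
    return "2kto4k" if DYNAMIC_SHIFT_MARKER in init_log else "2k"
```

For example, piping a run's log into `loaded_variant` gives a cheap guard against silently decoding with the wrong checkpoint.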
### 📕 `from_ldm_*`: text / class → latent diffusion → PiD decode
Runs the corresponding latent-diffusion backbone on a prompt (or class id for
the class-conditional `dinov2` backbone), captures the intermediate `x_t` at
user-specified denoising steps (early LDM termination) and the final clean `x_0`, then decodes
each captured latent with both the native VAE / RAE decoder (baseline) and PiD.
For `flux` / `flux2` / `sd3` / `zimage` the LDM is a HuggingFace `diffusers`
pipeline (`FluxPipeline`, `Flux2Pipeline`, `StableDiffusion3Pipeline`,
`ZImagePipeline`).
For `dinov2` and `siglip` the LDM comes from the upstream
[RAE](https://github.com/bytetriper/RAE) (class-conditional ImageNet-512) or
[Scale-RAE](https://github.com/ZitengWangNYU/Scale-RAE) (text-conditional
256px) repo; see [`docs/dinov2_siglip.md`](docs/dinov2_siglip.md) for installation.
#### Example 1 — Single-GPU, single prompt (Flux, default `2k` decoder)
```bash
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
--prompt "A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic" \
--ldm_inference_steps 28 --save_xt_steps 24 \
--output_dir ./results/official_demo/flux \
--cfg_scale 1 --pid_inference_steps 4 --scale 4
```
#### Example 2 — Single-GPU, 4K decode (Flux, `2kto4k` decoder)
Same backbone as Example 1 but with `--resolution 1024 --pid_ckpt_type 2kto4k`,
so the LDM generates at 1024×1024 and PiD decodes the resulting latent to 4K
(4096×4096 with `--scale 4`).
```bash
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
--prompt "A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic" \
--resolution 1024 --pid_ckpt_type 2kto4k \
--ldm_inference_steps 28 --save_xt_steps 24 \
--output_dir ./results/official_demo/flux_4k \
--cfg_scale 1 --pid_inference_steps 4 --scale 4
```
#### Example 3 — Multi-GPU with a prompt file (Z-Image)
`torchrun` shards `--prompt_file` across ranks; each rank writes to
`--output_dir` independently.
```bash
PYTHONPATH=. torchrun --nproc_per_node=4 \
-m pid._src.inference.from_ldm_zimage \
--prompt_file pid/_src/inference/prompts/prompt_creative.txt \
--ldm_inference_steps 50 --save_xt_steps 46 \
--output_dir ./results/official_demo/zimage \
--cfg_scale 1 --pid_inference_steps 4 --scale 4
```
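To make the sharding concrete, here is a minimal sketch of one common way to split a prompt list across ranks. It assumes a strided split, which may differ from what the scripts actually do, and `shard_for_rank` is a hypothetical helper. The `RANK` and `WORLD_SIZE` environment variables are the ones `torchrun` exports to each process.

```python
import os


def shard_for_rank(items, rank, world_size):
    """Strided split: rank r takes items r, r + world_size, r + 2 * world_size, ..."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return items[rank::world_size]


# torchrun exports RANK and WORLD_SIZE; fall back to a single process otherwise.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

prompts = [f"prompt {i}" for i in range(10)]  # stand-in for the lines of --prompt_file
print(f"rank {rank}/{world_size} handles {shard_for_rank(prompts, rank, world_size)}")
```

With four ranks and ten prompts, the shards have sizes 3, 3, 2, and 2, and every prompt is handled exactly once.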
#### `dinov2` / `siglip` backbones
The upstream RAE / Scale-RAE LDMs don't live in `diffusers` — see
[`docs/dinov2_siglip.md`](docs/dinov2_siglip.md) for setup and end-to-end
examples.
#### Suggested step settings per diffusers backbone
(See each script's docstring for the exact recipe.)
| Backbone | LDM steps flag | Default steps | `--save_xt_steps` (example) | Best `--save_xt_steps` |
|----------|-------------------------|---------------|-----------------------------|----------------------|
| flux | `--ldm_inference_steps` | 28 | `22 24 26` | 24 |
| sd3 | `--ldm_inference_steps` | 28 | `22 24 26` | 24 |
| flux2 | `--ldm_inference_steps` | 50 | `44 46 48` | 46 |
| zimage | `--ldm_inference_steps` | 50 | `44 46 48` | 46 |
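The table can also be used to assemble a command line programmatically. The sketch below is a hypothetical helper, not part of the repo. It assumes every diffusers backbone accepts the same flags as the Flux examples above, and the resulting command should be run from the repo root with `PYTHONPATH=.` as in those examples.

```python
import shlex

# (default LDM steps, best --save_xt_steps), copied from the table above.
STEP_SETTINGS = {
    "flux": (28, 24),
    "sd3": (28, 24),
    "flux2": (50, 46),
    "zimage": (50, 46),
}


def from_ldm_command(backbone, prompt, output_dir):
    """Build an argv list for a from_ldm_* run using the suggested step settings."""
    steps, save_step = STEP_SETTINGS[backbone]
    return [
        "python", "-m", f"pid._src.inference.from_ldm_{backbone}",
        "--prompt", prompt,
        "--ldm_inference_steps", str(steps),
        "--save_xt_steps", str(save_step),
        "--output_dir", output_dir,
        "--cfg_scale", "1", "--pid_inference_steps", "4", "--scale", "4",
    ]


cmd = from_ldm_command("flux2", "a red fox in fresh snow", "./results/flux2")
print(shlex.join(cmd))  # shell-quoted, ready to paste into a terminal
```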
---
### 📗 `from_clean_*`: image → VAE encode → PiD decode
No latent diffusion model is run. The input image is encoded by the VAE,
optionally corrupted with Gaussian noise at each
sigma in `--degrade_sigmas`, and then decoded by PiD at `--scale` × `--input_resolution`.
Single-GPU example (Flux):
```bash
PYTHONPATH=. python -m pid._src.inference.from_clean_flux \
--manifest assets/clean_image_manifest.jsonl \
--input_resolution 512 \
--degrade_sigmas 0.0 \
--output_dir ./results/official_demo_from_clean/flux \
--cfg_scale 1 --pid_inference_steps 4 --scale 4
```
You can pass a single image with `--input_path` and a prompt with `--prompt`
instead of `--manifest`, and a sigma sweep such as `--degrade_sigmas 0.0 0.2 0.4 0.8`
to decode noise-corrupted latents.
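As an illustration of what a sigma sweep does to a latent, the sketch below blends a clean latent with unit Gaussian noise. The interpolation `(1 - sigma) * x0 + sigma * eps` is an assumption borrowed from rectified-flow models such as Flux and SD3. This README does not state the exact noising rule the scripts use, and `degrade_latent` is a hypothetical helper operating on a flat list instead of a tensor.

```python
import random


def degrade_latent(latent, sigma, rng):
    """Blend a flat latent with unit Gaussian noise: (1 - sigma) * x0 + sigma * eps."""
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must be in [0, 1]")
    return [(1.0 - sigma) * x + sigma * rng.gauss(0.0, 1.0) for x in latent]


rng = random.Random(0)      # fixed seed so the sweep is repeatable
clean = [0.5, -1.0, 2.0]    # toy stand-in for an encoded latent
for sigma in (0.0, 0.2, 0.4, 0.8):
    noisy = degrade_latent(clean, sigma, rng)
    print(sigma, [round(v, 3) for v in noisy])
```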
The `dinov2` / `siglip` `from_clean_*` flows take the same flags but with
different default resolutions and scales —
see [`docs/dinov2_siglip.md`](docs/dinov2_siglip.md).
### Common arguments
| Flag | Meaning |
|------|---------|
| `--pid_inference_steps`| Number of denoising steps for PiD (4 for the released distilled checkpoints) |
| `--scale` | PiD upscale factor (output = `baseline * scale`); 8 for Scale-RAE and 4 for other backbones |
| `--cfg_scale` | Classifier-free guidance scale for PiD |
| `--output_dir` | Where to write the side-by-side comparison images |
| `--seed` | Base random seed |
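The `--scale` row reduces to simple arithmetic: each side of the baseline decode is multiplied by the scale factor. A small sketch, with `pid_output_size` as a hypothetical helper name:

```python
def pid_output_size(baseline_hw, scale=4):
    """Output size implied by `output = baseline * scale`, per side."""
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    height, width = baseline_hw
    return (height * scale, width * scale)


print(pid_output_size((512, 512)))     # (2048, 2048)
print(pid_output_size((1024, 1024)))   # (4096, 4096)
print(pid_output_size((256, 256), 8))  # (2048, 2048), Scale-RAE at 256px with scale 8
print(pid_output_size((768, 1024)))    # (3072, 4096), non-square input
```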
Multi-GPU runs use `torchrun --nproc_per_node=N`; each rank processes a shard
of the prompts / manifest entries and writes to `--output_dir` independently.
## Repository layout
```
pid/_src/inference/
├── from_ldm_{flux,flux2,sd3,zimage,dinov2,siglip}.py # text/class → LDM → PiD decode
├── from_clean_{flux,flux2,sd3,dinov2,siglip}.py # image → encode → PiD decode
├── _demo_common.py # shared CLI + run loop for from_ldm_*
├── _demo_from_clean_common.py # shared CLI + run loop for from_clean_*
├── checkpoint_registry.py # backbone → PiD checkpoint mapping
├── pipeline_registry.py # diffusers backbone → HF pipeline mapping
├── rae_generation.py # DINOv2-RAE LDM helpers (from_ldm_dinov2)
├── scale_rae_generation.py # Scale-RAE LDM helpers (from_ldm_siglip)
└── prompts/ # prompt files for from_ldm_*
```
## License
The PiD codebase is licensed under the [Apache License 2.0](LICENSE).
## Contributing
See [`CONTRIBUTING.md`](CONTRIBUTING.md) for development setup, code style,
and the DCO sign-off requirement.
## Acknowledgments
The authors would like to acknowledge [Yongsheng Yu](https://www.yongshengyu.com/) and [Wei Xiong](https://wxiong.me/) for open-sourcing [PixelDiT](https://pixeldit.github.io/)'s model and weights, and thank Product Managers [Aditya Mahajan](https://www.linkedin.com/in/aditya-mahajan1) and [Matt Cragun](https://www.linkedin.com/in/mcragun/) for their valuable support and guidance.
## Citation
```bibtex
@article{lu2026pid,
title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
journal={arXiv preprint arXiv:2605.23902},
year={2026}
}
```