# PiD: Pixel Diffusion Decoder

> **TL;DR:** PiD is a plug-and-play diffusion decoder that replaces VAE/RAE decoders, turning latent representations directly into super-resolved pixels in a single pass.

<p align="center">
  <img src="figures/teaser.jpg" alt="PiD teaser" width="100%">
</p>

https://github.com/user-attachments/assets/a556e2d4-5de5-4bcf-9daa-80f7ea6b2124

PiD reformulates the latent-to-pixel decoder as a conditional pixel-space diffusion
model, unifying decoding and upsampling into a single generative module.
It directly denoises in high-resolution pixel
space and produces a super-resolved image in one pass.

**[Paper](https://arxiv.org/abs/2605.23902), [Project Page](https://research.nvidia.com/labs/sil/projects/pid/), [Model Weights](https://huggingface.co/nvidia/PiD)**

[Yifan Lu](https://yifanlu0227.github.io/),
[Qi Wu](https://wilsoncernwq.github.io/),
[Jay Zhangjie Wu](https://zhangjiewu.github.io/),
[Zian Wang](https://www.cs.toronto.edu/~zianwang/),
[Huan Ling](https://www.cs.toronto.edu/~linghuan/),
[Sanja Fidler](https://www.cs.utoronto.ca/~fidler/),
[Xuanchi Ren](https://xuanchiren.com/) <br>

## Installation

> [!TIP]
> **Quick Start:** If your environment already has PyTorch (with CUDA), `transformers>=4.57.x`, and `diffusers>=0.37`, you don't need to build a new conda env. Just install the small set of utility deps that the inference code imports eagerly and you're ready to run the diffusers backbones (`flux`/`flux2`/`sd3`/`zimage`):
>
> ```bash
> pip install hydra-core==1.3.2 omegaconf==2.3.0 \
>     attrs einops loguru termcolor fvcore iopath pynvml wandb \
>     imageio opencv-python-headless pandas \
>     safetensors "huggingface-hub>=1.0" sentencepiece boto3 botocore
> pip install -e .
> ```
>
> For the `dinov2` / `siglip` backbones you additionally need the upstream RAE / Scale-RAE repos plus a couple of extra packages; see [docs/dinov2_siglip.md](docs/dinov2_siglip.md).

Full conda-managed install (preferred if you're starting from scratch):

```bash
# 1. Create and activate the conda environment.
conda env create -f environment.yml
conda activate pid

# 2. Install this package in editable mode.
pip install -e .
```

## Checkpoints and assets

Pretrained PiD checkpoints live under `checkpoints/`. Each diffusers backbone ships
two variants: the original `2k` decoder (trained at 2048px) and a `2kto4k` decoder
(trained with multi-resolution data bucketing 2048→3840 plus an SD3-style dynamic
shift, intended for 1024 LDM → 4K decoding). Pick the variant at the CLI via
`--pid_ckpt_type {2k,2kto4k}` (default: `2k`).

### Downloading

The released decoder weights and the encoder/decoder ("VAE") weights they
depend on are hosted at [`nvidia/PiD`](https://huggingface.co/nvidia/PiD) on
the Hugging Face Hub. Pull just the `checkpoints/` tree into this repo:

```bash
hf download nvidia/PiD --local-dir . --include "checkpoints/*"
```
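
If you prefer to script the download from Python, the same filter can probably be expressed with `huggingface_hub.snapshot_download`. This is a minimal sketch, not part of the PiD codebase: the helper name `download_pid_checkpoints` is hypothetical, and it assumes `allow_patterns` behaves like the CLI's `--include` filter.

```python
def download_pid_checkpoints(local_dir: str = ".") -> str:
    """Fetch only the checkpoints/ tree of nvidia/PiD into local_dir.

    Hypothetical helper (not shipped with PiD); calling it needs network access.
    """
    # Imported lazily so that defining the helper does not require the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="nvidia/PiD",
        local_dir=local_dir,
        allow_patterns=["checkpoints/*"],  # same glob as the CLI's --include
    )
```

Calling `download_pid_checkpoints(".")` from the repo root should leave the files under `checkpoints/`, as with the CLI command above.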

## Running inference

PiD ships two complementary entry points per backbone:

| Backbone | `from_clean_*` (image → encode → PiD) | `from_ldm_*` (text/class → LDM → PiD) |
|----------|---------------------------------------|---------------------------------------|
| flux     | `from_clean_flux.py`    | `from_ldm_flux.py`    |
| flux2    | `from_clean_flux2.py`   | `from_ldm_flux2.py`   |
| sd3      | `from_clean_sd3.py`     | `from_ldm_sd3.py`     |
| zimage   | reuses `flux`           | `from_ldm_zimage.py`  |
| dinov2   | `from_clean_dinov2.py`  | `from_ldm_dinov2.py`  |
| siglip   | `from_clean_siglip.py`  | `from_ldm_siglip.py`  |

All scripts live under `pid/_src/inference/` and decode each captured latent
twice: once with the backbone's native VAE (baseline) and once with PiD.

> [!IMPORTANT]
> **Picking the checkpoint variant: `--pid_ckpt_type`**
>
> Every entry point accepts `--pid_ckpt_type {2k,2kto4k}` (default `2k`):
>
> - **`2k`**: the original 2048px-trained decoder.
> - **`2kto4k`**: the up-to-4K-resolution decoder. Available for `flux` / `flux2` / `sd3` / `zimage` only. It performs worse than `2k` at 2048px resolution.
>
> For the exact checkpoint path for each backbone, see [docs/checkpoints.md](docs/checkpoints.md).
>
> A quick sanity check that the right variant loaded: when `2kto4k` is active you
> should see `PixelDiT dynamic shift: base_shift=4.0 base_image_size=1024` in the
> init log; for `2k` that line is absent. Both `2k` and `2kto4k` support non-square aspect ratios.

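The sanity check above can be automated if you capture the init log to a string or file. The sketch below is illustrative only: the helper name `detect_ckpt_variant` is hypothetical, and the only thing taken from this README is the marker text itself; everything else about the log format is an assumption.

```python
import re

# Marker text quoted in this README; the rest of the log line is assumed.
DYNAMIC_SHIFT_MARKER = re.compile(
    r"PixelDiT dynamic shift: base_shift=(?P<shift>[\d.]+) "
    r"base_image_size=(?P<size>\d+)"
)


def detect_ckpt_variant(log_text: str) -> str:
    """Return '2kto4k' if the dynamic-shift marker appears in the log, else '2k'."""
    return "2kto4k" if DYNAMIC_SHIFT_MARKER.search(log_text) else "2k"


# Two made-up log excerpts, used only to exercise the helper.
log_with_marker = "INFO | PixelDiT dynamic shift: base_shift=4.0 base_image_size=1024\n"
log_without_marker = "INFO | loading checkpoint\n"
print(detect_ckpt_variant(log_with_marker))
print(detect_ckpt_variant(log_without_marker))
```
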
### 📕 `from_ldm_*`: text / class → latent diffusion → PiD decode

Runs the corresponding latent-diffusion backbone on a prompt (or class id for
the class-conditional `dinov2` backbone), captures the intermediate `x_t` at
user-specified denoising steps (early LDM termination) and the final clean `x_0`, then decodes
each captured latent with both the native VAE / RAE decoder (baseline) and PiD.

For `flux` / `flux2` / `sd3` / `zimage` the LDM is a HuggingFace `diffusers`
pipeline (`FluxPipeline`, `Flux2Pipeline`, `StableDiffusion3Pipeline`,
`ZImagePipeline`).

For `dinov2` and `siglip` the LDM is the upstream
[RAE](https://github.com/bytetriper/RAE) (class-conditional ImageNet-512) or
[Scale-RAE](https://github.com/ZitengWangNYU/Scale-RAE) (text-conditional
256px) repo; see [`docs/dinov2_siglip.md`](docs/dinov2_siglip.md) for installation.

#### Example 1: Single-GPU, single prompt (Flux, default `2k` decoder)

```bash
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
    --prompt "A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic" \
    --ldm_inference_steps 28 --save_xt_steps 24 \
    --output_dir ./results/official_demo/flux \
    --cfg_scale 1 --pid_inference_steps 4 --scale 4
```

#### Example 2: Single-GPU, 4K decode (Flux, `2kto4k` decoder)

Same backbone as Example 1 but with `--resolution 1024 --pid_ckpt_type 2kto4k`,
so the LDM produces a 1024² latent and PiD decodes it to 4K.

```bash
PYTHONPATH=. python -m pid._src.inference.from_ldm_flux \
    --prompt "A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic" \
    --resolution 1024 --pid_ckpt_type 2kto4k \
    --ldm_inference_steps 28 --save_xt_steps 24 \
    --output_dir ./results/official_demo/flux_4k \
    --cfg_scale 1 --pid_inference_steps 4 --scale 4
```

#### Example 3: Multi-GPU with a prompt file (Z-Image)

`torchrun` shards `--prompt_file` across ranks; each rank writes to
`--output_dir` independently.

```bash
PYTHONPATH=. torchrun --nproc_per_node=4 \
    -m pid._src.inference.from_ldm_zimage \
    --prompt_file pid/_src/inference/prompts/prompt_creative.txt \
    --ldm_inference_steps 50 --save_xt_steps 46 \
    --output_dir ./results/official_demo/zimage \
    --cfg_scale 1 --pid_inference_steps 4 --scale 4
```
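
How the scripts split the prompt file across ranks is not spelled out here. The sketch below shows one common scheme, round-robin sharding by rank, purely as an illustration: `shard_for_rank` is a hypothetical helper, not the repo's implementation. The only fact relied on is that `torchrun` exports `RANK` and `WORLD_SIZE` to each process.

```python
import os


def shard_for_rank(items, rank, world_size):
    """Return the round-robin slice of items owned by this rank."""
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} out of range for world_size {world_size}")
    # Rank r takes items r, r + world_size, r + 2 * world_size, ...
    return items[rank::world_size]


# torchrun sets RANK and WORLD_SIZE; fall back to a single process otherwise.
rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))

prompts = [f"prompt {i}" for i in range(10)]  # stand-in for lines of --prompt_file
print(rank, shard_for_rank(prompts, rank, world_size))
```

Under this scheme every prompt is handled by exactly one rank, which is why the ranks can write to the same `--output_dir` without coordinating, provided their output file names do not collide.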

#### `dinov2` / `siglip` backbones

The upstream RAE / Scale-RAE LDMs don't live in `diffusers`; see
[`docs/dinov2_siglip.md`](docs/dinov2_siglip.md) for setup and end-to-end
examples.

#### Suggested step settings per diffusers backbone

(See each script's docstring for the exact recipe.)

| Backbone | LDM steps flag          | Default steps | `--save_xt_steps` (example) | Best `--save_xt_steps` |
|----------|-------------------------|---------------|-----------------------------|----------------------|
| flux     | `--ldm_inference_steps` | 28            | `22 24 26`         | 24  |
| sd3      | `--ldm_inference_steps` | 28            | `22 24 26`         | 24  |
| flux2    | `--ldm_inference_steps` | 50            | `44 46 48`         | 46  |
| zimage   | `--ldm_inference_steps` | 50            | `44 46 48`         | 46  |

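As a convenience, the table above can be turned into a small command builder. This is a sketch under assumptions: `build_from_ldm_command` and `SUGGESTED_STEPS` are hypothetical names, the remaining flags are copied from Example 1, and it assumes every diffusers backbone accepts `--prompt` the same way `from_ldm_flux` does.

```python
import shlex

# (LDM steps, best --save_xt_steps) copied from the table above.
SUGGESTED_STEPS = {
    "flux": (28, 24),
    "sd3": (28, 24),
    "flux2": (50, 46),
    "zimage": (50, 46),
}


def build_from_ldm_command(backbone, prompt, output_dir):
    """Assemble argv for a from_ldm_* run (hypothetical helper)."""
    ldm_steps, save_step = SUGGESTED_STEPS[backbone]
    return [
        "python", "-m", f"pid._src.inference.from_ldm_{backbone}",
        "--prompt", prompt,
        "--ldm_inference_steps", str(ldm_steps),
        "--save_xt_steps", str(save_step),
        "--output_dir", output_dir,
        # Remaining flags mirror Example 1.
        "--cfg_scale", "1", "--pid_inference_steps", "4", "--scale", "4",
    ]


# Print a shell-quoted command; run it from the repo root with PYTHONPATH=.
print(shlex.join(build_from_ldm_command("flux2", "a red fox in snow", "./results/flux2")))
```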
---
### 📗 `from_clean_*`: image → VAE encode → PiD decode

No latent diffusion model is run. The input image is encoded by the VAE,
optionally corrupted with Gaussian noise at each
sigma in `--degrade_sigmas`, then decoded by PiD at `--scale * input_resolution`.

Single-GPU example (Flux):

```bash
PYTHONPATH=. python -m pid._src.inference.from_clean_flux \
    --manifest assets/clean_image_manifest.jsonl \
    --input_resolution 512 \
    --degrade_sigmas 0.0 \
    --output_dir ./results/official_demo_from_clean/flux \
    --cfg_scale 1 --pid_inference_steps 4 --scale 4
```
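
The output size follows from `--scale * input_resolution`. A quick worked check is sketched below; the helper name `pid_output_size` is hypothetical, and applying the factor to each side independently is an assumption for non-square inputs.

```python
def pid_output_size(width, height, scale=4):
    """Output size in pixels when PiD upsamples each side by an integer scale."""
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    return width * scale, height * scale


print(pid_output_size(512, 512))    # (2048, 2048)
print(pid_output_size(1024, 1024))  # (4096, 4096)
```

With the command above (`--input_resolution 512 --scale 4`) this gives 2048 px per side.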

You can pass a single image with `--input_path` and a prompt with `--prompt`
instead of `--manifest`, and a sigma sweep such as `--degrade_sigmas 0.0 0.2 0.4 0.8`
to decode noise-corrupted latents.

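To make the role of sigma concrete, here is a toy sketch of corrupting a latent at several noise levels. The blending rule is an assumption (the rectified-flow style interpolation `(1 - sigma) * x0 + sigma * noise`), not a statement of what the PiD scripts do, and `degrade_latent` is a hypothetical helper that works on a flat list of floats rather than a real tensor.

```python
import random


def degrade_latent(latent, sigma, seed=0):
    """Blend a flat latent (list of floats) with Gaussian noise at level sigma."""
    if not 0.0 <= sigma <= 1.0:
        raise ValueError("sigma must be in [0, 1]")
    rng = random.Random(seed)
    # Assumed convention: x_sigma = (1 - sigma) * x_0 + sigma * eps, eps ~ N(0, 1).
    return [(1.0 - sigma) * x + sigma * rng.gauss(0.0, 1.0) for x in latent]


latent = [0.5, -1.0, 2.0]  # made-up values standing in for an encoded image
for sigma in (0.0, 0.2, 0.4, 0.8):  # same sweep as the --degrade_sigmas example
    print(sigma, degrade_latent(latent, sigma))
```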
The `dinov2` / `siglip` `from_clean_*` flows take the same flags but with
different default resolutions and scales;
see [`docs/dinov2_siglip.md`](docs/dinov2_siglip.md).

### Common arguments

| Flag | Meaning |
|------|---------|
| `--pid_inference_steps`| Number of denoising steps for PiD (4 for the released distilled checkpoints) |
| `--scale`              | PiD upscale factor (output = `baseline * scale`); 8 for Scale-RAE and 4 for other backbones |
| `--cfg_scale`          | Classifier-free guidance scale for PiD |
| `--output_dir`         | Where to write the side-by-side comparison images |
| `--seed`               | Base random seed |

Multi-GPU runs use `torchrun --nproc_per_node=N`; each rank processes a shard
of the prompts / manifest entries and writes to `--output_dir` independently.

## Repository layout

```
pid/_src/inference/
├── from_ldm_{flux,flux2,sd3,zimage,dinov2,siglip}.py  # text/class → LDM → PiD decode
├── from_clean_{flux,flux2,sd3,dinov2,siglip}.py       # image → encode → PiD decode
├── _demo_common.py                                    # shared CLI + run loop for from_ldm_*
├── _demo_from_clean_common.py                         # shared CLI + run loop for from_clean_*
├── checkpoint_registry.py                             # backbone → PiD checkpoint mapping
├── pipeline_registry.py                               # diffusers backbone → HF pipeline mapping
├── rae_generation.py                                  # DINOv2-RAE LDM helpers (from_ldm_dinov2)
├── scale_rae_generation.py                            # Scale-RAE LDM helpers (from_ldm_siglip)
└── prompts/                                           # prompt files for from_ldm_*
```

## License

The PiD codebase is licensed under the [Apache License 2.0](LICENSE).

## Contributing

See [`CONTRIBUTING.md`](CONTRIBUTING.md) for development setup, code style,
and the DCO sign-off requirement.

## Acknowledgments

The authors would like to acknowledge [Yongsheng Yu](https://www.yongshengyu.com/) and [Wei Xiong](https://wxiong.me/) for open-sourcing [PixelDiT](https://pixeldit.github.io/)'s model and weights, and thank Product Managers [Aditya Mahajan](https://www.linkedin.com/in/aditya-mahajan1) and [Matt Cragun](https://www.linkedin.com/in/mcragun/) for their valuable support and guidance.


## Citation

```bibtex
@article{lu2026pid,
    title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
    author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
    journal={arXiv preprint arXiv:2605.23902},
    year={2026}
}
```