How to use YiYiXu/waypoint-1-small with Diffusers:
```bash
pip install -U diffusers transformers accelerate
```

```python
import torch
from diffusers import DiffusionPipeline

# switch to "mps" for apple devices
pipe = DiffusionPipeline.from_pretrained("YiYiXu/waypoint-1-small", dtype=torch.bfloat16, device_map="cuda")

prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt).images[0]
```
library_name: diffusers
tags:
- modular-diffusers
- diffusers
- world_engine
- text-to-image

This is a modular diffusion pipeline built with 🧨 Diffusers' modular pipeline framework.

**Pipeline Type**: WorldEngineBlocks

**Description**:

This pipeline uses a 5-block architecture that can be customized and extended.

## Example Usage

[TODO]

## Pipeline Architecture

This modular pipeline is composed of the following blocks:

1. **text_encoder** (`WorldEngineTextEncoderStep`)
   - Text Encoder step that generates text embeddings to guide frame generation
2. **controller_encoder** (`WorldEngineControllerEncoderStep`)
   - Controller Encoder step that encodes mouse, button, and scroll inputs for conditioning
3. **before_denoise** (`WorldEngineBeforeDenoiseStep`)
   - Before denoise step that prepares inputs for denoising:
     - *set_timesteps*: `WorldEngineSetTimestepsStep`
       - Sets up scheduler sigmas for rectified flow denoising
     - *setup_kv_cache*: `WorldEngineSetupKVCacheStep`
       - Initializes or reuses KV cache for autoregressive frame generation
     - *prepare_latents*: `WorldEnginePrepareLatentsStep`
       - Prepares latents for frame generation. If an image is provided on the first frame, encodes it and caches it as context. Always creates fresh random noise for the actual denoising.
4. **denoise** (`WorldEngineDenoiseLoop`)
   - Denoises latents using rectified flow (`x = x + dsigma * v`) and updates KV cache for autoregressive generation.
5. **decode** (`WorldEngineDecodeStep`)
   - Decodes denoised latents to RGB image using the VAE decoder
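
The denoise step's update rule `x = x + dsigma * v` can be illustrated with a small, self-contained sketch. This is not the WorldEngine implementation: `toy_velocity` is a hypothetical stand-in for the transformer, and it assumes the straight-line path `x_sigma = (1 - sigma) * data + sigma * noise`, whose exact velocity is `noise - data`. Only the sigma schedule is taken from the card's defaults.

```python
# Toy rectified-flow sampler: illustrates the update rule only,
# not the actual WorldEngine transformer or scheduler classes.
SIGMAS = [1.0, 0.94921875, 0.83984375, 0.0]  # default scheduler_sigmas


def toy_velocity(x, sigma, noise, data):
    """Hypothetical stand-in for the model.

    Returns the exact velocity dx/dsigma of the straight path
    x_sigma = (1 - sigma) * data + sigma * noise, which is noise - data.
    A real model would predict v from x and sigma; they are unused here.
    """
    return [n - d for n, d in zip(noise, data)]


def denoise(noise, data, sigmas=SIGMAS):
    x = list(noise)  # at sigma = 1.0 the sample is pure noise
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        dsigma = sigma_next - sigma  # negative: sigma decreases toward 0
        v = toy_velocity(x, sigma, noise, data)
        x = [xi + dsigma * vi for xi, vi in zip(x, v)]  # x = x + dsigma * v
    return x


data = [0.25, -0.5, 1.0]
noise = [1.5, 0.75, -2.0]
result = denoise(noise, data)
print(result)
```

With an exact velocity, the three Euler steps land on `data`, because the `dsigma` values sum to -1. A learned velocity is only approximate, so the choice of sigma schedule affects output quality.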

## Model Components

1. text_encoder (`UMT5EncoderModel`)
2. tokenizer (`AutoTokenizer`)
3. image_processor (`VaeImageProcessor`)
4. transformer (`AutoModel`)
5. vae (`AutoModel`)

## Configuration Parameters

- `n_buttons` (default: `256`)
- `scheduler_sigmas` (default: `[1.0, 0.94921875, 0.83984375, 0.0]`)
- `channels` (default: `16`)
- `height` (default: `16`)
- `width` (default: `16`)
- `patch` (default: `[2, 2]`)
- `vae_scale_factor` (default: `16`)
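
As a rough sketch of how these defaults might relate to one another, the arithmetic below derives a latent shape, a per-frame token count, and a decoded frame size. It rests on assumptions the card does not confirm: that `height` and `width` are latent-grid dimensions, that `patch` tiles the grid without overlap, and that the VAE upsamples each side by `vae_scale_factor`. The helper names are hypothetical.

```python
# Default configuration values copied from the card.
config = {
    "channels": 16,
    "height": 16,
    "width": 16,
    "patch": [2, 2],
    "vae_scale_factor": 16,
}


def latent_shape(cfg):
    # [batch, frames, C, H, W] with batch = frames = 1,
    # matching the [1, 1, C, H, W] layout given for `latents`.
    return (1, 1, cfg["channels"], cfg["height"], cfg["width"])


def tokens_per_frame(cfg):
    # Assumption: non-overlapping patches tile the latent grid.
    ph, pw = cfg["patch"]
    return (cfg["height"] // ph) * (cfg["width"] // pw)


def decoded_size(cfg):
    # Assumption: the VAE upsamples each latent cell by
    # vae_scale_factor along both spatial axes.
    s = cfg["vae_scale_factor"]
    return (cfg["height"] * s, cfg["width"] * s)


print(latent_shape(config))      # (1, 1, 16, 16, 16)
print(tokens_per_frame(config))  # 64
print(decoded_size(config))      # (256, 256)
```

A checkpoint's own config may override any of these defaults, so the printed numbers describe the defaults only.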

## Input/Output Specification

### Inputs

**Optional:**

- `prompt` (`Any`): The prompt or prompts to guide the frame generation
- `prompt_embeds` (`Tensor`): Pre-computed text embeddings
- `prompt_pad_mask` (`Tensor`): Padding mask for prompt embeddings
- `button` (`Set`), default: `set()`: Set of pressed button IDs
- `mouse` (`Tuple`), default: `(0.0, 0.0)`: Mouse velocity (x, y)
- `scroll` (`int`), default: `0`: Scroll wheel direction (-1, 0, 1)
- `button_tensor` (`Tensor`): One-hot encoded button tensor
- `mouse_tensor` (`Tensor`): Mouse velocity tensor
- `scroll_tensor` (`Tensor`): Scroll wheel sign tensor
- `scheduler_sigmas` (`List`): Custom scheduler sigmas (overrides config)
- `frame_timestamp` (`Tensor`): Current frame timestamp
- `kv_cache` (`Optional`): Existing KV cache (will be reused if provided)
- `reset_cache` (`bool`), default: `False`: If True, reset the KV cache even if one exists
- `image` (`Union`): Input image (PIL Image or [H, W, 3] uint8 tensor), only used on first frame
- `latents` (`Tensor`): Latent tensor for denoising [1, 1, C, H, W]. Only used if `use_random_latents=False`.
- `use_random_latents` (`bool`), default: `True`: If True, always generate fresh random latents. If False, use provided latents.
- `generator` (`Generator`): torch Generator for deterministic output
- `output_type` (`Any`), default: `pil`: The output format for the generated images (pil, latent, pt, or np)
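
To make the relationship between the raw controller inputs (`button`, `mouse`, `scroll`) and their tensor counterparts concrete, here is a minimal sketch using plain Python lists in place of tensors. `encode_controls` is a hypothetical helper, not part of the pipeline. The exact layout, dtype, and any normalization used by `WorldEngineControllerEncoderStep` are assumptions here, including the choice to set one slot per pressed button.

```python
N_BUTTONS = 256  # n_buttons default from the configuration


def encode_controls(button=frozenset(), mouse=(0.0, 0.0), scroll=0,
                    n_buttons=N_BUTTONS):
    """Hypothetical encoder for one frame of controller input.

    Returns (button_vec, mouse_vec, scroll_sign) as plain Python values.
    """
    for b in button:
        if not 0 <= b < n_buttons:
            raise ValueError(f"button id {b} outside [0, {n_buttons})")
    # One slot per button ID; pressed buttons are set to 1.0.
    button_vec = [1.0 if i in button else 0.0 for i in range(n_buttons)]
    # Mouse velocity is passed through as (x, y) floats.
    mouse_vec = [float(mouse[0]), float(mouse[1])]
    # Scroll is reduced to its sign: -1.0, 0.0, or 1.0.
    scroll_sign = float((scroll > 0) - (scroll < 0))
    return button_vec, mouse_vec, scroll_sign


button_vec, mouse_vec, scroll_sign = encode_controls(
    button={17, 32}, mouse=(0.5, -0.25), scroll=-3
)
print(sum(button_vec), mouse_vec, scroll_sign)
```

Passing the pre-encoded `button_tensor`, `mouse_tensor`, and `scroll_tensor` inputs instead would presumably skip this conversion, though the card does not say so explicitly.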

### Outputs

- `prompt_embeds` (`Tensor`): Text embeddings used to guide frame generation
- `prompt_pad_mask` (`Tensor`): Padding mask for prompt embeddings
- `button_tensor` (`Tensor`): One-hot encoded button tensor
- `mouse_tensor` (`Tensor`): Mouse velocity tensor
- `scroll_tensor` (`Tensor`): Scroll wheel sign tensor
- `scheduler_sigmas` (`Tensor`): Tensor of scheduler sigmas for denoising
- `frame_timestamp` (`Tensor`): Current frame timestamp
- `kv_cache` (`StaticKVCache`): KV cache for transformer attention
- `latents` (`Tensor`): Latent tensor for denoising [1, 1, C, H, W]
- `images` (`Union`): Decoded RGB image in requested output format
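
Because `kv_cache` appears in both the inputs and the outputs, one plausible usage pattern is to feed each frame's returned cache into the next call and set `reset_cache=True` to start a new sequence. The sketch below mimics only that control flow. A dictionary stands in for `StaticKVCache`, and `setup_kv_cache` and `generate_frame` are hypothetical names, not pipeline functions.

```python
def setup_kv_cache(kv_cache=None, reset_cache=False):
    """Return (cache, created): reuse an existing cache unless a reset
    is requested or none exists yet."""
    if kv_cache is None or reset_cache:
        return {"frames": []}, True  # dict is a stand-in for StaticKVCache
    return kv_cache, False


def generate_frame(state, frame_index, reset_cache=False):
    cache, created = setup_kv_cache(state.get("kv_cache"), reset_cache)
    # Stand-in for the denoise loop writing this frame into the cache.
    cache["frames"].append(frame_index)
    return {"kv_cache": cache, "created": created}


state = {}
log = []
for i in range(3):
    # Frame 2 requests a reset, so earlier context is discarded.
    state = generate_frame(state, i, reset_cache=(i == 2))
    log.append((state["created"], list(state["kv_cache"]["frames"])))

print(log)
```

Here frame 0 creates a cache, frame 1 reuses it, and frame 2 starts over. In the real pipeline the cached context would hold attention keys and values rather than frame indices.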