| # AGENTS.md | |
| This file provides guidance to AI coding assistants (Claude, Cursor, etc.) when working with code in this repository. | |
| ## Project Overview | |
| **LTX Trainer** is a training toolkit for fine-tuning the Lightricks LTX audio-video generation models. It supports: | |
| - **Text-to-video (T2V)** - Generate video from text prompts | |
| - **Text-to-audio (T2A)** - Generate audio from text prompts | |
| - **Image-to-video (I2V)** - Generate video conditioned on a first frame | |
| - **Video extension** - Forward (prefix) and backward (suffix) video continuation | |
| - **Video inpainting** - Mask-based spatial/temporal inpainting | |
| - **Video outpainting** - Spatial crop-based outpainting | |
| - **IC-LoRA video-to-video** - In-context control adapters for style/structure transfer | |
| - **Audio-to-video (A2V)** and **Video-to-audio (V2A)** - Cross-modal generation with frozen conditioning | |
| - **Audio extension** - Forward (prefix) and backward (suffix) audio continuation | |
| - **Audio inpainting** - Mask-based audio inpainting | |
| - **IC-LoRA audio-to-audio (A2A)** - Audio reference conditioning for style transfer | |
| - **AV2AV IC-LoRA** - Combined video and audio reference conditioning | |
| - **LoRA training** - Efficient fine-tuning with adapters | |
| - **Full fine-tuning** - Complete model training | |
| All conditioning scenarios are expressed through the unified `FlexibleStrategy` configuration. | |
| **Supported model versions:** | |
| - **LTX-2** (19B, initial audio-video model) | |
| - **LTX-2.3** (22B, improved text conditioning and audio quality) | |
| Version detection is fully automatic — ltx-core reads the checkpoint config and selects the correct architecture | |
| components. The trainer does not need version-specific code paths. | |
| **Key Dependencies:** | |
| - **[`ltx-core`](../ltx-core/)** - Core model implementations (transformer, VAE, text encoder, scheduler) | |
| - **[`ltx-pipelines`](../ltx-pipelines/)** - Inference pipeline components | |
| > **Important:** This trainer only supports **LTX-2 and later** (audio-video models). The older LTXV (video-only) models | |
| > are not supported. | |
| ## Architecture Overview | |
| ### Package Structure | |
| ``` | |
| packages/ltx-trainer/ | |
| ├── src/ltx_trainer/ # Main training module | |
| │ ├── __init__.py # Logger setup, path config | |
| │ ├── config.py # Pydantic configuration models | |
| │ ├── config_display.py # Config pretty-printing | |
| │ ├── trainer.py # Main training orchestration with Accelerate | |
| │ ├── model_loader.py # Model loading using ltx-core | |
| │ ├── validation_runner.py # ValidationRunner — conditioned validation sampling | |
| │ ├── datasets.py # PrecomputedDataset, DummyDataset | |
| │ ├── training_strategies/ # Strategy pattern for different training modes | |
| │ │ ├── __init__.py # Factory function: get_training_strategy() | |
| │ │ ├── base_strategy.py # TrainingStrategy ABC, ModelInputs, TrainingStrategyConfigBase | |
| │ │ ├── flexible.py # FlexibleStrategy, FlexibleStrategyConfig [RECOMMENDED] | |
| │ │ ├── text_to_video.py # TextToVideoStrategy, TextToVideoConfig [DEPRECATED] | |
| │ │ └── video_to_video.py # VideoToVideoStrategy, VideoToVideoConfig [DEPRECATED] | |
| │ ├── timestep_samplers.py # Flow matching timestep sampling | |
| │ ├── gemma_8bit.py # 8-bit Gemma text encoder loading (bitsandbytes) | |
| │ ├── quantization.py # Transformer INT8/INT4/FP8 quantization | |
| │ ├── captioning.py # Video captioning utilities | |
| │ ├── video_utils.py # Video I/O and processing | |
| │ ├── gpu_utils.py # GPU memory helpers | |
| │ ├── hf_hub_utils.py # HuggingFace Hub integration | |
| │ ├── progress.py # Training progress display | |
| │ └── utils.py # Image I/O helpers | |
| ├── scripts/ # User-facing CLI tools | |
| │ ├── train.py # Main training script | |
| │ ├── process_dataset.py # Dataset preprocessing (latents + captions) | |
| │ ├── process_videos.py # Video latent encoding | |
| │ ├── process_captions.py # Text embedding computation | |
| │ ├── caption_videos.py # Automatic video captioning | |
| │ ├── decode_latents.py # Latent decoding for debugging | |
| │ ├── compute_reference.py # Generate IC-LoRA reference videos | |
| │ └── split_scenes.py # Scene detection and splitting | |
| ├── configs/ # Example training configurations | |
| │ ├── t2v_lora.yaml # Text-to-video LoRA | |
| │ ├── t2v_lora_low_vram.yaml # Text-to-video LoRA (low VRAM) | |
| │ ├── i2v_lora.yaml # Image-to-video LoRA | |
| │ ├── v2v_ic_lora.yaml # IC-LoRA video-to-video | |
| │ ├── a2v_lora.yaml # Audio-to-video LoRA | |
| │ ├── v2a_lora.yaml # Video-to-audio LoRA | |
| │ ├── video_extend_lora.yaml # Video extension (forward) | |
| │ ├── video_suffix_lora.yaml # Video extension (backward) | |
| │ ├── video_inpainting_lora.yaml # Video inpainting | |
| │ ├── video_outpainting_lora.yaml # Video outpainting | |
| │ ├── t2a_lora.yaml # Text-to-audio LoRA | |
| │ ├── audio_extend_lora.yaml # Audio extension (forward) | |
| │ ├── audio_suffix_lora.yaml # Audio extension (backward) | |
| │ ├── audio_inpainting_lora.yaml # Audio inpainting | |
| │ ├── a2a_ic_lora.yaml # Audio-to-audio IC-LoRA | |
| │ ├── av2av_ic_lora.yaml # AV2AV IC-LoRA | |
| │ └── accelerate/ # FSDP, DDP configs | |
| ├── tests/ # Pytest tests | |
| └── docs/ # Documentation | |
| ``` | |
| ### Key Architectural Patterns | |
| **Model Loading:** | |
| - `ltx_trainer.model_loader` provides component loaders using `ltx-core` | |
| - Individual loaders: `load_transformer()`, `load_video_vae_encoder()`, `load_video_vae_decoder()`, | |
| `load_text_encoder()`, `load_embeddings_processor()`, etc. | |
| - Combined loader: `load_model()` returns `LtxModelComponents` dataclass | |
| - Uses `SingleGPUModelBuilder` from ltx-core internally | |
| - Text encoder and embeddings processor are loaded separately (the text encoder only needs Gemma weights; the embeddings | |
| processor only needs the LTX checkpoint) | |
| - 8-bit text encoder loading via `gemma_8bit.py` (bitsandbytes) | |
| **Training Flow:** | |
| 1. Configuration loaded via Pydantic models in `config.py` | |
| 2. `LtxvTrainer` class orchestrates the training loop | |
| 3. Text encoder loaded on GPU → validation embeddings cached → heavy components unloaded (only `embeddings_processor` | |
| kept) | |
| 4. Each training step: embedding connectors applied → strategy prepares `ModelInputs` → transformer forward pass → | |
| strategy computes loss | |
| 5. Training strategies (`FlexibleStrategy`) handle mode-specific logic including conditioning, masking, and loss computation | |
| 6. Accelerate handles distributed training, mixed precision, and device placement | |
| 7. Data flows as precomputed latents through `PrecomputedDataset` | |
| **Model Interface (Modality-based):** | |
| ```python | |
| from ltx_core.model.transformer.modality import Modality | |
| video = Modality( | |
|     enabled=True, | |
|     latent=video_latents,  # [B, seq_len, 128] patchified latent tokens | |
|     sigma=sigma,  # [B,] current noise level (per-batch) | |
|     timesteps=video_timesteps,  # [B, seq_len] per-token timestep embeddings | |
|     positions=video_positions,  # [B, 3, seq_len, 2] positional coordinates | |
|     context=video_embeds,  # text conditioning embeddings | |
|     context_mask=None,  # optional attention mask for text context | |
| ) | |
| audio = Modality( | |
|     enabled=True, | |
|     latent=audio_latents, | |
|     sigma=sigma, | |
|     timesteps=audio_timesteps, | |
|     positions=audio_positions,  # [B, 1, seq_len, 2] | |
|     context=audio_embeds, | |
|     context_mask=None, | |
| ) | |
| # Forward pass returns predictions for both modalities | |
| video_pred, audio_pred = model(video=video, audio=audio, perturbations=None) | |
| ``` | |
| > **Note:** `Modality` is immutable (frozen dataclass). Use `dataclasses.replace()` to modify. | |
| **`sigma` vs `timesteps`:** These serve different roles. `timesteps` is per-token (e.g. `sigma * denoise_mask` — | |
| conditioning tokens get 0, noisy tokens get sigma). `sigma` is per-batch and is used for prompt AdaLN conditioning | |
| (LTX-2.3) and cross-modality (video↔audio) attention conditioning (both versions). | |
| **Configuration System:** | |
| - All config in `src/ltx_trainer/config.py` | |
| - Main class: `LtxTrainerConfig` | |
| - `TrainingStrategyConfig` - Union of `FlexibleStrategyConfig` | `TextToVideoConfig` (deprecated) | `VideoToVideoConfig` (deprecated) | |
| - `FlexibleStrategyConfig` - Unified strategy config with `video`/`audio` `ModalityConfig` blocks | |
| - `ModalityConfig` - Per-modality config: `is_generated`, `latents_dir`, `conditions` list | |
| - `ConditionConfig` - Discriminated union: `FirstFrameConditionConfig`, `PrefixConditionConfig`, `SuffixConditionConfig`, `SpatialCropConditionConfig`, `MaskConditionConfig`, `ReferenceConditionConfig` | |
| - `ValidationSample` - Per-sample validation config with `prompt`, `conditions`, optional `video_dims`/`seed` overrides | |
| - `ValidationCondition` - Discriminated union for validation conditions (first_frame, prefix, suffix, spatial_crop, mask, reference, video_to_audio, audio_to_video) | |
| - Uses Pydantic field validators and model validators | |
| - Config uses `extra="forbid"` — unknown fields cause validation errors | |
| - Config files in `configs/` directory | |
| ## LTX-2 vs LTX-2.3: Differences | |
| Both model versions share the same latent space interface (see [Latent Space Constants](#latent-space-constants)). | |
| The differences lie in how text conditioning and audio generation work. Version detection is automatic via checkpoint | |
| config — the trainer uses a unified API. | |
| | Component | LTX-2 (19B) | LTX-2.3 (22B) | | |
| |-----------------------|---------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------| | |
| | Feature extractor | `FeatureExtractorV1`: single `aggregate_embed`, same output for video and audio | `FeatureExtractorV2`: separate `video_aggregate_embed` + `audio_aggregate_embed`, per-token RMSNorm | | |
| | Caption projection | Inside the transformer (`caption_projection`) | Inside the feature extractor (before connector) | | |
| | Embeddings connectors | Same dimensions for video and audio | Separate dimensions (`AudioEmbeddings1DConnectorConfigurator`) | | |
| | Prompt AdaLN | Not present (`cross_attention_adaln=False`) | Active — modulates cross-attention to text using `sigma` | | |
| | Vocoder | HiFi-GAN (`Vocoder`) | BigVGAN v2 + bandwidth extension (`VocoderWithBWE`) | | |
| **How version detection works in ltx-core:** | |
| - **Feature extractor:** `_create_feature_extractor()` checks for V2 config keys (`caption_proj_before_connector`, | |
| etc.). Present → V2; absent → V1. | |
| - **Vocoder:** `VocoderConfigurator` checks for `config["vocoder"]["bwe"]`. Present → `VocoderWithBWE`; absent → | |
| `Vocoder`. | |
| - **Transformer:** `_build_caption_projections()` checks `caption_proj_before_connector`. True (V2) → no caption | |
| projection in transformer; False (V1) → caption projection created in transformer. | |
| - **Embeddings connectors:** `AudioEmbeddings1DConnectorConfigurator` reads `audio_connector_*` keys, falling back to | |
| video connector keys for V1 backward compatibility. | |
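The key-presence checks listed above can be pictured with plain dictionaries. This is only an illustrative sketch, not the ltx-core code: the function `detect_components` is hypothetical, and the flat layout of the config dictionary is an assumption (the real checkpoint config may nest these keys differently).

```python
from typing import Any


def detect_components(config: dict[str, Any]) -> dict[str, str]:
    """Choose component variants from config keys (illustrative only)."""
    # Assumption: a truthy caption_proj_before_connector marks a V2 checkpoint.
    is_v2 = bool(config.get("caption_proj_before_connector", False))
    # A "bwe" entry under "vocoder" selects the bandwidth-extension vocoder.
    has_bwe = "bwe" in config.get("vocoder", {})
    return {
        "feature_extractor": "FeatureExtractorV2" if is_v2 else "FeatureExtractorV1",
        "caption_projection_in": "feature_extractor" if is_v2 else "transformer",
        "vocoder": "VocoderWithBWE" if has_bwe else "Vocoder",
    }


# Hypothetical minimal configs standing in for real checkpoint metadata.
ltx2_like = {"vocoder": {}}
ltx23_like = {"caption_proj_before_connector": True, "vocoder": {"bwe": {}}}

print(detect_components(ltx2_like))
print(detect_components(ltx23_like))
```

Because every decision is driven by the checkpoint config, calling code never has to branch on a version string.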
| ## Text Encoder Pipeline | |
| The `GemmaTextEncoder` implements a 3-block pipeline: | |
| 1. **Block 1 — Gemma LLM:** Tokenizes text → runs through Gemma → extracts hidden states | |
| 2. **Block 2 — Feature extractor:** Hidden states → normalized features (V1: single stream duplicated for video/audio; | |
| V2: separate video and audio projections) | |
| 3. **Block 3 — Embeddings processor:** Features → embeddings connectors → final context embeddings for the transformer | |
| **Precomputed embeddings (offline):** `process_captions.py` runs Blocks 1+2 via `text_encoder.precompute()` and saves | |
| the results. Block 3 (connectors) is applied during training via | |
| `text_encoder.embeddings_processor.create_embeddings()`. | |
| **Precomputed embeddings formats:** | |
| - **New format** (from `precompute()`): saves `video_prompt_embeds`, `audio_prompt_embeds` (optional), | |
| `prompt_attention_mask` | |
| - **Legacy format** (from old `_preprocess_text()`): saves `prompt_embeds`, `prompt_attention_mask` | |
| The trainer handles both formats in `_training_step()`: if `video_prompt_embeds` is present, it uses the new format; | |
| otherwise, it duplicates `prompt_embeds` for both modalities (mirroring V1 behavior). | |
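The format dispatch described above might look roughly like the sketch below. The function name `select_prompt_features` is hypothetical, the batch is modelled as a plain dictionary, and the handling of a missing `audio_prompt_embeds` (returning `None`) is an assumption rather than the trainer's documented behavior.

```python
from typing import Any


def select_prompt_features(batch: dict[str, Any]) -> tuple[Any, Any, Any]:
    """Return (video_features, audio_features, attention_mask) for either format."""
    mask = batch["prompt_attention_mask"]
    if "video_prompt_embeds" in batch:
        # New format: separate streams; the audio entry is optional.
        return batch["video_prompt_embeds"], batch.get("audio_prompt_embeds"), mask
    # Legacy format: one stream, duplicated for both modalities (V1 behavior).
    legacy = batch["prompt_embeds"]
    return legacy, legacy, mask


# Strings stand in for tensors to keep the example dependency-free.
new_batch = {
    "video_prompt_embeds": "video",
    "audio_prompt_embeds": "audio",
    "prompt_attention_mask": "mask",
}
legacy_batch = {"prompt_embeds": "shared", "prompt_attention_mask": "mask"}

print(select_prompt_features(new_batch))
print(select_prompt_features(legacy_batch))
```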
| **After caching validation embeddings**, the trainer unloads heavy components to free VRAM: | |
| ```python | |
| self._text_encoder.model = None | |
| self._text_encoder.tokenizer = None | |
| self._text_encoder.feature_extractor = None | |
| # Only embeddings_processor (connectors) remains — used during training | |
| ``` | |
| ## Latent Space Constants | |
| These values are shared across all supported model versions: | |
| | Constant | Value | Where used | | |
| |------------------------------|----------------------------------|-----------------------------------------------------------| | |
| | Video latent channels | 128 | VAE encoder/decoder, patchifier, `VideoLatentShape` | | |
| | Spatial compression | 32× (H and W) | `SpatioTemporalScaleFactors.default()`, config validators | | |
| | Temporal compression | 8× | `SpatioTemporalScaleFactors.default()`, config validators | | |
| | Frame constraint | `frames % 8 == 1` | Config validators, validation runner | | |
| | Resolution constraint | Width and height divisible by 32 | Config validators, validation runner | | |
| | Audio latent channels | 8 | `AudioLatentShape`, audio patchifier | | |
| | Audio mel bins | 16 | `AudioLatentShape`, audio patchifier | | |
| | Patchified token dim (video) | 128 (`128 × 1 × 1 × 1`) | Transformer `in_channels` | | |
| | Patchified token dim (audio) | 128 (`8 × 16`) | Transformer `audio_in_channels` | | |
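To see how these constants combine, the sketch below derives latent dimensions and the video token count from pixel-space dimensions. The names `LatentDims` and `video_latent_dims` are hypothetical. The formula `(frames - 1) // 8 + 1` assumes the first frame maps to its own latent frame, which is consistent with the `frames % 8 == 1` constraint but is not stated explicitly in this document.

```python
from dataclasses import dataclass

SPATIAL_SCALE = 32   # spatial compression (H and W)
TEMPORAL_SCALE = 8   # temporal compression


@dataclass(frozen=True)
class LatentDims:
    frames: int
    height: int
    width: int

    @property
    def seq_len(self) -> int:
        # Patch size is 1 x 1 x 1, so each latent position is one token.
        return self.frames * self.height * self.width


def video_latent_dims(num_frames: int, height: int, width: int) -> LatentDims:
    """Map pixel-space video dimensions to latent dimensions."""
    if num_frames % TEMPORAL_SCALE != 1:
        raise ValueError(f"num_frames must satisfy frames % 8 == 1, got {num_frames}")
    if height % SPATIAL_SCALE or width % SPATIAL_SCALE:
        raise ValueError(f"height and width must be divisible by 32, got {height}x{width}")
    return LatentDims(
        frames=(num_frames - 1) // TEMPORAL_SCALE + 1,  # assumed mapping, see above
        height=height // SPATIAL_SCALE,
        width=width // SPATIAL_SCALE,
    )


dims = video_latent_dims(num_frames=121, height=512, width=768)
print(dims, dims.seq_len)
```

Under these assumptions a 121-frame clip at 768×512 yields a 16 × 16 × 24 latent grid, and each of its tokens carries 128 channels.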
| ## Development Commands | |
| ### Setup and Installation | |
| ```bash | |
| # From the repository root | |
| uv sync | |
| cd packages/ltx-trainer | |
| ``` | |
| ### Code Quality | |
| ```bash | |
| # Run ruff linting and formatting | |
| uv run ruff check . | |
| uv run ruff format . | |
| # Run pre-commit checks | |
| uv run pre-commit run --all-files | |
| ``` | |
| ### Running Tests | |
| ```bash | |
| cd packages/ltx-trainer | |
| uv run pytest | |
| ``` | |
| ### Running Training | |
| ```bash | |
| # Single GPU | |
| uv run python scripts/train.py configs/t2v_lora.yaml | |
| # Multi-GPU with Accelerate | |
| uv run accelerate launch scripts/train.py configs/t2v_lora.yaml | |
| ``` | |
| ## Testing Standards | |
| ### Structure | |
| - **Flat functions only** — use `def test_*()`, never `class Test*` with methods. Pytest collects standalone functions. | |
| - **Only test public interfaces** — never call private methods (`_method`) directly. Verify private behavior | |
| indirectly through the public API. | |
| ### What to Test | |
| - **Custom validators and business logic** — cross-field validators, domain constraints, error paths. These catch real | |
| bugs. | |
| - **Behavioral tests** — call the public method, verify the outputs have the right shape, values, and structure. One | |
| behavioral test is worth ten config-only tests. | |
| - **Edge cases and error paths** — boundary conditions, composed behaviors, expected exceptions. | |
| - **Contract tests** — required fields, rejected invalid inputs, safety mechanisms like `extra="forbid"`. | |
| ### What NOT to Test | |
| - **Pydantic storing a value** — `Foo(x=1); assert foo.x == 1` tests Pydantic, not your code. If a behavioral test | |
| already creates the same config and uses it, the config-only test adds nothing. | |
| - **Pydantic Literal defaults** — `assert config.type == "first_frame"` when `type` is `Literal["first_frame"]`. | |
| - **Pydantic default factories** — `assert config.conditions == []` when the field has `default_factory=list`. | |
| - **Tests already covered by behavioral tests** — if `test_prefix_conditioning` creates a valid `PrefixConditionConfig` | |
| and exercises it end-to-end, a separate `test_prefix_valid` that just creates the same config is redundant. | |
| - **Trivial instantiation tests** — `strategy = Strategy(config); assert strategy.config is not None` when every other | |
| test creates a strategy. | |
| ### Keeping Tests DRY | |
| - **Use helper functions** for repeated setup patterns (e.g., `_make_strategy(video=_video_modality(...))` instead of | |
| 6-8 lines of config/strategy creation per test). | |
| - **Use named constants** for test dimensions (e.g., `VIDEO_SEQ_LEN`, `TOKENS_PER_FRAME`) instead of magic numbers. | |
| - **Merge tests that share identical setup** — when 5+ tests call `prepare_training_inputs` with the exact same | |
| config and batch, each checking one assertion, merge them into one test that checks all assertions. Pytest reports | |
| the exact failing line anyway. | |
| - **Use `@pytest.mark.parametrize`** for the same logic tested with different inputs (e.g., valid/invalid values for | |
| a field). | |
| - **Use pytest fixtures** for shared batch data and test directories, but prefer explicit helper functions over | |
| fixtures for strategy/config creation (makes the test self-documenting). | |
| ## Code Standards | |
| ### Type Hints | |
| - **Always use type hints** for all function arguments and return values | |
| - Use Python 3.10+ syntax: `list[str]` not `List[str]`, `str | Path` not `Union[str, Path]` | |
| - Use `pathlib.Path` for file operations | |
| ### Class Methods | |
| - Mark methods as `@staticmethod` if they don't access instance or class state | |
| - Use `@classmethod` for alternative constructors | |
| ### AI/ML Specific | |
| - Use `@torch.inference_mode()` for inference (prefer over `@torch.no_grad()`) | |
| - Use `accelerator.device` for distributed compatibility | |
| - Support mixed precision (`bfloat16` via dtype parameters) | |
| - Use gradient checkpointing for memory-intensive training | |
| ### Logging | |
| - Use `from ltx_trainer import logger` for all messages | |
| - Avoid print statements in production code | |
| ## Important Files & Modules | |
| ### Configuration (CRITICAL) | |
| **`src/ltx_trainer/config.py`** - Master config definitions | |
| Key classes: | |
| - `LtxTrainerConfig` - Main configuration container | |
| - `ModelConfig` - Model paths, training mode (`lora` | `full`), checkpoint loading | |
| - `TrainingStrategyConfig` - Union of `FlexibleStrategyConfig` | `TextToVideoConfig` (deprecated) | `VideoToVideoConfig` (deprecated) | |
| - `FlexibleStrategyConfig` - Unified strategy config with `video`/`audio` `ModalityConfig` blocks | |
| - `ModalityConfig` - Per-modality config: `is_generated`, `latents_dir`, `conditions` list | |
| - `ConditionConfig` - Discriminated union: `FirstFrameConditionConfig`, `PrefixConditionConfig`, `SuffixConditionConfig`, `SpatialCropConditionConfig`, `MaskConditionConfig`, `ReferenceConditionConfig` | |
| - `ValidationSample` - Per-sample validation config with `prompt`, `conditions`, optional `video_dims`/`seed` overrides | |
| - `ValidationCondition` - Discriminated union for validation conditions (first_frame, prefix, suffix, spatial_crop, mask, reference, video_to_audio, audio_to_video) | |
| - `LoraConfig` - Rank, alpha, dropout, target modules | |
| - `OptimizationConfig` - Learning rate, batch size, gradient accumulation, scheduler, gradient checkpointing | |
| - `AccelerationConfig` - Mixed precision, quantization, 8-bit text encoder | |
| - `DataConfig` - Preprocessed data root, dataloader workers | |
| - `ValidationConfig` - Prompts, video dimensions, CFG/STG guidance, audio generation, inference steps | |
| - `CheckpointsConfig` - Save interval, retention, precision | |
| - `FlowMatchingConfig` - Timestep sampling mode and parameters | |
| - `HubConfig` - HuggingFace Hub push settings | |
| - `WandbConfig` - Weights & Biases logging | |
| **⚠️ When modifying config.py:** | |
| 1. Update ALL config files in `configs/` | |
| 2. Update `docs/configuration-reference.md` | |
| 3. Test that all configs remain valid | |
| ### Training Core | |
| **`src/ltx_trainer/trainer.py`** - Main training loop (`LtxvTrainer`) | |
| - Implements distributed training with Accelerate | |
| - Handles mixed precision, gradient accumulation, checkpointing | |
| - `_training_step()` applies embedding connectors then delegates to strategy | |
| - `_load_text_encoder_and_cache_embeddings()` loads the text encoder + embeddings processor, caches validation | |
| embeddings, then unloads the Gemma LLM (keeps only the embeddings processor connectors for training) | |
| - Uses training strategies for mode-specific logic | |
| **`src/ltx_trainer/training_strategies/`** - Strategy pattern | |
| - `base_strategy.py`: `TrainingStrategy` ABC, `ModelInputs` dataclass | |
| - `flexible.py`: FlexibleStrategy — unified conditioning framework (recommended) | |
| - `text_to_video.py`: TextToVideoStrategy (deprecated — use FlexibleStrategy) | |
| - `video_to_video.py`: VideoToVideoStrategy (deprecated — use FlexibleStrategy) | |
| Key methods each strategy implements: | |
| - `prepare_training_inputs()` - Convert batch to `ModelInputs` with `Modality` objects | |
| - `compute_loss()` - Calculate training loss (velocity prediction, MSE with masking) | |
| The strategy's **config** declares its data directories via `get_data_sources()` (single source of truth, used for both dataset wiring and existence validation). | |
| **`src/ltx_trainer/model_loader.py`** - Model loading | |
| Component loaders: | |
| - `load_transformer()` → `LTXModel` | |
| - `load_video_vae_encoder()` → `VideoEncoder` | |
| - `load_video_vae_decoder()` → `VideoDecoder` | |
| - `load_audio_vae_decoder()` → `AudioDecoder` | |
| - `load_vocoder()` → `Vocoder` or `VocoderWithBWE` (auto-detected) | |
| - `load_text_encoder(gemma_model_path)` → `GemmaTextEncoder` (pure Gemma LLM, no checkpoint needed) | |
| - `load_embeddings_processor(checkpoint_path)` → `EmbeddingsProcessor` (feature extractor + connectors) | |
| - `load_model()` → `LtxModelComponents` (convenience wrapper) | |
| **`src/ltx_trainer/validation_runner.py`** - Conditioned validation sampling | |
| - Manages the full validation lifecycle: embedding caching, media encoding, denoising, decoding | |
| - Supports all validation condition types: first_frame, prefix, suffix, spatial_crop, mask, reference, video_to_audio, audio_to_video | |
| - Handles frozen modality paths (sigma=0 for conditioning modality) | |
| - Builds conditioning items using ltx-core's `VideoConditionByLatentIndex`, `VideoConditionByReferenceLatent`, `VideoConditionByMask` | |
| - Optional side-by-side reference output for IC-LoRA validation | |
| **`src/ltx_trainer/timestep_samplers.py`** - Flow matching timestep sampling | |
| - `UniformTimestepSampler` - Uniform sampling in `[min, max]` | |
| - `ShiftedLogitNormalTimestepSampler` - Stretched shifted logit-normal distribution with: | |
| - Shift determined by sequence length (more noise at higher token counts) | |
| - Percentile stretching for better `[0, 1]` coverage | |
| - Uniform fallback (10% of samples) to prevent distribution collapse | |
| - Reflection around `eps` for numerical stability near zero | |
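The idea behind the shifted logit-normal sampler can be sketched with the standard library alone. This is a simplified illustration, not the trainer's implementation: it omits percentile stretching and reflection around `eps`, and the token anchors and shift range (`min_tokens`, `max_tokens`, `min_shift`, `max_shift`) are placeholder values chosen for the example. Only the 10% uniform fallback comes from the description above.

```python
import math
import random
from statistics import mean


def sample_timestep(
    seq_len: int,
    rng: random.Random,
    *,
    uniform_prob: float = 0.1,  # uniform fallback share from the text above
    min_tokens: int = 1024,     # placeholder anchors for the shift (assumption)
    max_tokens: int = 4096,
    min_shift: float = 0.95,    # placeholder shift range (assumption)
    max_shift: float = 2.05,
    std: float = 1.0,
) -> float:
    """Draw one timestep in [0, 1]; longer sequences are shifted toward more noise."""
    if rng.random() < uniform_prob:
        return rng.random()  # uniform fallback keeps the whole range covered
    frac = min(1.0, max(0.0, (seq_len - min_tokens) / (max_tokens - min_tokens)))
    shift = min_shift + frac * (max_shift - min_shift)
    # Logit-normal: pass a shifted Gaussian sample through a sigmoid.
    return 1.0 / (1.0 + math.exp(-rng.gauss(shift, std)))


rng = random.Random(0)
short_clip = [sample_timestep(1024, rng) for _ in range(2000)]
long_clip = [sample_timestep(4096, rng) for _ in range(2000)]
print(round(mean(short_clip), 2), round(mean(long_clip), 2))
```

In this sketch a larger shift moves probability mass toward 1, so longer token sequences are trained at higher noise levels on average.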
| **`src/ltx_trainer/gemma_8bit.py`** - 8-bit text encoder loading | |
| Bypasses ltx-core's standard loading path to enable bitsandbytes 8-bit quantization of the Gemma backbone. Manually | |
| constructs the `GemmaTextEncoder` with quantized model, feature extractor, and embeddings processor. | |
| ### Data | |
| **`src/ltx_trainer/datasets.py`** - Dataset handling | |
| - `PrecomputedDataset` loads pre-computed VAE latents and text embeddings | |
| - Supports video latents, audio latents, text embeddings, reference video latents, reference audio latents, video masks, and audio masks | |
| - Handles legacy patchified format `[seq_len, C]` → automatically unpatchifies to `[C, F, H, W]` | |
| - `DummyDataset` for benchmarking and minimal testing | |
| ## Common Development Tasks | |
| ### Agent-Assisted Training | |
| When a user asks to train, fine-tune, create a LoRA, or produce a custom LTX-2 model, use the repository skill at | |
| [`.claude/skills/train-model`](../../.claude/skills/train-model/SKILL.md). The skill is the orchestrator for dataset probing, mode selection, preprocessing, | |
| training launch, monitoring, and post-train validation; it treats `packages/ltx-trainer/docs/` as the source of truth. | |
| ### Adding a New Configuration Parameter | |
| 1. Add field to appropriate config class in `src/ltx_trainer/config.py` | |
| 2. Add validator if needed | |
| 3. Update ALL config files in `configs/` | |
| 4. Update `docs/configuration-reference.md` | |
| ### Implementing a New Training Strategy | |
| The `FlexibleStrategy` now covers all use cases (T2V, T2A, I2V, V2V, A2A, AV2AV, inpainting, outpainting, extension, A2V, V2A, IC-LoRA) through | |
| configuration alone. A new strategy is only needed for fundamentally different training paradigms that cannot be | |
| expressed via `ModalityConfig` + `ConditionConfig` combinations. | |
| If you do need a new strategy: | |
| 1. Create new file in `src/ltx_trainer/training_strategies/` | |
| 2. Create config class inheriting `TrainingStrategyConfigBase` and implement `get_data_sources()` | |
| 3. Create strategy class inheriting `TrainingStrategy` | |
| 4. Implement: `prepare_training_inputs()`, `compute_loss()` | |
| 5. Add to `__init__.py`: import, add to `TrainingStrategyConfig` union, update factory | |
| 6. Add discriminator tag to config.py's `TrainingStrategyConfig` | |
| 7. Create example config file in `configs/` | |
| ### Working with Modalities | |
| ```python | |
| from dataclasses import replace | |
| from ltx_core.model.transformer.modality import Modality | |
| # Create modality — all fields except enabled and masks are required | |
| video = Modality( | |
|     enabled=True, | |
|     latent=latents,  # [B, seq_len, 128] | |
|     sigma=sigma,  # [B,] — the per-batch noise level | |
|     timesteps=timesteps,  # [B, seq_len] — per-token (sigma * denoise_mask) | |
|     positions=positions,  # [B, 3, seq_len, 2] | |
|     context=context,  # text embeddings from embeddings_processor | |
|     context_mask=None, | |
| ) | |
| # Update (immutable — must use replace) | |
| video = replace(video, latent=new_latent, sigma=new_sigma, timesteps=new_timesteps) | |
| # Disable a modality | |
| audio = replace(audio, enabled=False) | |
| ``` | |
| ### Working with the Text Encoder | |
| ```python | |
| # Full forward pass (used for validation — runs all 3 blocks) | |
| video_embeds, audio_embeds, attention_mask = text_encoder(prompt) | |
| # Precompute features (used in process_captions.py — runs blocks 1+2 only) | |
| video_features, audio_features, attention_mask = text_encoder.precompute(prompt, padding_side="left") | |
| # Apply connectors during training (block 3 only) | |
| additive_mask = text_encoder._convert_to_additive_mask(attention_mask, video_features.dtype) | |
| video_embeds, audio_embeds, binary_mask = text_encoder.embeddings_processor.create_embeddings( | |
|     video_features, audio_features, additive_mask | |
| ) | |
| ``` | |
| ## Debugging Tips | |
| **Training Issues:** | |
| - Check logs first (rich logger provides context) | |
| - GPU memory: look for OOM errors and set `enable_gradient_checkpointing: true` | |
| - Distributed training: Check `accelerator.state` and device placement | |
| **Model Loading:** | |
| - Ensure `model_path` points to a local `.safetensors` file | |
| - Ensure `text_encoder_path` points to a Gemma model directory | |
| - URLs are NOT supported for model paths | |
| - For 8-bit loading: ensure `bitsandbytes` is installed | |
| **Configuration:** | |
| - Validation errors: Check validators in `config.py` | |
| - Unknown fields: Config uses `extra="forbid"` — all fields must be defined | |
| - FlexibleStrategy requires at least one modality with `is_generated: true` | |
| - Audio modality cannot use `first_frame` or `spatial_crop` conditions | |
| **Precomputed Data:** | |
| - Legacy data (`prompt_embeds`) works via backward-compat in `_training_step()` | |
| - New data (`video_prompt_embeds` + `audio_prompt_embeds`) is the expected format | |
| - Latents must be in `[C, F, H, W]` format (legacy `[seq_len, C]` is auto-converted) | |
| ## Key Constraints | |
| ### Frame Requirements | |
| Frames must satisfy `frames % 8 == 1`: | |
| - ✅ Valid: 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 121 | |
| - ❌ Invalid: 24, 32, 48, 64, 100 | |
| ### Resolution Requirements | |
| Width and height must be divisible by 32. | |
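A quick pre-flight check for these constraints could look like the sketch below. Both helpers are hypothetical and are not the validators in `config.py`; `snap_frames` simply rounds down to the nearest frame count that satisfies `frames % 8 == 1`.

```python
def is_valid_video_dims(width: int, height: int, num_frames: int) -> bool:
    """Check the resolution and frame-count constraints together."""
    return width % 32 == 0 and height % 32 == 0 and num_frames % 8 == 1


def snap_frames(num_frames: int) -> int:
    """Largest valid frame count that does not exceed num_frames (minimum 1)."""
    return max(1, (num_frames - 1) // 8 * 8 + 1)


for requested in (24, 100, 121):
    print(requested, "->", snap_frames(requested))

print(is_valid_video_dims(768, 512, 121))
print(is_valid_video_dims(770, 512, 121))
```

Rounding down rather than up avoids asking for more frames than a source clip actually contains.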
| ### Model Paths | |
| - Must be local paths (URLs not supported) | |
| - `model_path`: Path to `.safetensors` checkpoint | |
| - `text_encoder_path`: Path to Gemma model directory | |
| ### Platform Requirements | |
| - Linux required (uses `triton`, which is Linux-only) | |
| - CUDA GPU with 32GB+ VRAM recommended | |
| ## Reference: ltx-core Key Components | |
| ``` | |
| packages/ltx-core/src/ltx_core/ | |
| ├── model/ | |
| │ ├── transformer/ | |
| │ │ ├── model.py # LTXModel (diffusion transformer) | |
| │ │ ├── modality.py # Modality dataclass | |
| │ │ ├── transformer.py # BasicAVTransformerBlock | |
| │ │ ├── transformer_args.py # TransformerArgsPreprocessor (sigma → prompt AdaLN) | |
| │ │ ├── model_configurator.py # LTXModelConfigurator (version-aware) | |
| │ │ └── timestep_embedding.py # Timestep/sigma embedding | |
| │ ├── video_vae/ | |
| │ │ ├── video_vae.py # VideoEncoder, VideoDecoder | |
| │ │ └── model_configurator.py # VideoEncoderConfigurator, VideoDecoderConfigurator | |
| │ ├── audio_vae/ | |
| │ │ ├── audio_vae.py # AudioEncoder, AudioDecoder | |
| │ │ └── vocoder.py # Vocoder, VocoderWithBWE (output_sampling_rate) | |
| │ └── common/ # Shared model components | |
| ├── text_encoders/gemma/ | |
| │ ├── __init__.py # Exports: GemmaTextEncoder, GemmaTextEncoderConfigurator, | |
| │ │ # AV_GEMMA_TEXT_ENCODER_KEY_OPS, GEMMA_MODEL_OPS, | |
| │ │ # module_ops_from_gemma_root | |
| │ ├── encoders/ | |
| │ │ ├── base_encoder.py # GemmaTextEncoder (unified 3-block pipeline) | |
| │ │ └── encoder_configurator.py # GemmaTextEncoderConfigurator, _create_feature_extractor | |
| │ ├── feature_extractor.py # FeatureExtractorV1 (19B), FeatureExtractorV2 (22B) | |
| │ ├── embeddings_connector.py # Embeddings1DConnector, Embeddings1DConnectorConfigurator, | |
| │ │ # AudioEmbeddings1DConnectorConfigurator | |
| │ ├── embeddings_processor.py # EmbeddingsProcessor (wraps video + audio connectors) | |
| │ └── tokenizer.py # LTXVGemmaTokenizer | |
| ├── components/ | |
| │ ├── schedulers.py # LTX2Scheduler | |
| │ ├── diffusion_steps.py # EulerDiffusionStep | |
| │ ├── guiders.py # CFGGuider, STGGuider | |
| │ └── patchifiers.py # VideoLatentPatchifier, AudioPatchifier | |
| ├── conditioning/ # ConditioningItem, mask_utils, types | |
| ├── tools.py # VideoLatentTools, AudioLatentTools | |
| ├── loader/ | |
| │ ├── single_gpu_model_builder.py # SingleGPUModelBuilder | |
| │ ├── sft_loader.py # SafetensorsModelStateDictLoader | |
| │ └── sd_ops.py # Key remapping (SDOps) | |
| └── types.py # SpatioTemporalScaleFactors, VideoLatentShape, AudioLatentShape | |
| ``` | |