ltx-community/ltx2-trainer-src-v2 / packages /ltx-trainer /AGENTS.md

linoyts

9 days ago


	# AGENTS.md

	This file provides guidance to AI coding assistants (Claude, Cursor, etc.) when working with code in this repository.

	## Project Overview

	LTX Trainer is a training toolkit for fine-tuning the Lightricks LTX audio-video generation models. It supports:

	- Text-to-video (T2V) - Generate video from text prompts
	- Text-to-audio (T2A) - Generate audio from text prompts
	- Image-to-video (I2V) - Generate video conditioned on a first frame
	- Video extension - Forward (prefix) and backward (suffix) video continuation
	- Video inpainting - Mask-based spatial/temporal inpainting
	- Video outpainting - Spatial crop-based outpainting
	- IC-LoRA video-to-video - In-context control adapters for style/structure transfer
	- Audio-to-video (A2V) and Video-to-audio (V2A) - Cross-modal generation with frozen conditioning
	- Audio extension - Forward (prefix) and backward (suffix) audio continuation
	- Audio inpainting - Mask-based audio inpainting
	- IC-LoRA audio-to-audio (A2A) - Audio reference conditioning for style transfer
	- AV2AV IC-LoRA - Combined video and audio reference conditioning
	- LoRA training - Efficient fine-tuning with adapters
	- Full fine-tuning - Complete model training

	All conditioning scenarios are expressed through the unified `FlexibleStrategy` configuration.

	Supported model versions:

	- LTX-2 (19B, initial audio-video model)
	- LTX-2.3 (22B, improved text conditioning and audio quality)

	Version detection is fully automatic — ltx-core reads the checkpoint config and selects the correct architecture
	components. The trainer does not need version-specific code paths.

	Key Dependencies:

	- [`ltx-core`](../ltx-core/) - Core model implementations (transformer, VAE, text encoder, scheduler)
	- [`ltx-pipelines`](../ltx-pipelines/) - Inference pipeline components

	> Important: This trainer only supports LTX-2 and later (audio-video models). The older LTXV (video-only) models
	> are not supported.

	## Architecture Overview

	### Package Structure

	```
	packages/ltx-trainer/
	├── src/ltx_trainer/ # Main training module
	│   ├── __init__.py # Logger setup, path config
	│   ├── config.py # Pydantic configuration models
	│   ├── config_display.py # Config pretty-printing
	│   ├── trainer.py # Main training orchestration with Accelerate
	│   ├── model_loader.py # Model loading using ltx-core
	│   ├── validation_runner.py # ValidationRunner: conditioned validation sampling
	│   ├── datasets.py # PrecomputedDataset, DummyDataset
	│   ├── training_strategies/ # Strategy pattern for different training modes
	│   │   ├── __init__.py # Factory function: get_training_strategy()
	│   │   ├── base_strategy.py # TrainingStrategy ABC, ModelInputs, TrainingStrategyConfigBase
	│   │   ├── flexible.py # FlexibleStrategy, FlexibleStrategyConfig [RECOMMENDED]
	│   │   ├── text_to_video.py # TextToVideoStrategy, TextToVideoConfig [DEPRECATED]
	│   │   └── video_to_video.py # VideoToVideoStrategy, VideoToVideoConfig [DEPRECATED]
	│   ├── timestep_samplers.py # Flow matching timestep sampling
	│   ├── gemma_8bit.py # 8-bit Gemma text encoder loading (bitsandbytes)
	│   ├── quantization.py # Transformer INT8/INT4/FP8 quantization
	│   ├── captioning.py # Video captioning utilities
	│   ├── video_utils.py # Video I/O and processing
	│   ├── gpu_utils.py # GPU memory helpers
	│   ├── hf_hub_utils.py # HuggingFace Hub integration
	│   ├── progress.py # Training progress display
	│   └── utils.py # Image I/O helpers
	├── scripts/ # User-facing CLI tools
	│   ├── train.py # Main training script
	│   ├── process_dataset.py # Dataset preprocessing (latents + captions)
	│   ├── process_videos.py # Video latent encoding
	│   ├── process_captions.py # Text embedding computation
	│   ├── caption_videos.py # Automatic video captioning
	│   ├── decode_latents.py # Latent decoding for debugging
	│   ├── compute_reference.py # Generate IC-LoRA reference videos
	│   └── split_scenes.py # Scene detection and splitting
	├── configs/ # Example training configurations
	│   ├── t2v_lora.yaml # Text-to-video LoRA
	│   ├── t2v_lora_low_vram.yaml # Text-to-video LoRA (low VRAM)
	│   ├── i2v_lora.yaml # Image-to-video LoRA
	│   ├── v2v_ic_lora.yaml # IC-LoRA video-to-video
	│   ├── a2v_lora.yaml # Audio-to-video LoRA
	│   ├── v2a_lora.yaml # Video-to-audio LoRA
	│   ├── video_extend_lora.yaml # Video extension (forward)
	│   ├── video_suffix_lora.yaml # Video extension (backward)
	│   ├── video_inpainting_lora.yaml # Video inpainting
	│   ├── video_outpainting_lora.yaml # Video outpainting
	│   ├── t2a_lora.yaml # Text-to-audio LoRA
	│   ├── audio_extend_lora.yaml # Audio extension (forward)
	│   ├── audio_suffix_lora.yaml # Audio extension (backward)
	│   ├── audio_inpainting_lora.yaml # Audio inpainting
	│   ├── a2a_ic_lora.yaml # Audio-to-audio IC-LoRA
	│   ├── av2av_ic_lora.yaml # AV2AV IC-LoRA
	│   └── accelerate/ # FSDP, DDP configs
	├── tests/ # Pytest tests
	└── docs/ # Documentation
	```

	### Key Architectural Patterns

	Model Loading:

	- `ltx_trainer.model_loader` provides component loaders using `ltx-core`
	- Individual loaders: `load_transformer()`, `load_video_vae_encoder()`, `load_video_vae_decoder()`,
	`load_text_encoder()`, `load_embeddings_processor()`, etc.
	- Combined loader: `load_model()` returns `LtxModelComponents` dataclass
	- Uses `SingleGPUModelBuilder` from ltx-core internally
	- Text encoder and embeddings processor are loaded separately (the text encoder only needs Gemma weights; the embeddings
	processor only needs the LTX checkpoint)
	- 8-bit text encoder loading via `gemma_8bit.py` (bitsandbytes)

	Training Flow:

	1. Configuration loaded via Pydantic models in `config.py`
	2. `LtxvTrainer` class orchestrates the training loop
	3. Text encoder loaded on GPU → validation embeddings cached → heavy components unloaded (only `embeddings_processor`
	kept)
	4. Each training step: embedding connectors applied → strategy prepares `ModelInputs` → transformer forward pass →
	strategy computes loss
	5. Training strategies (`FlexibleStrategy`) handle mode-specific logic including conditioning, masking, and loss computation
	6. Accelerate handles distributed training, mixed precision, and device placement
	7. Data flows as precomputed latents through `PrecomputedDataset`

	Model Interface (Modality-based):

	```python
	from ltx_core.model.transformer.modality import Modality

	video = Modality(
	    enabled=True,
	    latent=video_latents,  # [B, seq_len, 128] patchified latent tokens
	    sigma=sigma,  # [B,] current noise level (per-batch)
	    timesteps=video_timesteps,  # [B, seq_len] per-token timestep embeddings
	    positions=video_positions,  # [B, 3, seq_len, 2] positional coordinates
	    context=video_embeds,  # text conditioning embeddings
	    context_mask=None,  # optional attention mask for text context
	)
	audio = Modality(
	    enabled=True,
	    latent=audio_latents,
	    sigma=sigma,
	    timesteps=audio_timesteps,
	    positions=audio_positions,  # [B, 1, seq_len, 2]
	    context=audio_embeds,
	    context_mask=None,
	)

	# Forward pass returns predictions for both modalities
	video_pred, audio_pred = model(video=video, audio=audio, perturbations=None)
	```

	> Note: `Modality` is immutable (frozen dataclass). Use `dataclasses.replace()` to create a modified copy.

	`sigma` vs `timesteps`: These serve different roles. `timesteps` is per-token (e.g. `sigma * denoise_mask`, so
	conditioning tokens get 0 and noisy tokens get sigma). `sigma` is per-batch and is used for prompt AdaLN
	conditioning (LTX-2.3) and cross-modality (video↔audio) attention conditioning (both versions).
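The relationship between the two fields can be sketched with plain Python lists standing in for tensors. The helper name `per_token_timesteps` is hypothetical, and the convention that `denoise_mask` is 1 for tokens being denoised and 0 for conditioning tokens is an assumption made for illustration; this is not ltx-core API.

```python
def per_token_timesteps(
    sigma: list[float], denoise_mask: list[list[float]]
) -> list[list[float]]:
    """Broadcast a per-batch sigma [B] over a per-token mask [B, seq_len]."""
    if len(sigma) != len(denoise_mask):
        raise ValueError("sigma and denoise_mask must share the batch dimension")
    # Conditioning tokens (mask 0) end up with timestep 0; noisy tokens get sigma.
    return [[s * m for m in row] for s, row in zip(sigma, denoise_mask)]


# One sample, four tokens: the first token is a clean conditioning token.
sigma = [0.7]
denoise_mask = [[0.0, 1.0, 1.0, 1.0]]
timesteps = per_token_timesteps(sigma, denoise_mask)
```

Here `sigma` stays a single value per sample, while `timesteps` carries one value per token.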

	Configuration System:

	- All config in `src/ltx_trainer/config.py`
	- Main class: `LtxTrainerConfig`
	- `TrainingStrategyConfig` - Union of `FlexibleStrategyConfig` | `TextToVideoConfig` (deprecated) | `VideoToVideoConfig` (deprecated)
	- `FlexibleStrategyConfig` - Unified strategy config with `video`/`audio` `ModalityConfig` blocks
	- `ModalityConfig` - Per-modality config: `is_generated`, `latents_dir`, `conditions` list
	- `ConditionConfig` - Discriminated union: `FirstFrameConditionConfig`, `PrefixConditionConfig`, `SuffixConditionConfig`, `SpatialCropConditionConfig`, `MaskConditionConfig`, `ReferenceConditionConfig`
	- `ValidationSample` - Per-sample validation config with `prompt`, `conditions`, optional `video_dims`/`seed` overrides
	- `ValidationCondition` - Discriminated union for validation conditions (first_frame, prefix, suffix, spatial_crop, mask, reference, video_to_audio, audio_to_video)
	- Uses Pydantic field validators and model validators
	- Config uses `extra="forbid"` — unknown fields cause validation errors
	- Config files in `configs/` directory

	## LTX-2 vs LTX-2.3: Differences

	Both model versions share the same latent space interface (see [Latent Space Constants](#latent-space-constants)).
	The differences lie in how text conditioning and audio generation work. Version detection is automatic via checkpoint
	config — the trainer uses a unified API.

	| Component | LTX-2 (19B) | LTX-2.3 (22B) |
	|-----------------------|---------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------|
	| Feature extractor | `FeatureExtractorV1`: single `aggregate_embed`, same output for video and audio | `FeatureExtractorV2`: separate `video_aggregate_embed` + `audio_aggregate_embed`, per-token RMSNorm |
	| Caption projection | Inside the transformer (`caption_projection`) | Inside the feature extractor (before connector) |
	| Embeddings connectors | Same dimensions for video and audio | Separate dimensions (`AudioEmbeddings1DConnectorConfigurator`) |
	| Prompt AdaLN | Not present (`cross_attention_adaln=False`) | Active: modulates cross-attention to text using `sigma` |
	| Vocoder | HiFi-GAN (`Vocoder`) | BigVGAN v2 + bandwidth extension (`VocoderWithBWE`) |

	How version detection works in ltx-core:

	- Feature extractor: `_create_feature_extractor()` checks for V2 config keys (`caption_proj_before_connector`,
	etc.). Present → V2; absent → V1.
	- Vocoder: `VocoderConfigurator` checks for `config["vocoder"]["bwe"]`. Present → `VocoderWithBWE`; absent →
	`Vocoder`.
	- Transformer: `_build_caption_projections()` checks `caption_proj_before_connector`. True (V2) → no caption
	projection in transformer; False (V1) → caption projection created in transformer.
	- Embeddings connectors: `AudioEmbeddings1DConnectorConfigurator` reads `audio_connector_*` keys, falling back to
	video connector keys for V1 backward compatibility.
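The key-presence checks listed above can be illustrated with a small, self-contained sketch. The function `describe_checkpoint` is hypothetical, and the flat dictionary layout is an assumption made for the example; the real checks live in the ltx-core configurators named above and may read the checkpoint config differently.

```python
from typing import Any


def describe_checkpoint(config: dict[str, Any]) -> dict[str, str]:
    """Pick component variants from key presence (illustrative only)."""
    # Assumption: V2 feature-extractor keys sit at the top level of the config.
    is_v2 = "caption_proj_before_connector" in config
    has_bwe = "bwe" in config.get("vocoder", {})
    proj_before_connector = bool(config.get("caption_proj_before_connector", False))
    return {
        "feature_extractor": "FeatureExtractorV2" if is_v2 else "FeatureExtractorV1",
        "vocoder": "VocoderWithBWE" if has_bwe else "Vocoder",
        "caption_projection": "feature extractor" if proj_before_connector else "transformer",
    }


v1_like = describe_checkpoint({"vocoder": {}})
v2_like = describe_checkpoint(
    {"caption_proj_before_connector": True, "vocoder": {"bwe": {}}}
)
```

Because every decision depends only on the checkpoint config, the trainer itself never has to branch on a version string.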

	## Text Encoder Pipeline

	The `GemmaTextEncoder` implements a 3-block pipeline:

	1. Block 1 — Gemma LLM: Tokenizes text → runs through Gemma → extracts hidden states
	2. Block 2 — Feature extractor: Hidden states → normalized features (V1: single stream duplicated for video/audio;
	V2: separate video and audio projections)
	3. Block 3 — Embeddings processor: Features → embeddings connectors → final context embeddings for the transformer

	Precomputed embeddings (offline): `process_captions.py` runs Blocks 1+2 via `text_encoder.precompute()` and saves
	the results. Block 3 (connectors) is applied during training via
	`text_encoder.embeddings_processor.create_embeddings()`.

	Precomputed embeddings formats:

	- New format (from `precompute()`): saves `video_prompt_embeds`, `audio_prompt_embeds` (optional),
	`prompt_attention_mask`
	- Legacy format (from old `_preprocess_text()`): saves `prompt_embeds`, `prompt_attention_mask`

	The trainer handles both formats in `_training_step()`: if `video_prompt_embeds` is present, it uses the new format;
	otherwise, it duplicates `prompt_embeds` for both modalities (mirroring V1 behavior).
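A minimal sketch of that format fallback, using a plain dictionary as the batch. The function name `select_prompt_features` is hypothetical, and returning `None` when the optional `audio_prompt_embeds` entry is missing is an assumption; the actual handling inside `_training_step()` may differ.

```python
from typing import Any


def select_prompt_features(batch: dict[str, Any]) -> tuple[Any, Any, Any]:
    """Return (video_features, audio_features, attention_mask) for either format."""
    mask = batch["prompt_attention_mask"]
    if "video_prompt_embeds" in batch:
        # New format: separate video features, audio features are optional.
        video = batch["video_prompt_embeds"]
        audio = batch.get("audio_prompt_embeds")  # assumption: None when absent
        return video, audio, mask
    # Legacy format: one tensor shared by both modalities (mirrors V1 behavior).
    legacy = batch["prompt_embeds"]
    return legacy, legacy, mask


new_batch = {
    "video_prompt_embeds": "video-features",
    "audio_prompt_embeds": "audio-features",
    "prompt_attention_mask": "mask",
}
legacy_batch = {"prompt_embeds": "shared-features", "prompt_attention_mask": "mask"}
```

Strings stand in for tensors here so the control flow is easy to follow.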

	After caching validation embeddings, the trainer unloads heavy components to free VRAM:

	```python
	self._text_encoder.model = None
	self._text_encoder.tokenizer = None
	self._text_encoder.feature_extractor = None
	# Only embeddings_processor (connectors) remains — used during training
	```

	## Latent Space Constants

	These values are shared across all supported model versions:

	| Constant | Value | Where used |
	|------------------------------|----------------------------------|-----------------------------------------------------------|
	| Video latent channels | 128 | VAE encoder/decoder, patchifier, `VideoLatentShape` |
	| Spatial compression | 32× (H and W) | `SpatioTemporalScaleFactors.default()`, config validators |
	| Temporal compression | 8× | `SpatioTemporalScaleFactors.default()`, config validators |
	| Frame constraint | `frames % 8 == 1` | Config validators, validation runner |
	| Resolution constraint | Width and height divisible by 32 | Config validators, validation runner |
	| Audio latent channels | 8 | `AudioLatentShape`, audio patchifier |
	| Audio mel bins | 16 | `AudioLatentShape`, audio patchifier |
	| Patchified token dim (video) | 128 (`128 × 1 × 1 × 1`) | Transformer `in_channels` |
	| Patchified token dim (audio) | 128 (`8 × 16`) | Transformer `audio_in_channels` |
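As a worked example of these constants, the sketch below derives the video latent grid and token count from pixel dimensions. The helper names are hypothetical, and the formula `(frames - 1) // 8 + 1` assumes the first frame maps to its own latent frame, which is consistent with the `frames % 8 == 1` constraint but is not quoted from ltx-core.

```python
SPATIAL_COMPRESSION = 32
TEMPORAL_COMPRESSION = 8


def video_latent_shape(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Return (latent_frames, latent_height, latent_width) for valid inputs."""
    if frames % TEMPORAL_COMPRESSION != 1:
        raise ValueError(f"frames must satisfy frames % 8 == 1, got {frames}")
    if height % SPATIAL_COMPRESSION or width % SPATIAL_COMPRESSION:
        raise ValueError("height and width must be divisible by 32")
    # Assumption: the first frame is encoded on its own, then 8 frames per latent.
    latent_frames = (frames - 1) // TEMPORAL_COMPRESSION + 1
    return latent_frames, height // SPATIAL_COMPRESSION, width // SPATIAL_COMPRESSION


def video_token_count(frames: int, height: int, width: int) -> int:
    """With 1 x 1 x 1 patches, every latent position becomes one 128-dim token."""
    f, h, w = video_latent_shape(frames, height, width)
    return f * h * w


shape = video_latent_shape(121, 512, 768)
tokens = video_token_count(121, 512, 768)
```

For a 121-frame clip at 768×512 this gives a 16 × 16 × 24 latent grid, or 6144 tokens.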

	## Development Commands

	### Setup and Installation

	```bash
	# From the repository root
	uv sync
	cd packages/ltx-trainer
	```

	### Code Quality

	```bash
	# Run ruff linting and formatting
	uv run ruff check .
	uv run ruff format .

	# Run pre-commit checks
	uv run pre-commit run --all-files
	```

	### Running Tests

	```bash
	cd packages/ltx-trainer
	uv run pytest
	```

	### Running Training

	```bash
	# Single GPU
	uv run python scripts/train.py configs/t2v_lora.yaml

	# Multi-GPU with Accelerate
	uv run accelerate launch scripts/train.py configs/t2v_lora.yaml
	```

	## Testing Standards

	### Structure

	- Flat functions only: write standalone `test_*` functions, never `Test*` classes with methods. Pytest collects standalone functions.
	- Only test public interfaces — never call private methods (`_method`) directly. Verify private behavior
	indirectly through the public API.

	### What to Test

	- Custom validators and business logic — cross-field validators, domain constraints, error paths. These catch real
	bugs.
	- Behavioral tests — call the public method, verify the outputs have the right shape, values, and structure. One
	behavioral test is worth ten config-only tests.
	- Edge cases and error paths — boundary conditions, composed behaviors, expected exceptions.
	- Contract tests — required fields, rejected invalid inputs, safety mechanisms like `extra="forbid"`.

	### What NOT to Test

	- Pydantic storing a value: `foo = Foo(x=1); assert foo.x == 1` tests Pydantic, not your code. If a behavioral test
	already creates the same config and uses it, the config-only test adds nothing.
	- Pydantic Literal defaults — `assert config.type == "first_frame"` when `type` is `Literal["first_frame"]`.
	- Pydantic default factories — `assert config.conditions == []` when the field has `default_factory=list`.
	- Tests already covered by behavioral tests — if `test_prefix_conditioning` creates a valid `PrefixConditionConfig`
	and exercises it end-to-end, a separate `test_prefix_valid` that just creates the same config is redundant.
	- Trivial instantiation tests — `strategy = Strategy(config); assert strategy.config is not None` when every other
	test creates a strategy.

	### Keeping Tests DRY

	- Use helper functions for repeated setup patterns (e.g., `_make_strategy(video=_video_modality(...))` instead of
	6-8 lines of config/strategy creation per test).
	- Use named constants for test dimensions (e.g., `VIDEO_SEQ_LEN`, `TOKENS_PER_FRAME`) instead of magic numbers.
	- Merge tests that share identical setup — when 5+ tests call `prepare_training_inputs` with the exact same
	config and batch, each checking one assertion, merge them into one test that checks all assertions. Pytest reports
	the exact failing line anyway.
	- Use `@pytest.mark.parametrize` for the same logic tested with different inputs (e.g., valid/invalid values for
	a field).
	- Use pytest fixtures for shared batch data and test directories, but prefer explicit helper functions over
	fixtures for strategy/config creation (makes the test self-documenting).

	## Code Standards

	### Type Hints

	- Always use type hints for all function arguments and return values
	- Use Python 3.10+ syntax: `list[str]` not `List[str]`, `str | Path` not `Union[str, Path]`
	- Use `pathlib.Path` for file operations

	### Class Methods

	- Mark methods as `@staticmethod` if they don't access instance or class state
	- Use `@classmethod` for alternative constructors

	### AI/ML Specific

	- Use `@torch.inference_mode()` for inference (prefer over `@torch.no_grad()`)
	- Use `accelerator.device` for distributed compatibility
	- Support mixed precision (`bfloat16` via dtype parameters)
	- Use gradient checkpointing for memory-intensive training

	### Logging

	- Use `from ltx_trainer import logger` for all messages
	- Avoid print statements in production code

	## Important Files & Modules

	### Configuration (CRITICAL)

	`src/ltx_trainer/config.py` - Master config definitions

	Key classes:

	- `LtxTrainerConfig` - Main configuration container
	- `ModelConfig` - Model paths, training mode (`lora` | `full`), checkpoint loading
	- `TrainingStrategyConfig` - Union of `FlexibleStrategyConfig` | `TextToVideoConfig` (deprecated) | `VideoToVideoConfig` (deprecated)
	- `FlexibleStrategyConfig` - Unified strategy config with `video`/`audio` `ModalityConfig` blocks
	- `ModalityConfig` - Per-modality config: `is_generated`, `latents_dir`, `conditions` list
	- `ConditionConfig` - Discriminated union: `FirstFrameConditionConfig`, `PrefixConditionConfig`, `SuffixConditionConfig`, `SpatialCropConditionConfig`, `MaskConditionConfig`, `ReferenceConditionConfig`
	- `ValidationSample` - Per-sample validation config with `prompt`, `conditions`, optional `video_dims`/`seed` overrides
	- `ValidationCondition` - Discriminated union for validation conditions (first_frame, prefix, suffix, spatial_crop, mask, reference, video_to_audio, audio_to_video)
	- `LoraConfig` - Rank, alpha, dropout, target modules
	- `OptimizationConfig` - Learning rate, batch size, gradient accumulation, scheduler, gradient checkpointing
	- `AccelerationConfig` - Mixed precision, quantization, 8-bit text encoder
	- `DataConfig` - Preprocessed data root, dataloader workers
	- `ValidationConfig` - Prompts, video dimensions, CFG/STG guidance, audio generation, inference steps
	- `CheckpointsConfig` - Save interval, retention, precision
	- `FlowMatchingConfig` - Timestep sampling mode and parameters
	- `HubConfig` - HuggingFace Hub push settings
	- `WandbConfig` - Weights & Biases logging

	⚠️ When modifying config.py:

	1. Update ALL config files in `configs/`
	2. Update `docs/configuration-reference.md`
	3. Test that all configs remain valid

	### Training Core

	`src/ltx_trainer/trainer.py` - Main training loop (`LtxvTrainer`)

	- Implements distributed training with Accelerate
	- Handles mixed precision, gradient accumulation, checkpointing
	- `_training_step()` applies embedding connectors then delegates to strategy
	- `_load_text_encoder_and_cache_embeddings()` loads the text encoder + embeddings processor, caches validation
	embeddings, then unloads the Gemma LLM (keeps only the embeddings processor connectors for training)
	- Uses training strategies for mode-specific logic

	`src/ltx_trainer/training_strategies/` - Strategy pattern

	- `base_strategy.py`: `TrainingStrategy` ABC, `ModelInputs` dataclass
	- `flexible.py`: FlexibleStrategy — unified conditioning framework (recommended)
	- `text_to_video.py`: TextToVideoStrategy (deprecated — use FlexibleStrategy)
	- `video_to_video.py`: VideoToVideoStrategy (deprecated — use FlexibleStrategy)

	Key methods each strategy implements:

	- `prepare_training_inputs()` - Convert batch to `ModelInputs` with `Modality` objects
	- `compute_loss()` - Calculate training loss (velocity prediction, MSE with masking)

	The strategy's config declares its data directories via `get_data_sources()` (single source of truth, used for both dataset wiring and existence validation).
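To make the `compute_loss()` description concrete, here is a minimal sketch of a masked MSE on velocity predictions, with nested lists standing in for tensors. The function name and the velocity target `noise - clean` are assumptions made for illustration (a common flow-matching convention); the actual target and reduction used by the strategies may differ.

```python
def masked_velocity_mse(
    pred: list[list[float]],
    clean: list[list[float]],
    noise: list[list[float]],
    loss_mask: list[bool],
) -> float:
    """MSE between predicted and target velocity over tokens where loss_mask is True."""
    total, count = 0.0, 0
    for pred_tok, clean_tok, noise_tok, keep in zip(pred, clean, noise, loss_mask):
        if not keep:
            continue  # conditioning tokens contribute nothing to the loss
        for p, x0, eps in zip(pred_tok, clean_tok, noise_tok):
            target = eps - x0  # assumed velocity target: noise minus clean latent
            total += (p - target) ** 2
            count += 1
    return total / count if count else 0.0


# Two tokens with two channels each; the first token is conditioning and is masked out.
clean = [[1.0, 2.0], [0.0, 0.0]]
noise = [[3.0, 5.0], [1.0, 1.0]]
pred = [[0.0, 0.0], [1.0, 3.0]]
loss = masked_velocity_mse(pred, clean, noise, loss_mask=[False, True])
```

Only the second token is scored, so the large error on the masked first token does not affect the result.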

	`src/ltx_trainer/model_loader.py` - Model loading

	Component loaders:

	- `load_transformer()` → `LTXModel`
	- `load_video_vae_encoder()` → `VideoEncoder`
	- `load_video_vae_decoder()` → `VideoDecoder`
	- `load_audio_vae_decoder()` → `AudioDecoder`
	- `load_vocoder()` → `Vocoder` or `VocoderWithBWE` (auto-detected)
	- `load_text_encoder(gemma_model_path)` → `GemmaTextEncoder` (pure Gemma LLM, no checkpoint needed)
	- `load_embeddings_processor(checkpoint_path)` → `EmbeddingsProcessor` (feature extractor + connectors)
	- `load_model()` → `LtxModelComponents` (convenience wrapper)

	`src/ltx_trainer/validation_runner.py` - Conditioned validation sampling

	- Manages the full validation lifecycle: embedding caching, media encoding, denoising, decoding
	- Supports all validation condition types: first_frame, prefix, suffix, spatial_crop, mask, reference, video_to_audio, audio_to_video
	- Handles frozen modality paths (sigma=0 for conditioning modality)
	- Builds conditioning items using ltx-core's `VideoConditionByLatentIndex`, `VideoConditionByReferenceLatent`, `VideoConditionByMask`
	- Optional side-by-side reference output for IC-LoRA validation

	`src/ltx_trainer/timestep_samplers.py` - Flow matching timestep sampling

	- `UniformTimestepSampler` - Uniform sampling in `[min, max]`
	- `ShiftedLogitNormalTimestepSampler` - Stretched shifted logit-normal distribution with:
	  - Shift determined by sequence length (more noise at higher token counts)
	  - Percentile stretching for better `[0, 1]` coverage
	  - Uniform fallback (10% of samples) to prevent distribution collapse
	  - Reflection around `eps` for numerical stability near zero
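The sampler behavior listed above can be approximated with a simplified standard-library sketch. The function names, the linear interpolation of the shift, and the numeric defaults (`1024`, `4096`, `0.95`, `2.05`, `eps=1e-3`) are all assumptions chosen for illustration, and percentile stretching is omitted; this is not the code in `timestep_samplers.py`.

```python
import math
import random


def shift_for_seq_len(
    seq_len: int,
    min_tokens: int = 1024,
    max_tokens: int = 4096,
    min_shift: float = 0.95,
    max_shift: float = 2.05,
) -> float:
    """Interpolate the logit-normal mean linearly in sequence length (clamped)."""
    frac = (seq_len - min_tokens) / (max_tokens - min_tokens)
    frac = min(max(frac, 0.0), 1.0)
    return min_shift + frac * (max_shift - min_shift)


def sample_sigma(
    seq_len: int, rng: random.Random, uniform_prob: float = 0.1, eps: float = 1e-3
) -> float:
    """Draw one noise level: mostly shifted logit-normal, sometimes uniform."""
    if rng.random() < uniform_prob:
        sigma = rng.random()  # uniform fallback
    else:
        z = rng.gauss(shift_for_seq_len(seq_len), 1.0)
        sigma = 1.0 / (1.0 + math.exp(-z))  # sigmoid maps the normal draw into (0, 1)
    if sigma < eps:
        sigma = 2 * eps - sigma  # reflect values that fall below eps
    return sigma


rng = random.Random(0)
short_clip = [sample_sigma(512, rng) for _ in range(5000)]
long_clip = [sample_sigma(8192, rng) for _ in range(5000)]
```

With these placeholder settings, longer sequences receive a larger shift and therefore higher noise levels on average.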

	`src/ltx_trainer/gemma_8bit.py` - 8-bit text encoder loading

	Bypasses ltx-core's standard loading path to enable bitsandbytes 8-bit quantization of the Gemma backbone. Manually
	constructs the `GemmaTextEncoder` with quantized model, feature extractor, and embeddings processor.

	### Data

	`src/ltx_trainer/datasets.py` - Dataset handling

	- `PrecomputedDataset` loads pre-computed VAE latents and text embeddings
	- Supports video latents, audio latents, text embeddings, reference video latents, reference audio latents, video masks, and audio masks
	- Handles legacy patchified format `[seq_len, C]` → automatically unpatchifies to `[C, F, H, W]`
	- `DummyDataset` for benchmarking and minimal testing
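The legacy-format conversion mentioned above amounts to an index rearrangement. The sketch below uses nested lists and a hypothetical `unpatchify` helper; it assumes 1 × 1 × 1 patches, tokens stored in row-major (frame, row, column) order, and that the latent grid size is known to the caller. None of these details are confirmed by `datasets.py`.

```python
def unpatchify(
    tokens: list[list[float]], frames: int, height: int, width: int
) -> list[list[list[list[float]]]]:
    """Rearrange [seq_len, C] tokens into nested lists shaped [C, F, H, W]."""
    if not tokens or len(tokens) != frames * height * width:
        raise ValueError("token count must equal frames * height * width")
    channels = len(tokens[0])
    return [
        [
            [
                # Assumed token order: index = (f * height + h) * width + w
                [tokens[(f * height + h) * width + w][c] for w in range(width)]
                for h in range(height)
            ]
            for f in range(frames)
        ]
        for c in range(channels)
    ]


# Four tokens with two channels each, arranged as a 1 x 2 x 2 latent grid.
tokens = [[float(i), 10.0 + i] for i in range(4)]
latents = unpatchify(tokens, frames=1, height=2, width=2)
```

The output has the channel axis first, matching the `[C, F, H, W]` layout the dataset expects.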

	## Common Development Tasks

	### Agent-Assisted Training

	When a user asks to train, fine-tune, create a LoRA, or produce a custom LTX-2 model, use the repository skill at
	[`.claude/skills/train-model`](../../.claude/skills/train-model/SKILL.md). The skill is the orchestrator for dataset probing, mode selection, preprocessing,
	training launch, monitoring, and post-train validation; it treats `packages/ltx-trainer/docs/` as the source of truth.

	### Adding a New Configuration Parameter

	1. Add field to appropriate config class in `src/ltx_trainer/config.py`
	2. Add validator if needed
	3. Update ALL config files in `configs/`
	4. Update `docs/configuration-reference.md`

	### Implementing a New Training Strategy

	The `FlexibleStrategy` now covers all use cases (T2V, T2A, I2V, V2V, A2A, AV2AV, inpainting, outpainting, extension, A2V, V2A, IC-LoRA) through
	configuration alone. A new strategy is only needed for fundamentally different training paradigms that cannot be
	expressed via `ModalityConfig` + `ConditionConfig` combinations.

	If you do need a new strategy:

	1. Create new file in `src/ltx_trainer/training_strategies/`
	2. Create config class inheriting `TrainingStrategyConfigBase` and implement `get_data_sources()`
	3. Create strategy class inheriting `TrainingStrategy`
	4. Implement: `prepare_training_inputs()`, `compute_loss()`
	5. Add to `__init__.py`: import, add to `TrainingStrategyConfig` union, update factory
	6. Add discriminator tag to config.py's `TrainingStrategyConfig`
	7. Create example config file in `configs/`

	### Working with Modalities

	```python
	from dataclasses import replace
	from ltx_core.model.transformer.modality import Modality

	# Create modality — all fields except enabled and masks are required
	video = Modality(
	    enabled=True,
	    latent=latents,  # [B, seq_len, 128]
	    sigma=sigma,  # [B,] per-batch noise level
	    timesteps=timesteps,  # [B, seq_len] per-token (sigma * denoise_mask)
	    positions=positions,  # [B, 3, seq_len, 2]
	    context=context,  # text embeddings from embeddings_processor
	    context_mask=None,
	)

	# Update (immutable — must use replace)
	video = replace(video, latent=new_latent, sigma=new_sigma, timesteps=new_timesteps)

	# Disable a modality
	audio = replace(audio, enabled=False)
	```

	### Working with the Text Encoder

	```python
	# Full forward pass (used for validation — runs all 3 blocks)
	video_embeds, audio_embeds, attention_mask = text_encoder(prompt)

	# Precompute features (used in process_captions.py — runs blocks 1+2 only)
	video_features, audio_features, attention_mask = text_encoder.precompute(prompt, padding_side="left")

	# Apply connectors during training (block 3 only)
	additive_mask = text_encoder._convert_to_additive_mask(attention_mask, video_features.dtype)
	video_embeds, audio_embeds, binary_mask = text_encoder.embeddings_processor.create_embeddings(
	    video_features, audio_features, additive_mask
	)
	```

	## Debugging Tips

	Training Issues:

	- Check logs first (rich logger provides context)
	- GPU memory: Look for OOM errors, enable `enable_gradient_checkpointing: true`
	- Distributed training: Check `accelerator.state` and device placement

	Model Loading:

	- Ensure `model_path` points to a local `.safetensors` file
	- Ensure `text_encoder_path` points to a Gemma model directory
	- URLs are NOT supported for model paths
	- For 8-bit loading: ensure `bitsandbytes` is installed

	Configuration:

	- Validation errors: Check validators in `config.py`
	- Unknown fields: Config uses `extra="forbid"` — all fields must be defined
	- FlexibleStrategy requires at least one modality with `is_generated: true`
	- Audio modality cannot use `first_frame` or `spatial_crop` conditions
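The last two rules can be illustrated with a plain-dictionary check. The function `check_flexible_strategy` is hypothetical and only mirrors the constraints stated above; the real checks are Pydantic validators in `config.py`, and their error messages will differ.

```python
from __future__ import annotations

AUDIO_FORBIDDEN_CONDITIONS = {"first_frame", "spatial_crop"}


def check_flexible_strategy(video: dict | None, audio: dict | None) -> list[str]:
    """Collect violations of the two FlexibleStrategy rules (illustrative only)."""
    errors: list[str] = []
    if not any(m and m.get("is_generated") for m in (video, audio)):
        errors.append("at least one modality must set is_generated: true")
    for condition in (audio or {}).get("conditions", []):
        if condition.get("type") in AUDIO_FORBIDDEN_CONDITIONS:
            errors.append(f"audio modality cannot use '{condition['type']}' conditions")
    return errors


ok = check_flexible_strategy(
    video={"is_generated": True, "conditions": [{"type": "first_frame"}]},
    audio=None,
)
bad = check_flexible_strategy(
    video={"is_generated": False},
    audio={"is_generated": False, "conditions": [{"type": "spatial_crop"}]},
)
```

The first call describes a valid image-to-video setup; the second violates both rules at once.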

	Precomputed Data:

	- Legacy data (`prompt_embeds`) works via backward-compat in `_training_step()`
	- New data (`video_prompt_embeds` + `audio_prompt_embeds`) is the expected format
	- Latents must be in `[C, F, H, W]` format (legacy `[seq_len, C]` is auto-converted)

	## Key Constraints

	### Frame Requirements

	Frames must satisfy `frames % 8 == 1`:

	- ✅ Valid: 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 121
	- ❌ Invalid: 24, 32, 48, 64, 100

	### Resolution Requirements

	Width and height must be divisible by 32.
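When preparing data, it can help to round arbitrary clip lengths and sizes down to the nearest values that satisfy both constraints. The helpers `snap_frames` and `snap_resolution` below are hypothetical utilities written for this example, not functions provided by the trainer.

```python
def snap_frames(frames: int) -> int:
    """Largest frame count <= frames that satisfies frames % 8 == 1."""
    if frames < 1:
        raise ValueError("at least one frame is required")
    return (frames - 1) // 8 * 8 + 1


def snap_resolution(pixels: int) -> int:
    """Largest multiple of 32 that does not exceed pixels."""
    if pixels < 32:
        raise ValueError("dimension must be at least 32 pixels")
    return pixels // 32 * 32


frames = snap_frames(100)
height = snap_resolution(1080)
width = snap_resolution(1920)
```

A 100-frame 1920×1080 clip would be trimmed to 97 frames at 1920×1056 under these rules.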

	### Model Paths

	- Must be local paths (URLs not supported)
	- `model_path`: Path to `.safetensors` checkpoint
	- `text_encoder_path`: Path to Gemma model directory

	### Platform Requirements

	- Linux required (uses `triton` which is Linux-only)
	- CUDA GPU with 32GB+ VRAM recommended

	## Reference: ltx-core Key Components

	```
	packages/ltx-core/src/ltx_core/
	├── model/
	│   ├── transformer/
	│   │   ├── model.py # LTXModel (diffusion transformer)
	│   │   ├── modality.py # Modality dataclass
	│   │   ├── transformer.py # BasicAVTransformerBlock
	│   │   ├── transformer_args.py # TransformerArgsPreprocessor (sigma → prompt AdaLN)
	│   │   ├── model_configurator.py # LTXModelConfigurator (version-aware)
	│   │   └── timestep_embedding.py # Timestep/sigma embedding
	│   ├── video_vae/
	│   │   ├── video_vae.py # VideoEncoder, VideoDecoder
	│   │   └── model_configurator.py # VideoEncoderConfigurator, VideoDecoderConfigurator
	│   ├── audio_vae/
	│   │   ├── audio_vae.py # AudioEncoder, AudioDecoder
	│   │   └── vocoder.py # Vocoder, VocoderWithBWE (output_sampling_rate)
	│   └── common/ # Shared model components
	├── text_encoders/gemma/
	│   ├── __init__.py # Exports: GemmaTextEncoder, GemmaTextEncoderConfigurator,
	│   │               # AV_GEMMA_TEXT_ENCODER_KEY_OPS, GEMMA_MODEL_OPS,
	│   │               # module_ops_from_gemma_root
	│   ├── encoders/
	│   │   ├── base_encoder.py # GemmaTextEncoder (unified 3-block pipeline)
	│   │   └── encoder_configurator.py # GemmaTextEncoderConfigurator, _create_feature_extractor
	│   ├── feature_extractor.py # FeatureExtractorV1 (19B), FeatureExtractorV2 (22B)
	│   ├── embeddings_connector.py # Embeddings1DConnector, Embeddings1DConnectorConfigurator,
	│   │                           # AudioEmbeddings1DConnectorConfigurator
	│   ├── embeddings_processor.py # EmbeddingsProcessor (wraps video + audio connectors)
	│   └── tokenizer.py # LTXVGemmaTokenizer
	├── components/
	│   ├── schedulers.py # LTX2Scheduler
	│   ├── diffusion_steps.py # EulerDiffusionStep
	│   ├── guiders.py # CFGGuider, STGGuider
	│   └── patchifiers.py # VideoLatentPatchifier, AudioPatchifier
	├── conditioning/ # ConditioningItem, mask_utils, types
	├── tools.py # VideoLatentTools, AudioLatentTools
	├── loader/
	│   ├── single_gpu_model_builder.py # SingleGPUModelBuilder
	│   ├── sft_loader.py # SafetensorsModelStateDictLoader
	│   └── sd_ops.py # Key remapping (SDOps)
	└── types.py # SpatioTemporalScaleFactors, VideoLatentShape, AudioLatentShape
	```
