Buckets:

linoyts's picture
|
download
raw
27.7 kB

Configuration Reference

The trainer uses structured Pydantic models for configuration, making it easy to customize training parameters. This guide covers all available configuration options and their usage.

๐Ÿ“‹ Overview

The main configuration class is LtxTrainerConfig, which includes the following sub-configurations:

  • ModelConfig: Base model and training mode settings
  • LoraConfig: LoRA training parameters
  • TrainingStrategyConfig: Training strategy settings (flexible conditioning framework)
  • OptimizationConfig: Learning rate, batch sizes, and scheduler settings
  • AccelerationConfig: Mixed precision and quantization settings
  • DataConfig: Data loading parameters
  • ValidationConfig: Validation and inference settings
  • CheckpointsConfig: Checkpoint saving frequency and retention settings
  • HubConfig: Hugging Face Hub integration settings
  • WandbConfig: Weights & Biases logging settings
  • FlowMatchingConfig: Timestep sampling parameters

๐Ÿ“„ Example Configuration Files

Check out our example configurations in the configs directory:

โš™๏ธ Configuration Sections

The YAML snippets below show recommended starting values, not necessarily the code defaults. Fields you omit from your config file will use the code defaults from config.py.

ModelConfig

Controls the base model and training mode settings.

model:
  model_path: "/path/to/ltx-2-model.safetensors"  # Local path to model checkpoint
  text_encoder_path: "/path/to/gemma-model"       # Path to Gemma text encoder directory
  training_mode: "lora"                           # "lora" or "full"
  load_checkpoint: null                           # Path to checkpoint to resume from

Key parameters:

Parameter Description
model_path Required. Local path to the LTX-2 model checkpoint (.safetensors file). URLs are not supported.
text_encoder_path Required. Path to the Gemma text encoder model directory. Download from HuggingFace.
training_mode Training approach - "lora" for LoRA training or "full" for full-rank fine-tuning.
load_checkpoint Optional path to resume training from a checkpoint file or directory.

LTX-2 requires both a model checkpoint and a Gemma text encoder. Both must be local paths.

LoraConfig

LoRA-specific fine-tuning parameters (only used when training_mode: "lora").

lora:
  rank: 32         # LoRA rank (higher = more parameters)
  alpha: 32        # LoRA alpha scaling factor
  dropout: 0.0     # Dropout probability (0.0-1.0)
  target_modules: # Modules to apply LoRA to
    - "to_k"
    - "to_q"
    - "to_v"
    - "to_out.0"

Key parameters:

Parameter Description
rank LoRA rank - higher values mean more trainable parameters (typical range: 8-128)
alpha Alpha scaling factor - typically set equal to rank
dropout Dropout probability for regularization
target_modules List of transformer modules to apply LoRA adapters to (see below)

Understanding Target Modules

The LTX-2 transformer has separate attention and feed-forward blocks for video and audio, as well as cross-attention modules that enable the two modalities to exchange information. Choosing the right target_modules is critical for achieving good results, especially when training with audio.

Video-only modules:

Module Pattern Description
attn1.to_k, attn1.to_q, attn1.to_v, attn1.to_out.0 Video self-attention
attn2.to_k, attn2.to_q, attn2.to_v, attn2.to_out.0 Video cross-attention (to text)
ff.net.0.proj, ff.net.2 Video feed-forward network

Audio-only modules:

Module Pattern Description
audio_attn1.to_k, audio_attn1.to_q, audio_attn1.to_v, audio_attn1.to_out.0 Audio self-attention
audio_attn2.to_k, audio_attn2.to_q, audio_attn2.to_v, audio_attn2.to_out.0 Audio cross-attention (to text)
audio_ff.net.0.proj, audio_ff.net.2 Audio feed-forward network

Audio-video cross-attention modules:

These modules enable bidirectional information flow between the audio and video modalities:

Module Pattern Description
audio_to_video_attn.to_k, audio_to_video_attn.to_q, audio_to_video_attn.to_v, audio_to_video_attn.to_out.0 Video attends to audio (Q from video, K/V from audio)
video_to_audio_attn.to_k, video_to_audio_attn.to_q, video_to_audio_attn.to_v, video_to_audio_attn.to_out.0 Audio attends to video (Q from audio, K/V from video)

Recommended configurations:

For video-only training, target the video attention layers:

target_modules:
  - "attn1.to_k"
  - "attn1.to_q"
  - "attn1.to_v"
  - "attn1.to_out.0"
  - "attn2.to_k"
  - "attn2.to_q"
  - "attn2.to_v"
  - "attn2.to_out.0"

For audio-video training, use patterns that match both branches:

target_modules:
  - "to_k"
  - "to_q"
  - "to_v"
  - "to_out.0"

Using shorter patterns like "to_k" will match all attention modules including attn1.to_k, audio_attn1.to_k, audio_to_video_attn.to_k, and video_to_audio_attn.to_k, effectively training video, audio, and cross-modal attention branches together.

You can also target the feed-forward (FFN) modules (ff.net.0.proj, ff.net.2 for video, audio_ff.net.0.proj, audio_ff.net.2 for audio) to increase the LoRA's capacity and potentially help it capture the target distribution better.

TrainingStrategyConfig

Configures the training strategy. The recommended strategy is "flexible", which supports all conditioning scenarios through configuration.

Flexible Strategy

The flexible strategy provides a unified conditioning framework. Each modality (video, audio) is configured independently with its own latents directory, generation flag, and list of conditions.

training_strategy:
  name: "flexible"
  video:
    is_generated: true                 # Video is denoised during training
    latents_dir: "latents"             # Directory containing precomputed video latents
    conditions:
      - type: first_frame              # Use first frame as conditioning
        probability: 0.5               # Apply this condition 50% of the time
  audio:
    is_generated: true                 # Audio is denoised during training
    latents_dir: "audio_latents"       # Directory containing precomputed audio latents
    conditions: []                     # No additional audio conditions (text-only)

ModalityConfig parameters:

Parameter Description
is_generated true = modality is denoised (contributes to loss). false = frozen conditioning (sigma=0, no loss).
latents_dir Directory name within preprocessed_data_root containing precomputed latents for this modality.
conditions List of conditioning configs applied during training (see condition types below). Text conditioning is implicit.

Condition types:

Type Parameters Description
first_frame probability First latent frame is clean, excluded from loss. Video only.
prefix temporal_boundary, probability First N latent temporal units are clean. For extension forward.
suffix temporal_boundary, probability Last N latent temporal units are clean. For extension backward.
spatial_crop spatial_region (y1, x1, y2, x2 in px), probability Rectangular region is clean, excluded from loss. For outpainting. Video only.
mask mask_dir, probability Per-sample mask directory. Masks are thresholded at 0.5; 1 means conditioning, 0 means generate.
reference latents_dir, probability IC-LoRA style concatenation. Reference tokens are prepended, clean (timestep=0), no loss.

The prefix, suffix, mask, and reference condition types work on both video and audio modalities โ€” place them in the video.conditions or audio.conditions list as appropriate. first_frame and spatial_crop are video-only conditions.

Training conditions reference directories of precomputed data (within preprocessed_data_root), while validation conditions reference individual files (images, videos, masks) that are encoded on-the-fly during validation. The condition type names are the same, but the fields differ.

The legacy text_to_video and video_to_video strategies are deprecated but remain forward-compatible. New configs should use name: "flexible".

OptimizationConfig

Training optimization parameters including learning rates, batch sizes, and schedulers.

optimization:
  learning_rate: 1e-4                  # Learning rate
  steps: 2000                          # Total training steps
  batch_size: 1                        # Batch size per GPU
  gradient_accumulation_steps: 1       # Steps to accumulate gradients
  max_grad_norm: 1.0                   # Gradient clipping threshold
  optimizer_type: "adamw"              # "adamw" or "adamw8bit"
  scheduler_type: "linear"             # Scheduler type
  scheduler_params: { }                # Additional scheduler parameters
  enable_gradient_checkpointing: true  # Memory optimization

Key parameters:

Parameter Description
learning_rate Learning rate for optimization (typical range: 1e-5 to 1e-3)
steps Total number of training steps
batch_size Batch size per GPU (reduce if running out of memory)
gradient_accumulation_steps Accumulate gradients over multiple steps
scheduler_type LR scheduler: "constant", "linear", "cosine", "cosine_with_restarts", "polynomial", "step"
enable_gradient_checkpointing Trade training speed for GPU memory savings (recommended for large models)

AccelerationConfig

Hardware acceleration and compute optimization settings.

acceleration:
  mixed_precision_mode: "bf16"                  # "no", "fp16", or "bf16"
  quantization: null                            # Quantization options
  load_text_encoder_in_8bit: false              # Load text encoder in 8-bit
  offload_optimizer_during_validation: false    # Offload optimizer state to CPU during validation

Key parameters:

Parameter Description
mixed_precision_mode Precision mode - "bf16" recommended for modern GPUs
quantization Model quantization: null, "int8-quanto", "int4-quanto", "int2-quanto", "fp8-quanto", or "fp8uz-quanto"
load_text_encoder_in_8bit Load the Gemma text encoder in 8-bit to save GPU memory
offload_optimizer_during_validation Move optimizer state to CPU before validation video sampling and back afterwards. Useful when validation OOMs because VAE decoder + transformer + optimizer state can't coexist on the GPU (full fine-tune, high-rank LoRA). No effect for FSDP.

DataConfig

Data loading and processing configuration.

data:
  preprocessed_data_root: "/path/to/preprocessed/data"  # Path to precomputed dataset
  num_dataloader_workers: 2                             # Background data loading workers

Key parameters:

Parameter Description
preprocessed_data_root Path to your preprocessed dataset directory produced by process_dataset.py (contains latents/, conditions/, etc.)
num_dataloader_workers Number of parallel data loading processes (0 = synchronous loading, useful when debugging)

ValidationConfig

Validation and inference settings for monitoring training progress. Validation samples use a self-describing format where each sample specifies its own prompt and conditions.

validation:
  samples:
    - prompt: "A cat playing with a ball"
      conditions:
        - type: first_frame
          image_or_video: "/path/to/image.png"
    - prompt: "A dog running in a field"
  video_dims: [576, 576, 89]                # Output dimensions: [width, height, frames]
  negative_prompt: "worst quality, inconsistent motion, blurry, jittery, distorted"  # Negative prompt for all samples
  frame_rate: 25.0                          # Output video frame rate (fps)
  seed: 42                                  # Random seed for reproducibility
  inference_steps: 30                       # Number of denoising steps
  interval: 100                             # Run validation every N steps (null to disable)
  guidance_scale: 4.0                       # CFG scale (higher = stronger prompt adherence)
  stg_scale: 1.0                            # STG scale (0.0 to disable)
  stg_blocks: [29]                          # Transformer blocks to apply STG perturbation
  stg_mode: "stg_av"                        # STG mode: "stg_av" (audio+video) or "stg_v" (video only)
  generate_audio: true                      # Whether to generate audio during validation
  generate_video: true                      # Whether to generate video during validation
  skip_initial_validation: false            # Skip validation at step 0

Key parameters:

Parameter Description
samples List of ValidationSample objects (see below). Replaces the legacy prompts/images/reference_videos fields.
video_dims Output dimensions [width, height, frames]. Width/height must be divisible by 32, frames must satisfy frames % 8 == 1
interval Steps between validation runs (set to null to disable)
guidance_scale CFG (Classifier-Free Guidance) scale. Recommended: 4.0
stg_scale STG (Spatio-Temporal Guidance) scale. 0.0 disables STG. Recommended: 1.0
stg_blocks Transformer blocks to perturb for STG. Recommended: [29] (single block)
stg_mode STG mode: "stg_av" perturbs both audio and video, "stg_v" perturbs video only
generate_audio Whether to generate audio in validation samples
generate_video Whether to generate video in validation samples. Set to false for V2A (video-to-audio) validation. Default: true
skip_initial_validation Skip validation video sampling at step 0 (beginning of training)

ValidationSample

Each sample in the samples list has:

Field Description
prompt Text prompt for this validation sample.
conditions List of validation conditions (see types below). Empty list = text-only generation.
video_dims Optional per-sample override for (width, height, frames). Inherits from ValidationConfig if not set.
seed Optional per-sample override for random seed. Inherits from ValidationConfig if not set.

Validation Condition Types

Type Parameters Description
first_frame image_or_video (path) Use the first frame of the image/video as conditioning.
prefix video or audio (path), optional num_frames/duration Use a video/audio clip as temporal prefix (for extension forward).
suffix video or audio (path), optional num_frames/duration Use a video/audio clip as temporal suffix (for extension backward).
spatial_crop video (path), spatial_region (y1, x1, y2, x2) Provide spatial context for outpainting. Video only.
mask video or audio (path), mask (path) Mask-based inpainting with a binary mask file.
reference video or audio (path), optional video-reference downscale_factor, temporal_scale_factor, include_in_output IC-LoRA style reference conditioning.
video_to_audio video (path) Freeze video, generate audio. For Foley/V2A tasks.
audio_to_video audio (path) Freeze audio, generate video. For audio-driven generation.

For video reference validation conditions, downscale_factor is the spatial reference scale and temporal_scale_factor is the temporal reference scale. Set both to match the factors used when preprocessing video reference latents for training; validation media is encoded on the fly and cannot infer those factors from the training dataset.

The legacy fields prompts, images, and reference_videos are deprecated but auto-converted to samples internally. New configs should use the samples format.

CheckpointsConfig

Model checkpointing configuration.

checkpoints:
  interval: 250       # Steps between checkpoint saves (null = disabled)
  keep_last_n: 3      # Number of recent checkpoints to retain
  precision: bfloat16 # Precision for saved weights (bfloat16 or float32)
  no_resume: false            # Ignore saved state, start from step 0
  save_training_state: "minimal"  # "full", "minimal", or "off"

Key parameters:

Parameter Description
interval Steps between intermediate checkpoint saves (set to null to disable)
keep_last_n Number of most recent checkpoints to keep (-1 = keep all)
precision Precision for saved checkpoint weights: "bfloat16" (default) or "float32"
no_resume When true, ignore saved training state and start from step 0. Model weights from load_checkpoint are still loaded.
save_training_state Save training state for resume: "full" (optimizer + scheduler + RNG), "minimal" (scheduler + RNG only, sufficient for LoRA), "off" (no resume).

HubConfig

Hugging Face Hub integration for automatic model uploads.

hub:
  push_to_hub: false                   # Enable Hub uploading
  hub_model_id: "username/model-name"  # Hub repository ID

Key parameters:

Parameter Description
push_to_hub Whether to automatically push trained models to Hugging Face Hub
hub_model_id Repository ID in format "username/repository-name"

WandbConfig

Weights & Biases logging configuration.

wandb:
  enabled: false               # Enable W&B logging
  project: "ltx-2-trainer"     # W&B project name
  entity: null                 # W&B username or team
  tags: [ ]                    # Tags for the run
  log_validation_videos: true  # Log validation videos to W&B

Key parameters:

Parameter Description
enabled Whether to enable W&B logging
project W&B project name
entity W&B username or team (null uses default account)
log_validation_videos Whether to log validation videos to W&B

FlowMatchingConfig

Flow matching training configuration for timestep sampling.

flow_matching:
  timestep_sampling_mode: "shifted_logit_normal"  # Timestep sampling strategy
  timestep_sampling_params: { }                   # Additional sampling parameters

Key parameters:

Parameter Description
timestep_sampling_mode Sampling strategy: "uniform" or "shifted_logit_normal"
timestep_sampling_params Additional parameters for the sampling strategy

General Configuration

Top-level settings for the training run.

seed: 42                                  # Random seed for reproducibility
output_dir: "outputs/my_training_run"     # Directory for outputs (checkpoints, validation videos, logs)
Parameter Description
seed Random seed for reproducibility (default: 42)
output_dir Directory to save outputs (default: "outputs")

๐Ÿš€ Next Steps

Once you've configured your training parameters:

Xet Storage Details

Size:
27.7 kB
ยท
Xet hash:
5acd4c05947f8c991e7dd0636766c5867caf99d16190781a5eda7134ce0c1ce7

Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.