Buckets:
AGENTS.md
This file provides guidance to AI coding assistants (Claude, Cursor, etc.) when working with code in this repository.
Project Overview
LTX Trainer is a training toolkit for fine-tuning the Lightricks LTX audio-video generation models. It supports:
- Text-to-video (T2V) - Generate video from text prompts
- Text-to-audio (T2A) - Generate audio from text prompts
- Image-to-video (I2V) - Generate video conditioned on a first frame
- Video extension - Forward (prefix) and backward (suffix) video continuation
- Video inpainting - Mask-based spatial/temporal inpainting
- Video outpainting - Spatial crop-based outpainting
- IC-LoRA video-to-video - In-context control adapters for style/structure transfer
- Audio-to-video (A2V) and Video-to-audio (V2A) - Cross-modal generation with frozen conditioning
- Audio extension - Forward (prefix) and backward (suffix) audio continuation
- Audio inpainting - Mask-based audio inpainting
- IC-LoRA audio-to-audio (A2A) - Audio reference conditioning for style transfer
- AV2AV IC-LoRA - Combined video and audio reference conditioning
- LoRA training - Efficient fine-tuning with adapters
- Full fine-tuning - Complete model training
All conditioning scenarios are expressed through the unified FlexibleStrategy configuration.
Supported model versions:
- LTX-2 (19B, initial audio-video model)
- LTX-2.3 (22B, improved text conditioning and audio quality)
Version detection is fully automatic — ltx-core reads the checkpoint config and selects the correct architecture components. The trainer does not need version-specific code paths.
Key Dependencies:
ltx-core- Core model implementations (transformer, VAE, text encoder, scheduler)ltx-pipelines- Inference pipeline components
Important: This trainer only supports LTX-2 and later (audio-video models). The older LTXV (video-only) models are not supported.
Architecture Overview
Package Structure
packages/ltx-trainer/
├── src/ltx_trainer/ # Main training module
│ ├── __init__.py # Logger setup, path config
│ ├── config.py # Pydantic configuration models
│ ├── config_display.py # Config pretty-printing
│ ├── trainer.py # Main training orchestration with Accelerate
│ ├── model_loader.py # Model loading using ltx-core
│ ├── validation_runner.py # ValidationRunner — conditioned validation sampling
│ ├── datasets.py # PrecomputedDataset, DummyDataset
│ ├── training_strategies/ # Strategy pattern for different training modes
│ │ ├── __init__.py # Factory function: get_training_strategy()
│ │ ├── base_strategy.py # TrainingStrategy ABC, ModelInputs, TrainingStrategyConfigBase
│ │ ├── flexible.py # FlexibleStrategy, FlexibleStrategyConfig [RECOMMENDED]
│ │ ├── text_to_video.py # TextToVideoStrategy, TextToVideoConfig [DEPRECATED]
│ │ └── video_to_video.py # VideoToVideoStrategy, VideoToVideoConfig [DEPRECATED]
│ ├── timestep_samplers.py # Flow matching timestep sampling
│ ├── gemma_8bit.py # 8-bit Gemma text encoder loading (bitsandbytes)
│ ├── quantization.py # Transformer INT8/INT4/FP8 quantization
│ ├── captioning.py # Video captioning utilities
│ ├── video_utils.py # Video I/O and processing
│ ├── gpu_utils.py # GPU memory helpers
│ ├── hf_hub_utils.py # HuggingFace Hub integration
│ ├── progress.py # Training progress display
│ └── utils.py # Image I/O helpers
├── scripts/ # User-facing CLI tools
│ ├── train.py # Main training script
│ ├── process_dataset.py # Dataset preprocessing (latents + captions)
│ ├── process_videos.py # Video latent encoding
│ ├── process_captions.py # Text embedding computation
│ ├── caption_videos.py # Automatic video captioning
│ ├── decode_latents.py # Latent decoding for debugging
│ ├── compute_reference.py # Generate IC-LoRA reference videos
│ └── split_scenes.py # Scene detection and splitting
├── configs/ # Example training configurations
│ ├── t2v_lora.yaml # Text-to-video LoRA
│ ├── t2v_lora_low_vram.yaml # Text-to-video LoRA (low VRAM)
│ ├── i2v_lora.yaml # Image-to-video LoRA
│ ├── v2v_ic_lora.yaml # IC-LoRA video-to-video
│ ├── a2v_lora.yaml # Audio-to-video LoRA
│ ├── v2a_lora.yaml # Video-to-audio LoRA
│ ├── video_extend_lora.yaml # Video extension (forward)
│ ├── video_suffix_lora.yaml # Video extension (backward)
│ ├── video_inpainting_lora.yaml # Video inpainting
│ ├── video_outpainting_lora.yaml # Video outpainting
│ ├── t2a_lora.yaml # Text-to-audio LoRA
│ ├── audio_extend_lora.yaml # Audio extension (forward)
│ ├── audio_suffix_lora.yaml # Audio extension (backward)
│ ├── audio_inpainting_lora.yaml # Audio inpainting
│ ├── a2a_ic_lora.yaml # Audio-to-audio IC-LoRA
│ ├── av2av_ic_lora.yaml # AV2AV IC-LoRA
│ └── accelerate/ # FSDP, DDP configs
├── tests/ # Pytest tests
└── docs/ # Documentation
Key Architectural Patterns
Model Loading:
ltx_trainer.model_loaderprovides component loaders usingltx-core- Individual loaders:
load_transformer(),load_video_vae_encoder(),load_video_vae_decoder(),load_text_encoder(),load_embeddings_processor(), etc. - Combined loader:
load_model()returnsLtxModelComponentsdataclass - Uses
SingleGPUModelBuilderfrom ltx-core internally - Text encoder and embeddings processor are loaded separately (the text encoder only needs Gemma weights; the embeddings processor only needs the LTX checkpoint)
- 8-bit text encoder loading via
gemma_8bit.py(bitsandbytes)
Training Flow:
- Configuration loaded via Pydantic models in
config.py LtxvTrainerclass orchestrates the training loop- Text encoder loaded on GPU → validation embeddings cached → heavy components unloaded (only
embeddings_processorkept) - Each training step: embedding connectors applied → strategy prepares
ModelInputs→ transformer forward pass → strategy computes loss - Training strategies (
FlexibleStrategy) handle mode-specific logic including conditioning, masking, and loss computation - Accelerate handles distributed training, mixed precision, and device placement
- Data flows as precomputed latents through
PrecomputedDataset
Model Interface (Modality-based):
from ltx_core.model.transformer.modality import Modality
video = Modality(
enabled=True,
latent=video_latents, # [B, seq_len, 128] patchified latent tokens
sigma=sigma, # [B,] current noise level (per-batch)
timesteps=video_timesteps, # [B, seq_len] per-token timestep embeddings
positions=video_positions, # [B, 3, seq_len, 2] positional coordinates
context=video_embeds, # text conditioning embeddings
context_mask=None, # optional attention mask for text context
)
audio = Modality(
enabled=True,
latent=audio_latents,
sigma=sigma,
timesteps=audio_timesteps,
positions=audio_positions, # [B, 1, seq_len, 2]
context=audio_embeds,
context_mask=None,
)
# Forward pass returns predictions for both modalities
video_pred, audio_pred = model(video=video, audio=audio, perturbations=None)
Note:
Modalityis immutable (frozen dataclass). Usedataclasses.replace()to modify.
sigma vs timesteps: These serve different roles. timesteps is per-token (e.g. sigma * denoise_mask —
conditioning tokens get 0, noisy tokens get sigma). sigma is per-batch and is used for prompt AdaLN conditioning (
LTX-2.3) and cross-modality (video↔audio) attention conditioning (both versions).
Configuration System:
- All config in
src/ltx_trainer/config.py - Main class:
LtxTrainerConfig TrainingStrategyConfig- Union ofFlexibleStrategyConfig|TextToVideoConfig(deprecated) |VideoToVideoConfig(deprecated)FlexibleStrategyConfig- Unified strategy config withvideo/audioModalityConfigblocksModalityConfig- Per-modality config:is_generated,latents_dir,conditionslistConditionConfig- Discriminated union:FirstFrameConditionConfig,PrefixConditionConfig,SuffixConditionConfig,SpatialCropConditionConfig,MaskConditionConfig,ReferenceConditionConfigValidationSample- Per-sample validation config withprompt,conditions, optionalvideo_dims/seedoverridesValidationCondition- Discriminated union for validation conditions (first_frame, prefix, suffix, spatial_crop, mask, reference, video_to_audio, audio_to_video)- Uses Pydantic field validators and model validators
- Config uses
extra="forbid"— unknown fields cause validation errors - Config files in
configs/directory
LTX-2 vs LTX-2.3: Differences
Both model versions share the same latent space interface (see Latent Space Constants). The differences lie in how text conditioning and audio generation work. Version detection is automatic via checkpoint config — the trainer uses a unified API.
| Component | LTX-2 (19B) | LTX-2.3 (22B) |
|---|---|---|
| Feature extractor | FeatureExtractorV1: single aggregate_embed, same output for video and audio |
FeatureExtractorV2: separate video_aggregate_embed + audio_aggregate_embed, per-token RMSNorm |
| Caption projection | Inside the transformer (caption_projection) |
Inside the feature extractor (before connector) |
| Embeddings connectors | Same dimensions for video and audio | Separate dimensions (AudioEmbeddings1DConnectorConfigurator) |
| Prompt AdaLN | Not present (cross_attention_adaln=False) |
Active — modulates cross-attention to text using sigma |
| Vocoder | HiFi-GAN (Vocoder) |
BigVGAN v2 + bandwidth extension (VocoderWithBWE) |
How version detection works in ltx-core:
- Feature extractor:
_create_feature_extractor()checks for V2 config keys (caption_proj_before_connector, etc.). Present → V2; absent → V1. - Vocoder:
VocoderConfiguratorchecks forconfig["vocoder"]["bwe"]. Present →VocoderWithBWE; absent →Vocoder. - Transformer:
_build_caption_projections()checkscaption_proj_before_connector. True (V2) → no caption projection in transformer; False (V1) → caption projection created in transformer. - Embeddings connectors:
AudioEmbeddings1DConnectorConfiguratorreadsaudio_connector_*keys, falling back to video connector keys for V1 backward compatibility.
Text Encoder Pipeline
The GemmaTextEncoder implements a 3-block pipeline:
- Block 1 — Gemma LLM: Tokenizes text → runs through Gemma → extracts hidden states
- Block 2 — Feature extractor: Hidden states → normalized features (V1: single stream duplicated for video/audio; V2: separate video and audio projections)
- Block 3 — Embeddings processor: Features → embeddings connectors → final context embeddings for the transformer
Precomputed embeddings (offline): process_captions.py runs Blocks 1+2 via text_encoder.precompute() and saves
the results. Block 3 (connectors) is applied during training via
text_encoder.embeddings_processor.create_embeddings().
Precomputed embeddings formats:
- New format (from
precompute()): savesvideo_prompt_embeds,audio_prompt_embeds(optional),prompt_attention_mask - Legacy format (from old
_preprocess_text()): savesprompt_embeds,prompt_attention_mask
The trainer handles both formats in _training_step(): if video_prompt_embeds is present, it uses the new format;
otherwise, it duplicates prompt_embeds for both modalities (mirroring V1 behavior).
After caching validation embeddings, the trainer unloads heavy components to free VRAM:
self._text_encoder.model = None
self._text_encoder.tokenizer = None
self._text_encoder.feature_extractor = None
# Only embeddings_processor (connectors) remains — used during training
Latent Space Constants
These values are shared across all supported model versions:
| Constant | Value | Where used |
|---|---|---|
| Video latent channels | 128 | VAE encoder/decoder, patchifier, VideoLatentShape |
| Spatial compression | 32× (H and W) | SpatioTemporalScaleFactors.default(), config validators |
| Temporal compression | 8× | SpatioTemporalScaleFactors.default(), config validators |
| Frame constraint | frames % 8 == 1 |
Config validators, validation runner |
| Resolution constraint | Width and height divisible by 32 | Config validators, validation runner |
| Audio latent channels | 8 | AudioLatentShape, audio patchifier |
| Audio mel bins | 16 | AudioLatentShape, audio patchifier |
| Patchified token dim (video) | 128 (128 × 1 × 1 × 1) |
Transformer in_channels |
| Patchified token dim (audio) | 128 (8 × 16) |
Transformer audio_in_channels |
Development Commands
Setup and Installation
# From the repository root
uv sync
cd packages/ltx-trainer
Code Quality
# Run ruff linting and formatting
uv run ruff check .
uv run ruff format .
# Run pre-commit checks
uv run pre-commit run --all-files
Running Tests
cd packages/ltx-trainer
uv run pytest
Running Training
# Single GPU
uv run python scripts/train.py configs/t2v_lora.yaml
# Multi-GPU with Accelerate
uv run accelerate launch scripts/train.py configs/t2v_lora.yaml
Testing Standards
Structure
- Flat functions only — use
def test_*(), neverclass Test*with methods. Pytest collects standalone functions. - Only test public interfaces — never call private methods (
_method) directly. Verify private behavior indirectly through the public API.
What to Test
- Custom validators and business logic — cross-field validators, domain constraints, error paths. These catch real bugs.
- Behavioral tests — call the public method, verify the outputs have the right shape, values, and structure. One behavioral test is worth ten config-only tests.
- Edge cases and error paths — boundary conditions, composed behaviors, expected exceptions.
- Contract tests — required fields, rejected invalid inputs, safety mechanisms like
extra="forbid".
What NOT to Test
- Pydantic storing a value —
Foo(x=1); assert foo.x == 1tests Pydantic, not your code. If a behavioral test already creates the same config and uses it, the config-only test adds nothing. - Pydantic Literal defaults —
assert config.type == "first_frame"whentypeisLiteral["first_frame"]. - Pydantic default factories —
assert config.conditions == []when the field hasdefault_factory=list. - Tests already covered by behavioral tests — if
test_prefix_conditioningcreates a validPrefixConditionConfigand exercises it end-to-end, a separatetest_prefix_validthat just creates the same config is redundant. - Trivial instantiation tests —
strategy = Strategy(config); assert strategy.config is not Nonewhen every other test creates a strategy.
Keeping Tests DRY
- Use helper functions for repeated setup patterns (e.g.,
_make_strategy(video=_video_modality(...))instead of 6-8 lines of config/strategy creation per test). - Use named constants for test dimensions (e.g.,
VIDEO_SEQ_LEN,TOKENS_PER_FRAME) instead of magic numbers. - Merge tests that share identical setup — when 5+ tests call
prepare_training_inputswith the exact same config and batch, each checking one assertion, merge them into one test that checks all assertions. Pytest reports the exact failing line anyway. - Use
@pytest.mark.parametrizefor the same logic tested with different inputs (e.g., valid/invalid values for a field). - Use pytest fixtures for shared batch data and test directories, but prefer explicit helper functions over fixtures for strategy/config creation (makes the test self-documenting).
Code Standards
Type Hints
- Always use type hints for all function arguments and return values
- Use Python 3.10+ syntax:
list[str]notList[str],str | PathnotUnion[str, Path] - Use
pathlib.Pathfor file operations
Class Methods
- Mark methods as
@staticmethodif they don't access instance or class state - Use
@classmethodfor alternative constructors
AI/ML Specific
- Use
@torch.inference_mode()for inference (prefer over@torch.no_grad()) - Use
accelerator.devicefor distributed compatibility - Support mixed precision (
bfloat16via dtype parameters) - Use gradient checkpointing for memory-intensive training
Logging
- Use
from ltx_trainer import loggerfor all messages - Avoid print statements in production code
Important Files & Modules
Configuration (CRITICAL)
src/ltx_trainer/config.py - Master config definitions
Key classes:
LtxTrainerConfig- Main configuration containerModelConfig- Model paths, training mode (lora|full), checkpoint loadingTrainingStrategyConfig- Union ofFlexibleStrategyConfig|TextToVideoConfig(deprecated) |VideoToVideoConfig(deprecated)FlexibleStrategyConfig- Unified strategy config withvideo/audioModalityConfigblocksModalityConfig- Per-modality config:is_generated,latents_dir,conditionslistConditionConfig- Discriminated union:FirstFrameConditionConfig,PrefixConditionConfig,SuffixConditionConfig,SpatialCropConditionConfig,MaskConditionConfig,ReferenceConditionConfigValidationSample- Per-sample validation config withprompt,conditions, optionalvideo_dims/seedoverridesValidationCondition- Discriminated union for validation conditions (first_frame, prefix, suffix, spatial_crop, mask, reference, video_to_audio, audio_to_video)LoraConfig- Rank, alpha, dropout, target modulesOptimizationConfig- Learning rate, batch size, gradient accumulation, scheduler, gradient checkpointingAccelerationConfig- Mixed precision, quantization, 8-bit text encoderDataConfig- Preprocessed data root, dataloader workersValidationConfig- Prompts, video dimensions, CFG/STG guidance, audio generation, inference stepsCheckpointsConfig- Save interval, retention, precisionFlowMatchingConfig- Timestep sampling mode and parametersHubConfig- HuggingFace Hub push settingsWandbConfig- Weights & Biases logging
⚠️ When modifying config.py:
- Update ALL config files in
configs/ - Update
docs/configuration-reference.md - Test that all configs remain valid
Training Core
src/ltx_trainer/trainer.py - Main training loop (LtxvTrainer)
- Implements distributed training with Accelerate
- Handles mixed precision, gradient accumulation, checkpointing
_training_step()applies embedding connectors then delegates to strategy_load_text_encoder_and_cache_embeddings()loads the text encoder + embeddings processor, caches validation embeddings, then unloads the Gemma LLM (keeps only the embeddings processor connectors for training)- Uses training strategies for mode-specific logic
src/ltx_trainer/training_strategies/ - Strategy pattern
base_strategy.py:TrainingStrategyABC,ModelInputsdataclassflexible.py: FlexibleStrategy — unified conditioning framework (recommended)text_to_video.py: TextToVideoStrategy (deprecated — use FlexibleStrategy)video_to_video.py: VideoToVideoStrategy (deprecated — use FlexibleStrategy)
Key methods each strategy implements:
prepare_training_inputs()- Convert batch toModelInputswithModalityobjectscompute_loss()- Calculate training loss (velocity prediction, MSE with masking)
The strategy's config declares its data directories via get_data_sources() (single source of truth, used for both dataset wiring and existence validation).
src/ltx_trainer/model_loader.py - Model loading
Component loaders:
load_transformer()→LTXModelload_video_vae_encoder()→VideoEncoderload_video_vae_decoder()→VideoDecoderload_audio_vae_decoder()→AudioDecoderload_vocoder()→VocoderorVocoderWithBWE(auto-detected)load_text_encoder(gemma_model_path)→GemmaTextEncoder(pure Gemma LLM, no checkpoint needed)load_embeddings_processor(checkpoint_path)→EmbeddingsProcessor(feature extractor + connectors)load_model()→LtxModelComponents(convenience wrapper)
src/ltx_trainer/validation_runner.py - Conditioned validation sampling
- Manages the full validation lifecycle: embedding caching, media encoding, denoising, decoding
- Supports all validation condition types: first_frame, prefix, suffix, spatial_crop, mask, reference, video_to_audio, audio_to_video
- Handles frozen modality paths (sigma=0 for conditioning modality)
- Builds conditioning items using ltx-core's
VideoConditionByLatentIndex,VideoConditionByReferenceLatent,VideoConditionByMask - Optional side-by-side reference output for IC-LoRA validation
src/ltx_trainer/timestep_samplers.py - Flow matching timestep sampling
UniformTimestepSampler- Uniform sampling in[min, max]ShiftedLogitNormalTimestepSampler- Stretched shifted logit-normal distribution with:- Shift determined by sequence length (more noise at higher token counts)
- Percentile stretching for better
[0, 1]coverage - Uniform fallback (10% of samples) to prevent distribution collapse
- Reflection around
epsfor numerical stability near zero
src/ltx_trainer/gemma_8bit.py - 8-bit text encoder loading
Bypasses ltx-core's standard loading path to enable bitsandbytes 8-bit quantization of the Gemma backbone. Manually
constructs the GemmaTextEncoder with quantized model, feature extractor, and embeddings processor.
Data
src/ltx_trainer/datasets.py - Dataset handling
PrecomputedDatasetloads pre-computed VAE latents and text embeddings- Supports video latents, audio latents, text embeddings, reference video latents, reference audio latents, video masks, and audio masks
- Handles legacy patchified format
[seq_len, C]→ automatically unpatchifies to[C, F, H, W] DummyDatasetfor benchmarking and minimal testing
Common Development Tasks
Agent-Assisted Training
When a user asks to train, fine-tune, create a LoRA, or produce a custom LTX-2 model, use the repository skill at
.claude/skills/train-model. The skill is the orchestrator for dataset probing, mode selection, preprocessing,
training launch, monitoring, and post-train validation; it treats packages/ltx-trainer/docs/ as the source of truth.
Adding a New Configuration Parameter
- Add field to appropriate config class in
src/ltx_trainer/config.py - Add validator if needed
- Update ALL config files in
configs/ - Update
docs/configuration-reference.md
Implementing a New Training Strategy
The FlexibleStrategy now covers all use cases (T2V, T2A, I2V, V2V, A2A, AV2AV, inpainting, outpainting, extension, A2V, V2A, IC-LoRA) through
configuration alone. A new strategy is only needed for fundamentally different training paradigms that cannot be
expressed via ModalityConfig + ConditionConfig combinations.
If you do need a new strategy:
- Create new file in
src/ltx_trainer/training_strategies/ - Create config class inheriting
TrainingStrategyConfigBaseand implementget_data_sources() - Create strategy class inheriting
TrainingStrategy - Implement:
prepare_training_inputs(),compute_loss() - Add to
__init__.py: import, add toTrainingStrategyConfigunion, update factory - Add discriminator tag to config.py's
TrainingStrategyConfig - Create example config file in
configs/
Working with Modalities
from dataclasses import replace
from ltx_core.model.transformer.modality import Modality
# Create modality — all fields except enabled and masks are required
video = Modality(
enabled=True,
latent=latents, # [B, seq_len, 128]
sigma=sigma, # [B,] — the per-batch noise level
timesteps=timesteps, # [B, seq_len] — per-token (sigma * denoise_mask)
positions=positions, # [B, 3, seq_len, 2]
context=context, # text embeddings from embeddings_processor
context_mask=None,
)
# Update (immutable — must use replace)
video = replace(video, latent=new_latent, sigma=new_sigma, timesteps=new_timesteps)
# Disable a modality
audio = replace(audio, enabled=False)
Working with the Text Encoder
# Full forward pass (used for validation — runs all 3 blocks)
video_embeds, audio_embeds, attention_mask = text_encoder(prompt)
# Precompute features (used in process_captions.py — runs blocks 1+2 only)
video_features, audio_features, attention_mask = text_encoder.precompute(prompt, padding_side="left")
# Apply connectors during training (block 3 only)
additive_mask = text_encoder._convert_to_additive_mask(attention_mask, video_features.dtype)
video_embeds, audio_embeds, binary_mask = text_encoder.embeddings_processor.create_embeddings(
video_features, audio_features, additive_mask
)
Debugging Tips
Training Issues:
- Check logs first (rich logger provides context)
- GPU memory: Look for OOM errors, enable
enable_gradient_checkpointing: true - Distributed training: Check
accelerator.stateand device placement
Model Loading:
- Ensure
model_pathpoints to a local.safetensorsfile - Ensure
text_encoder_pathpoints to a Gemma model directory - URLs are NOT supported for model paths
- For 8-bit loading: ensure
bitsandbytesis installed
Configuration:
- Validation errors: Check validators in
config.py - Unknown fields: Config uses
extra="forbid"— all fields must be defined - FlexibleStrategy requires at least one modality with
is_generated: true - Audio modality cannot use
first_frameorspatial_cropconditions
Precomputed Data:
- Legacy data (
prompt_embeds) works via backward-compat in_training_step() - New data (
video_prompt_embeds+audio_prompt_embeds) is the expected format - Latents must be in
[C, F, H, W]format (legacy[seq_len, C]is auto-converted)
Key Constraints
Frame Requirements
Frames must satisfy frames % 8 == 1:
- ✅ Valid: 1, 9, 17, 25, 33, 41, 49, 57, 65, 73, 81, 89, 97, 121
- ❌ Invalid: 24, 32, 48, 64, 100
Resolution Requirements
Width and height must be divisible by 32.
Model Paths
- Must be local paths (URLs not supported)
model_path: Path to.safetensorscheckpointtext_encoder_path: Path to Gemma model directory
Platform Requirements
- Linux required (uses
tritonwhich is Linux-only) - CUDA GPU with 32GB+ VRAM recommended
Reference: ltx-core Key Components
packages/ltx-core/src/ltx_core/
├── model/
│ ├── transformer/
│ │ ├── model.py # LTXModel (diffusion transformer)
│ │ ├── modality.py # Modality dataclass
│ │ ├── transformer.py # BasicAVTransformerBlock
│ │ ├── transformer_args.py # TransformerArgsPreprocessor (sigma → prompt AdaLN)
│ │ ├── model_configurator.py # LTXModelConfigurator (version-aware)
│ │ └── timestep_embedding.py # Timestep/sigma embedding
│ ├── video_vae/
│ │ ├── video_vae.py # VideoEncoder, VideoDecoder
│ │ └── model_configurator.py # VideoEncoderConfigurator, VideoDecoderConfigurator
│ ├── audio_vae/
│ │ ├── audio_vae.py # AudioEncoder, AudioDecoder
│ │ └── vocoder.py # Vocoder, VocoderWithBWE (output_sampling_rate)
│ └── common/ # Shared model components
├── text_encoders/gemma/
│ ├── __init__.py # Exports: GemmaTextEncoder, GemmaTextEncoderConfigurator,
│ │ # AV_GEMMA_TEXT_ENCODER_KEY_OPS, GEMMA_MODEL_OPS,
│ │ # module_ops_from_gemma_root
│ ├── encoders/
│ │ ├── base_encoder.py # GemmaTextEncoder (unified 3-block pipeline)
│ │ └── encoder_configurator.py # GemmaTextEncoderConfigurator, _create_feature_extractor
│ ├── feature_extractor.py # FeatureExtractorV1 (19B), FeatureExtractorV2 (22B)
│ ├── embeddings_connector.py # Embeddings1DConnector, Embeddings1DConnectorConfigurator,
│ │ # AudioEmbeddings1DConnectorConfigurator
│ ├── embeddings_processor.py # EmbeddingsProcessor (wraps video + audio connectors)
│ └── tokenizer.py # LTXVGemmaTokenizer
├── components/
│ ├── schedulers.py # LTX2Scheduler
│ ├── diffusion_steps.py # EulerDiffusionStep
│ ├── guiders.py # CFGGuider, STGGuider
│ └── patchifiers.py # VideoLatentPatchifier, AudioPatchifier
├── conditioning/ # ConditioningItem, mask_utils, types
├── tools.py # VideoLatentTools, AudioLatentTools
├── loader/
│ ├── single_gpu_model_builder.py # SingleGPUModelBuilder
│ ├── sft_loader.py # SafetensorsModelStateDictLoader
│ └── sd_ops.py # Key remapping (SDOps)
└── types.py # SpatioTemporalScaleFactors, VideoLatentShape, AudioLatentShape
Xet Storage Details
- Size:
- 31.9 kB
- Xet hash:
- 4026ab01ec08198527f48ec6533931a5ef3e8cb58a7296b156cac8109807fb46
Xet efficiently stores files, intelligently splitting them into unique chunks and accelerating uploads and downloads. More info.