---
library_name: mlx
pipeline_tag: text-to-speech
tags:
- indextts2
- mlx-indextts
- voice-cloning
- fp16
- zh
- en
- text-to-speech
- apple-silicon
- mlx
license: mit
---

# mlx-indextts2-standard-fp16

This is a converted MLX IndexTTS2 model for Apple Silicon inference with [`solar2ain/mlx-indextts`](https://github.com/solar2ain/mlx-indextts).

It was prepared for the local `/Users/vanch/index-tts` IndexTTS2 optimization project, where the goal was stable Vietnamese and multilingual TTS on an M3 Max Mac without PyTorch MPS memory crashes.

## Variant

- Profile: **Standard multilingual**
- Precision / quantization: **fp16**
- Approximate local size: **2.0 GB**
- Source checkpoint directory during conversion: `/Users/vanch/index-tts/checkpoints`
- Note: All floating-point MLX weights were cast to fp16 from the standard fp32 conversion.
- Conversion detail: Derived locally by casting the floating-point MLX safetensors to `float16`; this is not an upstream CLI quantization mode.

## Expected Files

The repository root is a ready-to-use MLX IndexTTS2 model directory:

- `gpt.safetensors`
- `s2mel.safetensors`
- `bigvgan.safetensors`
- `vq2emb.safetensors`
- `tokenizer.model`
- `config.yaml`
- `config.json`
- `feat1.pt`
- `feat2.pt`
- `wav2vec2bert_stats.pt`

## Usage

Install and use `mlx-indextts`:

```bash
git clone https://github.com/solar2ain/mlx-indextts.git
cd mlx-indextts
uv sync --extra convert --extra v2

huggingface-cli download vanch007/mlx-indextts2-standard-fp16 \
  --local-dir models/mlx-indextts2-standard-fp16 \
  --local-dir-use-symlinks False

uv run mlx-indextts generate \
  -m models/mlx-indextts2-standard-fp16 \
  -r /path/to/reference_or_speaker.npz \
  -t "Your text here" \
  -o output.wav \
  --memory-limit 24 \
  --diffusion-steps 16
```

For repeated generation, precompute speaker conditioning first:

```bash
uv run mlx-indextts speaker \
  -m models/mlx-indextts2-standard-fp16 \
  -r /path/to/reference.wav \
  -o speaker.npz \
  --memory-limit 24
```

## Benchmark

Benchmarked on a 128GB unified-memory M3 Max Mac using:

- `mlx-indextts` from `solar2ain/mlx-indextts`
- precomputed `.npz` speaker conditioning
- `memory_limit=24GB`
- `diffusion_steps=16`
- emotion=`calm`, `emo_alpha=0.6`
- same text set across fp32 / fp16 / 8bit / optimized PyTorch MPS

RTF is the real-time factor; lower is faster:

| Case | fp32 MLX RTF | fp16 MLX RTF | 8bit MLX RTF | PyTorch MPS RTF |
|---|---:|---:|---:|---:|
| zh short | 1.127 | 1.538 | 0.966 | 1.446 |
| zh long | 1.232 | 1.584 | 1.035 | 1.699 |
| en short | 1.157 | 1.462 | 0.914 | 2.192 |
| en long | 1.193 | 1.511 | 0.956 | 1.783 |

Summary from the local comparison:

- 8bit was the fastest MLX route in this test set.
- fp16 saved space but was slower than fp32 for the standard profile.
- Vietnamese fp16 was slightly faster than Vietnamese fp32, but Vietnamese 8bit was fastest.

## ASR Validation

ASR validation with local `mlx_whisper` + `whisper-large-v3-turbo` found no empty audio, wrong-language output, or obvious missing sentences. Chinese long-form ASR showed a minor `她/他` homophone difference; English long-form 8bit ASR showed a minor tense difference.

ASR was used only as an automated sanity check. Final production selection should still include human listening, especially for long-form Vietnamese narration.

## Provenance and Scope

This is an MLX conversion for local Apple Silicon inference, not the original PyTorch release. The original implementation and model family are associated with IndexTTS / IndexTTS2; the MLX runtime used here is `solar2ain/mlx-indextts`.

The benchmark numbers are environment-specific and should be treated as local M3 Max results, not universal performance guarantees.
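
Because the benchmark numbers are machine-specific, it may be useful to measure RTF on your own hardware. The sketch below assumes the conventional definition, RTF = synthesis wall-clock time divided by the duration of the generated audio; how the table above was measured is not documented here, so results may not be directly comparable. The helper names (`wav_duration_seconds`, `real_time_factor`, `measure_rtf`) are hypothetical and not part of `mlx-indextts`, and the code assumes the output is a PCM WAV file readable by Python's standard `wave` module.

```python
import subprocess
import time
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def real_time_factor(elapsed_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis wall-clock time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return elapsed_seconds / audio_seconds


def measure_rtf(command: list[str], output_wav: str) -> float:
    """Run a synthesis command, then compute RTF from the WAV it wrote.

    Timing the whole process includes startup and model loading, so this
    will read higher than a measurement taken inside a warm process.
    """
    start = time.perf_counter()
    subprocess.run(command, check=True)
    elapsed = time.perf_counter() - start
    return real_time_factor(elapsed, wav_duration_seconds(output_wav))


# Example (not executed here), reusing the flags from the Usage section:
# rtf = measure_rtf(
#     ["uv", "run", "mlx-indextts", "generate",
#      "-m", "models/mlx-indextts2-standard-fp16",
#      "-r", "speaker.npz", "-t", "Your text here", "-o", "output.wav",
#      "--memory-limit", "24", "--diffusion-steps", "16"],
#     "output.wav",
# )
```

Running the same text set several times and averaging should give a steadier figure than a single run, since the first invocation also pays for model loading.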