---
library_name: mlx
pipeline_tag: text-to-speech
tags:
  - indextts2
  - mlx-indextts
  - voice-cloning
  - fp16
  - zh
  - en
  - text-to-speech
  - apple-silicon
  - mlx
license: mit
---

mlx-indextts2-standard-fp16

This is a converted MLX IndexTTS2 model for Apple Silicon inference with solar2ain/mlx-indextts.

It was prepared as part of a local IndexTTS2 optimization project (working directory /Users/vanch/index-tts). The goal was stable Vietnamese and multilingual TTS on an M3 Max Mac while avoiding PyTorch MPS memory crashes.

Variant

Profile: Standard multilingual
Precision / quantization: fp16
Approximate local size: 2.0 GB
Source checkpoint directory during conversion: /Users/vanch/index-tts/checkpoints
Note: All floating-point MLX weights were cast to fp16 from the standard fp32 conversion.
Conversion detail: Derived locally by casting the floating-point tensors in the MLX safetensors files to float16; this is not an upstream CLI quantization mode.

Expected Files

The repository root is a ready-to-use MLX IndexTTS2 model directory:

gpt.safetensors
s2mel.safetensors
bigvgan.safetensors
vq2emb.safetensors
tokenizer.model
config.yaml
config.json
feat1.pt
feat2.pt
wav2vec2bert_stats.pt
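
After downloading, it can be worth confirming that the directory actually contains every file listed above. The following is a minimal sketch in POSIX shell; `check_model_dir` is a hypothetical helper written for this card, not part of mlx-indextts, and it only checks that the files exist, not that their contents are valid.

```shell
# Sketch: report which of the expected files are missing from a model directory.
# check_model_dir is a hypothetical helper, not an mlx-indextts command.
# The body runs in a subshell "( ... )" so its variables stay local.
check_model_dir() (
  dir="$1"
  missing=0
  for f in gpt.safetensors s2mel.safetensors bigvgan.safetensors \
           vq2emb.safetensors tokenizer.model config.yaml config.json \
           feat1.pt feat2.pt wav2vec2bert_stats.pt; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=$((missing + 1))
    fi
  done
  # Succeed only when nothing is missing.
  [ "$missing" -eq 0 ]
)
```

A typical call would be `check_model_dir models/mlx-indextts2-standard-fp16 && echo "model directory looks complete"`.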

Usage

Install and use mlx-indextts:

git clone https://github.com/solar2ain/mlx-indextts.git
cd mlx-indextts
uv sync --extra convert --extra v2

huggingface-cli download vanch007/mlx-indextts2-standard-fp16 \
  --local-dir models/mlx-indextts2-standard-fp16 \
  --local-dir-use-symlinks False

uv run mlx-indextts generate \
  -m models/mlx-indextts2-standard-fp16 \
  -r /path/to/reference_or_speaker.npz \
  -t "Your text here" \
  -o output.wav \
  --memory-limit 24 \
  --diffusion-steps 16

For repeated generation, precompute speaker conditioning first:

uv run mlx-indextts speaker \
  -m models/mlx-indextts2-standard-fp16 \
  -r /path/to/reference.wav \
  -o speaker.npz \
  --memory-limit 24
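
With a precomputed `speaker.npz`, repeated generation can be scripted as a loop. The sketch below assumes a plain text file with one utterance per line and reuses only the `generate` flags shown above; `batch_generate`, the input file, and the numbered output naming are hypothetical choices for illustration.

```shell
# Sketch: synthesize one WAV per non-empty line of a text file, reusing speaker.npz.
# batch_generate and the 001.wav, 002.wav, ... naming are hypothetical.
batch_generate() (
  model="$1"; speaker="$2"; textfile="$3"; outdir="$4"
  mkdir -p "$outdir"
  n=0
  # "|| [ -n "$line" ]" also handles a final line without a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue        # skip blank lines
    n=$((n + 1))
    uv run mlx-indextts generate \
      -m "$model" \
      -r "$speaker" \
      -t "$line" \
      -o "$outdir/$(printf '%03d' "$n").wav" \
      --memory-limit 24 \
      --diffusion-steps 16 </dev/null
  done < "$textfile"
)
```

For example: `batch_generate models/mlx-indextts2-standard-fp16 speaker.npz lines.txt out`. The `</dev/null` redirect keeps the CLI from consuming the remaining lines of the text file on standard input.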

Benchmark

Benchmarked on a 128GB unified-memory M3 Max Mac using:

mlx-indextts from solar2ain/mlx-indextts
precomputed .npz speaker conditioning
memory_limit=24GB
diffusion_steps=16
emotion=calm, emo_alpha=0.6
same text set across fp32 / fp16 / 8bit / optimized PyTorch MPS

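RTF (real-time factor) is synthesis time divided by the duration of the generated audio, so lower is faster and values below 1.0 are faster than real time: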

Case	fp32 MLX RTF	fp16 MLX RTF	8bit MLX RTF	PyTorch MPS RTF
zh short	1.127	1.538	0.966	1.446
zh long	1.232	1.584	1.035	1.699
en short	1.157	1.462	0.914	2.192
en long	1.193	1.511	0.956	1.783
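
To reproduce this kind of figure for your own runs, the sketch below assumes the usual definition of RTF: wall-clock synthesis seconds divided by seconds of audio produced. `rtf` is a hypothetical helper, and how the timing and duration are measured is left to the caller; the exact measurement method used for the table is not documented here.

```shell
# Sketch: real-time factor = wall-clock synthesis seconds / seconds of audio produced.
# rtf is a hypothetical helper; pass both values in seconds.
rtf() {
  awk -v t="$1" -v d="$2" 'BEGIN {
    if (d <= 0) {
      print "audio duration must be positive" > "/dev/stderr"
      exit 1
    }
    printf "%.3f\n", t / d
  }'
}
```

For instance, `rtf 12 10` describes a run that took 12 s to produce 10 s of audio. Read the other way, an RTF of 1.538 means roughly 15.4 s of compute for every 10 s of audio.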

Summary from the local comparison:

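8bit was the fastest MLX route in all four cases (RTF 0.914 to 1.035).
fp16 saved space but was slower than fp32 in every case for the standard profile (RTF 1.462 to 1.584 versus 1.127 to 1.232).
Vietnamese results are not included in the table above: there, fp16 was slightly faster than fp32, but 8bit was again the fastest.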

ASR Validation

ASR validation with local mlx_whisper + whisper-large-v3-turbo found no empty audio, wrong-language output, or obviously missing sentences. Chinese long-form ASR showed a minor 她/他 homophone difference; English long-form 8bit ASR showed a minor tense difference.

ASR was used only as an automated sanity check. Final production selection should still include human listening, especially for long-form Vietnamese narration.

Provenance and Scope

This is an MLX conversion for local Apple Silicon inference, not the original PyTorch release. The original implementation and model family are associated with IndexTTS / IndexTTS2; the MLX runtime used here is solar2ain/mlx-indextts.

The benchmark numbers are environment-specific and should be treated as local M3 Max results, not universal performance guarantees.