metadata
library_name: mlx
pipeline_tag: text-to-speech
tags:
  - indextts2
  - mlx-indextts
  - voice-cloning
  - fp16
  - zh
  - en
  - text-to-speech
  - apple-silicon
  - mlx
license: mit

mlx-indextts2-standard-fp16

This is an IndexTTS2 model converted to MLX for inference on Apple Silicon with solar2ain/mlx-indextts.

It was prepared for a local IndexTTS2 optimization project (/Users/vanch/index-tts) whose goal was stable Vietnamese and multilingual TTS on an M3 Max Mac without PyTorch MPS memory crashes.

Variant

  • Profile: Standard multilingual
  • Precision / quantization: fp16
  • Approximate local size: 2.0 GB
  • Source checkpoint directory during conversion: /Users/vanch/index-tts/checkpoints
  • Note: All floating-point MLX weights were cast to fp16 from the standard fp32 conversion.
  • Conversion detail: Derived locally by casting the floating-point tensors in the MLX safetensors files to float16; this is not an upstream CLI quantization mode.

Expected Files

The repository root is a ready-to-use MLX IndexTTS2 model directory:

  • gpt.safetensors
  • s2mel.safetensors
  • bigvgan.safetensors
  • vq2emb.safetensors
  • tokenizer.model
  • config.yaml
  • config.json
  • feat1.pt
  • feat2.pt
  • wav2vec2bert_stats.pt
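
After downloading, it can help to confirm that the directory contains every file listed above before running inference. The following is a minimal sketch; `check_model_dir` is a hypothetical helper name, not part of mlx-indextts, and it only checks that the files exist, not that their contents are valid.

```shell
# Hypothetical helper (not part of mlx-indextts): report which of the
# expected files are missing from a model directory.
# Returns 0 if every file is present, 1 otherwise.
check_model_dir() {
  dir="$1"
  missing=0
  for f in gpt.safetensors s2mel.safetensors bigvgan.safetensors \
           vq2emb.safetensors tokenizer.model config.yaml config.json \
           feat1.pt feat2.pt wav2vec2bert_stats.pt; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `check_model_dir models/mlx-indextts2-standard-fp16` prints one line per missing file to stderr and exits non-zero if anything is absent.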

Usage

Install and use mlx-indextts:

git clone https://github.com/solar2ain/mlx-indextts.git
cd mlx-indextts
uv sync --extra convert --extra v2

huggingface-cli download vanch007/mlx-indextts2-standard-fp16 \
  --local-dir models/mlx-indextts2-standard-fp16 \
  --local-dir-use-symlinks False

uv run mlx-indextts generate \
  -m models/mlx-indextts2-standard-fp16 \
  -r /path/to/reference_or_speaker.npz \
  -t "Your text here" \
  -o output.wav \
  --memory-limit 24 \
  --diffusion-steps 16

For repeated generation, precompute speaker conditioning first:

uv run mlx-indextts speaker \
  -m models/mlx-indextts2-standard-fp16 \
  -r /path/to/reference.wav \
  -o speaker.npz \
  --memory-limit 24
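
Once a speaker file exists, repeated generation can reuse it for every line of a script. The sketch below assumes bash and one utterance per line in a text file; `batch_generate` and the `TTS_CMD` variable are hypothetical names introduced here for illustration, and only the `generate` flags already shown above are used.

```shell
# Assumption: TTS_CMD is a hypothetical override so the command can be
# swapped out (for example with `echo` for a dry run).
TTS_CMD="${TTS_CMD:-uv run mlx-indextts}"

# Hypothetical helper: synthesize one WAV per non-empty line of a text file,
# reusing a precomputed speaker .npz for every call.
# Usage: batch_generate MODEL_DIR SPEAKER_NPZ TEXT_FILE OUT_DIR
batch_generate() {
  model_dir="$1"; speaker_npz="$2"; text_file="$3"; out_dir="$4"
  mkdir -p "$out_dir" || return 1
  i=0
  # `|| [ -n "$line" ]` keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "$line" ]; do
    [ -z "$line" ] && continue
    i=$((i + 1))
    out="$out_dir/$(printf '%03d' "$i").wav"
    # stdin is redirected so the command cannot consume the text file.
    $TTS_CMD generate -m "$model_dir" -r "$speaker_npz" -t "$line" \
      -o "$out" --memory-limit 24 --diffusion-steps 16 < /dev/null || return 1
  done < "$text_file"
}
```

For example, `batch_generate models/mlx-indextts2-standard-fp16 speaker.npz lines.txt out` would write out/001.wav, out/002.wav, and so on. Setting `TTS_CMD=echo` first prints the commands instead of running them.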

Benchmark

Benchmarked on a 128GB unified-memory M3 Max Mac using:

  • mlx-indextts from solar2ain/mlx-indextts
  • precomputed .npz speaker conditioning
  • memory_limit=24GB
  • diffusion_steps=16
  • emotion=calm, emo_alpha=0.6
  • same text set across fp32 / fp16 / 8bit / optimized PyTorch MPS

RTF (real-time factor); lower is faster:

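| Case     | fp32 MLX RTF | fp16 MLX RTF | 8bit MLX RTF | PyTorch MPS RTF |
|----------|--------------|--------------|--------------|-----------------|
| zh short | 1.127        | 1.538        | 0.966        | 1.446           |
| zh long  | 1.232        | 1.584        | 1.035        | 1.699           |
| en short | 1.157        | 1.462        | 0.914        | 2.192           |
| en long  | 1.193        | 1.511        | 0.956        | 1.783           |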
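
To reproduce figures like these on other hardware, the real-time factor can be computed from two measurements. The sketch below assumes the usual definition, generation time divided by the duration of the generated audio; the model card does not state how its numbers were measured, and `rtf` is a hypothetical helper name.

```shell
# Assumption: RTF = generation_seconds / audio_seconds (the usual definition;
# the benchmark's exact measurement method is not documented here).
# Usage: rtf GENERATION_SECONDS AUDIO_SECONDS
rtf() {
  awk -v gen="$1" -v dur="$2" 'BEGIN {
    if (dur <= 0) {
      print "audio duration must be positive" > "/dev/stderr"
      exit 1
    }
    printf "%.3f\n", gen / dur
  }'
}
```

Under this definition, `rtf 9.66 10` prints 0.966: the audio took slightly less time to generate than it takes to play back.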

Summary from the local comparison:

  • 8bit was the fastest MLX route in this test set.
  • fp16 saved space but was slower than fp32 for the standard profile.
  • In the Vietnamese runs (not included in the table above), fp16 was slightly faster than fp32, but 8bit was fastest.

ASR Validation

ASR validation with local mlx_whisper + whisper-large-v3-turbo found no empty audio, wrong-language output, or obvious missing sentences. Chinese long-form ASR showed a minor 她/他 homophone difference; English long-form 8-bit ASR showed a minor tense difference.

ASR was used only as an automated sanity check. Final production selection should still include human listening, especially for long-form Vietnamese narration.

Provenance and Scope

This is an MLX conversion for local Apple Silicon inference, not the original PyTorch release. The original implementation and model family are associated with IndexTTS / IndexTTS2; the MLX runtime used here is solar2ain/mlx-indextts.

The benchmark numbers are environment-specific and should be treated as local M3 Max results, not universal performance guarantees.