---
library_name: mlx
pipeline_tag: text-to-speech
tags:
- indextts2
- mlx-indextts
- voice-cloning
- fp16
- zh
- en
- text-to-speech
- apple-silicon
- mlx
license: mit
---

# mlx-indextts2-standard-fp16

This is a converted MLX IndexTTS2 model for Apple Silicon inference with [`solar2ain/mlx-indextts`](https://github.com/solar2ain/mlx-indextts).

It was prepared for the local `/Users/vanch/index-tts` IndexTTS2 optimization project, where the goal was stable Vietnamese and multilingual TTS on an M3 Max Mac without PyTorch MPS memory crashes.

## Variant

- Profile: **Standard multilingual**
- Precision / quantization: **fp16**
- Approximate local size: **2.0GB**
- Source checkpoint directory during conversion: `/Users/vanch/index-tts/checkpoints`
- Conversion detail: derived locally by casting all floating-point MLX safetensors weights from the standard fp32 conversion to `float16`; this is not an upstream CLI quantization mode.
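
As a rough illustration of what a cast to `float16` does to a single weight value, the sketch below round-trips a few numbers through half and single precision using Python's standard `struct` module. It is not the conversion script used for this repository and does not touch MLX; the helper names and sample values are made up for the example.

```python
import struct


def to_float16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def to_float32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Bytes needed to store one value in each format.
BYTES_FP16 = struct.calcsize("<e")
BYTES_FP32 = struct.calcsize("<f")

if __name__ == "__main__":
    print(f"bytes per value: fp32={BYTES_FP32}, fp16={BYTES_FP16}")
    # Arbitrary sample values, chosen only to show rounding error.
    for w in (0.1, 1.0, 1234.5678):
        err32 = abs(to_float32(w) - w)
        err16 = abs(to_float16(w) - w)
        print(f"{w}: fp32 error={err32:.3e}, fp16 error={err16:.3e}")
```

Each value takes half the storage in fp16, at the cost of a larger rounding error per weight.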

## Expected Files

The repository root is a ready-to-use MLX IndexTTS2 model directory:

- `gpt.safetensors`
- `s2mel.safetensors`
- `bigvgan.safetensors`
- `vq2emb.safetensors`
- `tokenizer.model`
- `config.yaml`
- `config.json`
- `feat1.pt`
- `feat2.pt`
- `wav2vec2bert_stats.pt`
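
A small check like the following can confirm that a downloaded directory contains every file listed above before running inference. It is a minimal sketch using only the Python standard library; the function name and the example path are illustrative, and the check looks only at file presence, not file contents.

```python
from pathlib import Path

# File names copied from the list above.
EXPECTED_FILES = [
    "gpt.safetensors",
    "s2mel.safetensors",
    "bigvgan.safetensors",
    "vq2emb.safetensors",
    "tokenizer.model",
    "config.yaml",
    "config.json",
    "feat1.pt",
    "feat2.pt",
    "wav2vec2bert_stats.pt",
]


def missing_files(model_dir):
    """Return the expected file names that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Example path only; point this at wherever the model was downloaded.
    missing = missing_files("models/mlx-indextts2-standard-fp16")
    print("missing:", missing or "none")
```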

## Usage

Install and use `mlx-indextts`:

```bash
git clone https://github.com/solar2ain/mlx-indextts.git
cd mlx-indextts
uv sync --extra convert --extra v2

huggingface-cli download vanch007/mlx-indextts2-standard-fp16 \
  --local-dir models/mlx-indextts2-standard-fp16 \
  --local-dir-use-symlinks False

uv run mlx-indextts generate \
  -m models/mlx-indextts2-standard-fp16 \
  -r /path/to/reference_or_speaker.npz \
  -t "Your text here" \
  -o output.wav \
  --memory-limit 24 \
  --diffusion-steps 16
```

For repeated generation, precompute speaker conditioning first:

```bash
uv run mlx-indextts speaker \
  -m models/mlx-indextts2-standard-fp16 \
  -r /path/to/reference.wav \
  -o speaker.npz \
  --memory-limit 24
```

## Benchmark

Benchmarked on a 128GB unified-memory M3 Max Mac using:

- `mlx-indextts` from `solar2ain/mlx-indextts`
- precomputed `.npz` speaker conditioning
- `memory_limit=24GB`
- `diffusion_steps=16`
- emotion=`calm`, `emo_alpha=0.6`
- same text set across fp32 / fp16 / 8bit / optimized PyTorch MPS

RTF (real-time factor, generation time divided by audio duration); lower is faster:

| Case | fp32 MLX RTF | fp16 MLX RTF | 8bit MLX RTF | PyTorch MPS RTF |
|---|---:|---:|---:|---:|
| zh short | 1.127 | 1.538 | 0.966 | 1.446 |
| zh long | 1.232 | 1.584 | 1.035 | 1.699 |
| en short | 1.157 | 1.462 | 0.914 | 2.192 |
| en long | 1.193 | 1.511 | 0.956 | 1.783 |

Summary from the local comparison:

- 8bit was the fastest MLX route in this test set.
- fp16 saved space but was slower than fp32 for the standard profile.
- For Vietnamese (not included in the table above), fp16 was slightly faster than fp32, and 8bit was fastest.
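
To make the comparison easy to reproduce, the sketch below encodes the table values and derives the fastest route and the fp16/fp32 ratio per case. It assumes the usual definition of RTF as generation time divided by the duration of the generated audio. The `rtf` helper and the dictionary layout are illustrative, not part of `mlx-indextts`.

```python
def rtf(synthesis_seconds, audio_seconds):
    """Real-time factor: time spent generating divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# RTF values copied from the benchmark table above.
RESULTS = {
    "zh short": {"fp32": 1.127, "fp16": 1.538, "8bit": 0.966, "mps": 1.446},
    "zh long": {"fp32": 1.232, "fp16": 1.584, "8bit": 1.035, "mps": 1.699},
    "en short": {"fp32": 1.157, "fp16": 1.462, "8bit": 0.914, "mps": 2.192},
    "en long": {"fp32": 1.193, "fp16": 1.511, "8bit": 0.956, "mps": 1.783},
}


def fastest(case):
    """Return the route with the lowest RTF for the given case."""
    row = RESULTS[case]
    return min(row, key=row.get)


if __name__ == "__main__":
    for case, row in RESULTS.items():
        ratio = row["fp16"] / row["fp32"]
        print(f"{case}: fastest={fastest(case)}, fp16/fp32={ratio:.2f}")
```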

## ASR Validation

ASR validation with local `mlx_whisper` + `whisper-large-v3-turbo` found no empty audio, wrong-language output, or obvious missing sentences. Chinese long-form ASR showed a minor `她/他` homophone difference; English long-form 8-bit ASR showed a minor tense difference.

ASR was used only as an automated sanity check. Final production selection should still include human listening, especially for long-form Vietnamese narration.
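
A sanity check of this kind can be automated by comparing each ASR transcript against the input text. The sketch below assumes the transcript has already been produced elsewhere and only scores character-level similarity with the standard library's `difflib`. The `0.8` threshold and the function names are hypothetical choices, not values used in the validation described above.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, apply NFKC, and drop punctuation and whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in text if ch.isalnum())


def similarity(expected, transcript):
    """Character-level similarity in [0, 1] between text and transcript."""
    return SequenceMatcher(
        None, normalize(expected), normalize(transcript)
    ).ratio()


def check(expected, transcript, threshold=0.8):
    """Classify a transcript as 'empty', 'ok', or 'suspect'."""
    if not normalize(transcript):
        return "empty"
    if similarity(expected, transcript) >= threshold:
        return "ok"
    return "suspect"


if __name__ == "__main__":
    # A single homophone substitution still scores as a close match.
    print(check("她今天去了学校。", "他今天去了学校"))
```

A score like this flags empty or badly truncated output, but it cannot judge pronunciation or prosody, which is why human listening remains necessary.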

## Provenance and Scope

This is an MLX conversion for local Apple Silicon inference, not the original PyTorch release. The original implementation and model family are associated with IndexTTS / IndexTTS2; the MLX runtime used here is `solar2ain/mlx-indextts`.

The benchmark numbers are environment-specific and should be treated as local M3 Max results, not universal performance guarantees.