Add files using upload-large-folder tool

Files changed (11)

README.md +115 -0
bigvgan.safetensors +3 -0
config.json +125 -0
config.yaml +120 -0
feat1.pt +3 -0
feat2.pt +3 -0
gpt.safetensors +3 -0
s2mel.safetensors +3 -0
tokenizer.model +3 -0
vq2emb.safetensors +3 -0
wav2vec2bert_stats.pt +3 -0

README.md ADDED

	@@ -0,0 +1,115 @@

+---
+library_name: mlx
+pipeline_tag: text-to-speech
+tags:
+- indextts2
+- mlx-indextts
+- voice-cloning
+- fp16
+- zh
+- en
+- text-to-speech
+- apple-silicon
+- mlx
+license: mit
+---
+# mlx-indextts2-standard-fp16
+This is a converted MLX IndexTTS2 model for Apple Silicon inference with [`solar2ain/mlx-indextts`](https://github.com/solar2ain/mlx-indextts).
+It was prepared for the local `/Users/vanch/index-tts` IndexTTS2 optimization project, where the goal was stable Vietnamese and multilingual TTS on an M3 Max Mac without PyTorch MPS memory crashes.
+## Variant
+- Profile: **Standard multilingual**
+- Precision / quantization: **fp16**
+- Approx local size: **2.0GB**
+- Source checkpoint directory during conversion: `/Users/vanch/index-tts/checkpoints`
+- Note: All floating-point MLX weights were cast to fp16 from the standard fp32 conversion.
+- Conversion detail: Derived locally by casting the floating-point MLX safetensors to `float16`; this is not an upstream CLI quantization mode.
+## Expected Files
+The repository root is a ready-to-use MLX IndexTTS2 model directory:
+- `gpt.safetensors`
+- `s2mel.safetensors`
+- `bigvgan.safetensors`
+- `vq2emb.safetensors`
+- `tokenizer.model`
+- `config.yaml`
+- `config.json`
+- `feat1.pt`
+- `feat2.pt`
+- `wav2vec2bert_stats.pt`
+## Usage
+Install and use `mlx-indextts`:
+```bash
+git clone https://github.com/solar2ain/mlx-indextts.git
+cd mlx-indextts
+uv sync --extra convert --extra v2
+huggingface-cli download vanch007/mlx-indextts2-standard-fp16 \
+  --local-dir models/mlx-indextts2-standard-fp16 \
+  --local-dir-use-symlinks False
+uv run mlx-indextts generate \
+  -m models/mlx-indextts2-standard-fp16 \
+  -r /path/to/reference_or_speaker.npz \
+  -t "Your text here" \
+  -o output.wav \
+  --memory-limit 24 \
+  --diffusion-steps 16
+```
+For repeated generation, precompute speaker conditioning first:
+```bash
+uv run mlx-indextts speaker \
+  -m models/mlx-indextts2-standard-fp16 \
+  -r /path/to/reference.wav \
+  -o speaker.npz \
+  --memory-limit 24
+```
+## Benchmark
+Benchmarked on a 128GB unified-memory M3 Max Mac using:
+- `mlx-indextts` from `solar2ain/mlx-indextts`
+- precomputed `.npz` speaker conditioning
+- `memory_limit=24GB`
+- `diffusion_steps=16`
+- emotion=`calm`, `emo_alpha=0.6`
+- same text set across fp32 / fp16 / 8bit / optimized PyTorch MPS
+RTF is the real-time factor (generation time divided by the duration of the generated audio); lower is faster:
+| Case | fp32 MLX RTF | fp16 MLX RTF | 8bit MLX RTF | PyTorch MPS RTF |
+|---|---:|---:|---:|---:|
+| zh short | 1.127 | 1.538 | 0.966 | 1.446 |
+| zh long | 1.232 | 1.584 | 1.035 | 1.699 |
+| en short | 1.157 | 1.462 | 0.914 | 2.192 |
+| en long | 1.193 | 1.511 | 0.956 | 1.783 |
+Summary from the local comparison:
+- 8bit was the fastest MLX route in this test set.
+- fp16 saved space but was slower than fp32 for the standard profile.
+- For the Vietnamese variant (not shown in the table above), fp16 was slightly faster than fp32, and 8bit was fastest.
+## ASR Validation
+ASR validation with local `mlx_whisper` + `whisper-large-v3-turbo` found no empty audio, wrong-language output, or obvious missing sentences. Chinese long-form ASR showed a minor `她/他` homophone difference; English long-form 8-bit ASR showed a minor tense difference.
+ASR was used only as an automated sanity check. Final production selection should still include human listening, especially for long-form Vietnamese narration.
+## Provenance and Scope
+This is an MLX conversion for local Apple Silicon inference, not the original PyTorch release. The original implementation and model family are associated with IndexTTS / IndexTTS2; the MLX runtime used here is `solar2ain/mlx-indextts`.
+The benchmark numbers are environment-specific and should be treated as local M3 Max results, not universal performance guarantees.
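
The README's summary bullets can be checked against its own table. The following is a minimal sketch, assuming RTF follows the usual convention of synthesis wall-clock time divided by audio duration (the README does not spell out how it was measured). The names `rtf`, `RTF_TABLE`, and `relative_to_fp32` are illustrative and not part of `mlx-indextts`.

```python
from statistics import mean


def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor; values below 1.0 mean faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# RTF values copied from the README benchmark table
# (rows: zh short, zh long, en short, en long).
RTF_TABLE = {
    "fp32": [1.127, 1.232, 1.157, 1.193],
    "fp16": [1.538, 1.584, 1.462, 1.511],
    "8bit": [0.966, 1.035, 0.914, 0.956],
    "mps": [1.446, 1.699, 2.192, 1.783],
}


def relative_to_fp32(table: dict) -> dict:
    """Mean RTF of each column divided by the mean fp32 RTF."""
    base = mean(table["fp32"])
    return {name: mean(values) / base for name, values in table.items()}


if __name__ == "__main__":
    for name, ratio in relative_to_fp32(RTF_TABLE).items():
        print(f"{name}: mean RTF {mean(RTF_TABLE[name]):.3f} ({ratio:.2f}x fp32)")
```

On these four cases the fp16 column averages roughly 1.3x the fp32 RTF and the 8bit column roughly 0.8x, which matches the README's statement that fp16 saved space but ran slower than fp32.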

bigvgan.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:83ae363fd99e08258ad83e9c6c4c05ecb3da0fc77792fb4f48d3f53d7a9dffab
+size 224443907
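
The three lines above are a Git LFS pointer: the repository stores only the spec version, the SHA-256 of the real file, and its size in bytes. As a rough sketch of how a downloaded file could be checked against such a pointer, the helper names below are hypothetical, the local path is an assumption, and the pointer text is the diff content with the leading `+` removed.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's size and hash both match the pointer."""
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in chunks so multi-gigabyte weights are never fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:83ae363fd99e08258ad83e9c6c4c05ecb3da0fc77792fb4f48d3f53d7a9dffab
size 224443907
"""

if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER)
    print(pointer["algo"], pointer["size"])
    # Hypothetical local path; adjust to wherever the model was downloaded.
    local = Path("models/mlx-indextts2-standard-fp16/bigvgan.safetensors")
    if local.exists():
        print("match:", matches_pointer(local, pointer))
```

The same check applies to every other LFS-tracked file in this commit by swapping in its pointer text.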

config.json ADDED

	@@ -0,0 +1,125 @@

+{
+  "gpt": {
+    "model_dim": 1280,
+    "heads": 20,
+    "layers": 24,
+    "max_mel_tokens": 1815,
+    "max_text_tokens": 600,
+    "number_text_tokens": 12000,
+    "number_mel_codes": 8194,
+    "start_mel_token": 8192,
+    "stop_mel_token": 8193,
+    "start_text_token": 0,
+    "stop_text_token": 1,
+    "use_mel_codes_as_input": true,
+    "mel_length_compression": 1024,
+    "condition_type": "conformer_perceiver",
+    "condition_num_latent": 32,
+    "max_conditioning_inputs": 1,
+    "condition_module": {
+      "input_size": 100,
+      "output_size": 512,
+      "linear_units": 2048,
+      "attention_heads": 8,
+      "num_blocks": 6,
+      "dropout_rate": 0.0,
+      "input_layer": "conv2d2",
+      "pos_enc_layer_type": "rel_pos",
+      "normalize_before": true,
+      "use_cnn_module": true,
+      "cnn_module_kernel": 15,
+      "perceiver_mult": 2
+    },
+    "emo_condition_module": {
+      "input_size": 100,
+      "output_size": 512,
+      "linear_units": 1024,
+      "attention_heads": 4,
+      "num_blocks": 4,
+      "dropout_rate": 0.0,
+      "input_layer": "conv2d2",
+      "pos_enc_layer_type": "rel_pos",
+      "normalize_before": true,
+      "use_cnn_module": true,
+      "cnn_module_kernel": 15,
+      "perceiver_mult": 2
+    }
+  },
+  "bigvgan": {
+    "resblock": "1",
+    "upsample_rates": [
+      4,
+      4,
+      4,
+      4,
+      2,
+      2
+    ],
+    "upsample_kernel_sizes": [
+      8,
+      8,
+      4,
+      4,
+      4,
+      4
+    ],
+    "upsample_initial_channel": 1536,
+    "resblock_kernel_sizes": [
+      3,
+      7,
+      11
+    ],
+    "resblock_dilation_sizes": [
+      [
+        1,
+        3,
+        5
+      ],
+      [
+        1,
+        3,
+        5
+      ],
+      [
+        1,
+        3,
+        5
+      ]
+    ],
+    "gpt_dim": 1024,
+    "num_mels": 100,
+    "speaker_embedding_dim": 512,
+    "cond_d_vector_in_each_upsampling_layer": true,
+    "activation": "snakebeta",
+    "snake_logscale": true,
+    "feat_upsample": false,
+    "use_tanh_at_final": true
+  },
+  "mel": {
+    "sample_rate": 22050,
+    "n_fft": 1024,
+    "hop_length": 256,
+    "win_length": 1024,
+    "n_mels": 80,
+    "mel_fmin": 0.0,
+    "mel_fmax": null,
+    "normalize": false
+  },
+  "bpe_model": "bpe.model",
+  "gpt_checkpoint": "gpt.pth",
+  "bigvgan_checkpoint": "",
+  "version": 2.0,
+  "sample_rate": 22050,
+  "s2mel": {
+    "sr": 22050,
+    "n_fft": 1024,
+    "hop_length": 256,
+    "win_length": 1024,
+    "n_mels": 80
+  },
+  "precision": "fp16",
+  "fp16_conversion": {
+    "floating_weights": "cast_to_float16",
+    "source": "mlx fp32/fp16 safetensors"
+  }
+}
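
A few quantities follow directly from the values in this `config.json`. The sketch below uses a hand-copied excerpt rather than the full file, and `derived_values` is an illustrative helper. Whether the runtime actually depends on any of these relationships is an assumption, not something the diff states.

```python
import json
from math import prod

# Excerpt of the config.json values shown in the diff above (not the full file).
CONFIG_EXCERPT = json.loads("""
{
  "gpt": {
    "model_dim": 1280,
    "heads": 20,
    "mel_length_compression": 1024,
    "number_mel_codes": 8194,
    "start_mel_token": 8192,
    "stop_mel_token": 8193
  },
  "bigvgan": {"upsample_rates": [4, 4, 4, 4, 2, 2]},
  "mel": {"sample_rate": 22050, "hop_length": 256}
}
""")


def derived_values(cfg: dict) -> dict:
    """Compute simple quantities implied by the configuration."""
    gpt, mel = cfg["gpt"], cfg["mel"]
    return {
        # Per-head width of the GPT attention layers.
        "head_dim": gpt["model_dim"] // gpt["heads"],
        # Overall upsampling factor of the BigVGAN stack.
        "total_upsampling": prod(cfg["bigvgan"]["upsample_rates"]),
        # Mel frames produced per second of audio.
        "mel_frames_per_second": mel["sample_rate"] / mel["hop_length"],
        # Start/stop tokens must be valid indices into the mel code table.
        "special_tokens_fit": max(gpt["start_mel_token"], gpt["stop_mel_token"])
        < gpt["number_mel_codes"],
    }


if __name__ == "__main__":
    for key, value in derived_values(CONFIG_EXCERPT).items():
        print(f"{key}: {value}")
```

With these numbers the head dimension is 64, the upsample rates multiply to 1024 (the same value as `gpt.mel_length_compression`), and the mel front end runs at about 86.1 frames per second.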

config.yaml ADDED

	@@ -0,0 +1,120 @@

+dataset:
+    bpe_model: bpe.model
+    sample_rate: 24000
+    squeeze: false
+    mel:
+        sample_rate: 24000
+        n_fft: 1024
+        hop_length: 256
+        win_length: 1024
+        n_mels: 100
+        mel_fmin: 0
+        normalize: false
+gpt:
+    model_dim: 1280
+    max_mel_tokens: 1815
+    max_text_tokens: 600
+    heads: 20
+    use_mel_codes_as_input: true
+    mel_length_compression: 1024
+    layers: 24
+    number_text_tokens: 12000
+    number_mel_codes: 8194
+    start_mel_token: 8192
+    stop_mel_token: 8193
+    start_text_token: 0
+    stop_text_token: 1
+    train_solo_embeddings: false
+    condition_type: "conformer_perceiver"
+    condition_module:
+        output_size: 512
+        linear_units: 2048
+        attention_heads: 8
+        num_blocks: 6
+        input_layer: "conv2d2"
+        perceiver_mult: 2
+    emo_condition_module:
+        output_size: 512
+        linear_units: 1024
+        attention_heads: 4
+        num_blocks: 4
+        input_layer: "conv2d2"
+        perceiver_mult: 2
+semantic_codec:
+    codebook_size: 8192
+    hidden_size: 1024
+    codebook_dim: 8
+    vocos_dim: 384
+    vocos_intermediate_dim: 2048
+    vocos_num_layers: 12
+s2mel:
+    preprocess_params:
+        sr: 22050
+        spect_params:
+            n_fft: 1024
+            win_length: 1024
+            hop_length: 256
+            n_mels: 80
+            fmin: 0
+            fmax: "None"
+    dit_type: "DiT"
+    reg_loss_type: "l1"
+    style_encoder:
+        dim: 192
+    length_regulator:
+        channels: 512
+        is_discrete: false
+        in_channels: 1024
+        content_codebook_size: 2048
+        sampling_ratios: [1, 1, 1, 1]
+        vector_quantize: false
+        n_codebooks: 1
+        quantizer_dropout: 0.0
+        f0_condition: false
+        n_f0_bins: 512
+    DiT:
+        hidden_dim: 512
+        num_heads: 8
+        depth: 13
+        class_dropout_prob: 0.1
+        block_size: 8192
+        in_channels: 80
+        style_condition: true
+        final_layer_type: 'wavenet'
+        target: 'mel'
+        content_dim: 512
+        content_codebook_size: 1024
+        content_type: 'discrete'
+        f0_condition: false
+        n_f0_bins: 512
+        content_codebooks: 1
+        is_causal: false
+        long_skip_connection: true
+        zero_prompt_speech_token: false
+        time_as_token: false
+        style_as_token: false
+        uvit_skip_connection: true
+        add_resblock_in_transformer: false
+    wavenet:
+        hidden_dim: 512
+        num_layers: 8
+        kernel_size: 5
+        dilation_rate: 1
+        p_dropout: 0.2
+        style_condition: true
+gpt_checkpoint: gpt.pth
+w2v_stat: wav2vec2bert_stats.pt
+s2mel_checkpoint: s2mel.pth
+emo_matrix: feat2.pt
+spk_matrix: feat1.pt
+emo_num: [3, 17, 2, 8, 4, 5, 10, 24]
+qwen_emo_path: qwen0.6bemo4-merge/
+vocoder:
+    type: "bigvgan"
+    name: "nvidia/bigvgan_v2_22khz_80band_256x"
+version: 2.0
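
The `emo_num` entry lists eight group sizes that sum to 73. One plausible reading, and it is only an assumption here, is that the sizes partition the rows of `emo_matrix` (`feat2.pt`) and `spk_matrix` (`feat1.pt`) into per-category groups. The sketch below shows that partitioning on a plain Python list standing in for the matrix rows; `group_slices` and `split_rows` are hypothetical helpers, not functions from the runtime.

```python
from itertools import accumulate

# Copied from the emo_num entry in config.yaml above.
EMO_NUM = [3, 17, 2, 8, 4, 5, 10, 24]


def group_slices(sizes):
    """Turn group sizes into (start, stop) row ranges over a stacked matrix."""
    stops = list(accumulate(sizes))
    starts = [0] + stops[:-1]
    return list(zip(starts, stops))


def split_rows(rows, sizes):
    """Split a sequence of rows into consecutive groups of the given sizes."""
    if len(rows) != sum(sizes):
        raise ValueError(f"expected {sum(sizes)} rows, got {len(rows)}")
    return [rows[start:stop] for start, stop in group_slices(sizes)]


if __name__ == "__main__":
    rows = list(range(sum(EMO_NUM)))  # stand-in for the matrix rows
    groups = split_rows(rows, EMO_NUM)
    print(group_slices(EMO_NUM))
    print([len(group) for group in groups])
```

If the stored matrices do not have 73 rows, this reading is wrong and `split_rows` raises instead of silently truncating.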

feat1.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f219cb447d80216ba615666da2ff8d63ac544eee26657f3a7b278692bf7a67c4
+size 57170

feat2.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9c4292e96dee535aea9a6206e9a0c856dd578dde9212acdb16dd3ada4d12bf80
+size 374866

gpt.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:cddff93214b1e15abc219d8810e34c04a7a9fd01838b4a795c6bd84d423ac513
+size 1732036338

s2mel.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:9b57aa4a572cc827b2bb0920303f9a515e9be799583343f65706c3160f200f96
+size 207320382

tokenizer.model ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:b2a5ce8090d32da3642cc4f81fdc996376bc6dd3f4cd5e3d165f71120d9f2bc8
+size 475997

vq2emb.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2aa0b789c0e3b7c55e4faade1ba3f4e494d8086420381f4ff0907c8b9b595ef6
+size 149778

wav2vec2bert_stats.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:c9c176c2b8850ab2e3ba828bbfa969deaf4566ce55db5f2687b8430b87526ad2
+size 9343
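
The `size` fields of the eight LFS pointers in this commit can be summed to cross-check the README's "Approx local size" figure. This is a small sketch with the byte counts copied from the pointers above. `total_size` is an illustrative helper, and the README and the two config files are left out because their sizes are not shown in the diff.

```python
# Sizes in bytes, copied from the Git LFS pointers in this commit.
LFS_SIZES = {
    "bigvgan.safetensors": 224443907,
    "feat1.pt": 57170,
    "feat2.pt": 374866,
    "gpt.safetensors": 1732036338,
    "s2mel.safetensors": 207320382,
    "tokenizer.model": 475997,
    "vq2emb.safetensors": 149778,
    "wav2vec2bert_stats.pt": 9343,
}


def total_size(sizes: dict) -> dict:
    """Sum the sizes and report the total in bytes, decimal GB, and binary GiB."""
    total = sum(sizes.values())
    return {"bytes": total, "GB": total / 10**9, "GiB": total / 2**30}


if __name__ == "__main__":
    summary = total_size(LFS_SIZES)
    print(f"{summary['bytes']} bytes = {summary['GB']:.2f} GB = {summary['GiB']:.2f} GiB")
    largest = max(LFS_SIZES, key=LFS_SIZES.get)
    print(f"largest file: {largest} ({LFS_SIZES[largest] / summary['bytes']:.0%} of total)")
```

The total is 2,164,867,781 bytes, about 2.16 GB in decimal units or 2.02 GiB in binary units, so the README's 2.0GB figure is consistent with a binary-unit reading. `gpt.safetensors` alone accounts for about 80% of it.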