vanch007 committed
Commit 31118db · verified · 1 Parent(s): 47e3b39

Add files using upload-large-folder tool

README.md ADDED
@@ -0,0 +1,115 @@
+ ---
+ library_name: mlx
+ pipeline_tag: text-to-speech
+ tags:
+ - indextts2
+ - mlx-indextts
+ - voice-cloning
+ - fp16
+ - zh
+ - en
+ - text-to-speech
+ - apple-silicon
+ - mlx
+ license: mit
+ ---
+
+ # mlx-indextts2-standard-fp16
+
+ This is a converted MLX IndexTTS2 model for Apple Silicon inference with [`solar2ain/mlx-indextts`](https://github.com/solar2ain/mlx-indextts).
+
+ It was prepared for the local `/Users/vanch/index-tts` IndexTTS2 optimization project, where the goal was stable Vietnamese and multilingual TTS on an M3 Max Mac without PyTorch MPS memory crashes.
+
+ ## Variant
+
+ - Profile: **Standard multilingual**
+ - Precision / quantization: **fp16**
+ - Approx local size: **2.0GB**
+ - Source checkpoint directory during conversion: `/Users/vanch/index-tts/checkpoints`
+ - Note: All floating MLX weights cast to fp16 from the standard fp32 conversion.
+ - Conversion detail: Derived locally by casting floating MLX safetensors to `float16`; this is not an upstream CLI quantization mode.
+
+ ## Expected Files
+
+ The repository root is a ready-to-use MLX IndexTTS2 model directory:
+
+ - `gpt.safetensors`
+ - `s2mel.safetensors`
+ - `bigvgan.safetensors`
+ - `vq2emb.safetensors`
+ - `tokenizer.model`
+ - `config.yaml`
+ - `config.json`
+ - `feat1.pt`
+ - `feat2.pt`
+ - `wav2vec2bert_stats.pt`
+
+ ## Usage
+
+ Install and use `mlx-indextts`:
+
+ ```bash
+ git clone https://github.com/solar2ain/mlx-indextts.git
+ cd mlx-indextts
+ uv sync --extra convert --extra v2
+
+ huggingface-cli download vanch007/mlx-indextts2-standard-fp16 \
+   --local-dir models/mlx-indextts2-standard-fp16 \
+   --local-dir-use-symlinks False
+
+ uv run mlx-indextts generate \
+   -m models/mlx-indextts2-standard-fp16 \
+   -r /path/to/reference_or_speaker.npz \
+   -t "Your text here" \
+   -o output.wav \
+   --memory-limit 24 \
+   --diffusion-steps 16
+ ```
+
+ For repeated generation, precompute speaker conditioning first:
+
+ ```bash
+ uv run mlx-indextts speaker \
+   -m models/mlx-indextts2-standard-fp16 \
+   -r /path/to/reference.wav \
+   -o speaker.npz \
+   --memory-limit 24
+ ```
+
+ ## Benchmark
+
+ Benchmarked on a 128GB unified-memory M3 Max Mac using:
+
+ - `mlx-indextts` from `solar2ain/mlx-indextts`
+ - precomputed `.npz` speaker conditioning
+ - `memory_limit=24GB`
+ - `diffusion_steps=16`
+ - emotion=`calm`, `emo_alpha=0.6`
+ - same text set across fp32 / fp16 / 8bit / optimized PyTorch MPS
+
+ RTF lower is faster:
+
+ | Case | fp32 MLX RTF | fp16 MLX RTF | 8bit MLX RTF | PyTorch MPS RTF |
+ |---|---:|---:|---:|---:|
+ | zh short | 1.127 | 1.538 | 0.966 | 1.446 |
+ | zh long | 1.232 | 1.584 | 1.035 | 1.699 |
+ | en short | 1.157 | 1.462 | 0.914 | 2.192 |
+ | en long | 1.193 | 1.511 | 0.956 | 1.783 |
+
+ Summary from the local comparison:
+
+ - 8bit was the fastest MLX route in this test set.
+ - fp16 saved space but was slower than fp32 for the standard profile.
+ - Vietnamese fp16 was slightly faster than Vietnamese fp32, but Vietnamese 8bit was fastest.
+
+ ## ASR Validation
+
+ ASR validation with local `mlx_whisper` + `whisper-large-v3-turbo` found no empty audio, wrong-language output, or obvious missing sentences. Chinese long-form ASR showed a minor `她/他` homophone difference; English long-form 8-bit ASR showed a minor tense difference.
+
+ ASR was used only as an automated sanity check. Final production selection should still include human listening, especially for long-form Vietnamese narration.
+
+ ## Provenance and Scope
+
+ This is an MLX conversion for local Apple Silicon inference, not the original PyTorch release. The original implementation and model family are associated with IndexTTS / IndexTTS2; the MLX runtime used here is `solar2ain/mlx-indextts`.
+
+ The benchmark numbers are environment-specific and should be treated as local M3 Max results, not universal performance guarantees.
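Reviewer note: the benchmark summary in the README hunk above can be re-derived from its own table. The sketch below is a minimal illustration, not part of the commit. It assumes the conventional definition of real-time factor (synthesis time divided by audio duration), which the README does not spell out, and the function names are hypothetical. The RTF values are copied from the table.

```python
from statistics import mean

# RTF values copied from the README benchmark table (local M3 Max results).
RTF = {
    "zh short": {"fp32": 1.127, "fp16": 1.538, "8bit": 0.966, "mps": 1.446},
    "zh long": {"fp32": 1.232, "fp16": 1.584, "8bit": 1.035, "mps": 1.699},
    "en short": {"fp32": 1.157, "fp16": 1.462, "8bit": 0.914, "mps": 2.192},
    "en long": {"fp32": 1.193, "fp16": 1.511, "8bit": 0.956, "mps": 1.783},
}


def real_time_factor(synthesis_seconds, audio_seconds):
    """Assumed definition: seconds of compute per second of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def mean_rtf_by_variant(table):
    """Average each column of the table over all test cases."""
    variants = next(iter(table.values())).keys()
    return {v: mean(row[v] for row in table.values()) for v in variants}


def fastest_variant(table):
    """Return the variant with the lowest mean RTF (lower is faster)."""
    means = mean_rtf_by_variant(table)
    return min(means, key=means.get)


if __name__ == "__main__":
    ranked = sorted(mean_rtf_by_variant(RTF).items(), key=lambda kv: kv[1])
    for variant, value in ranked:
        print(f"{variant:>5}: mean RTF {value:.3f}")
```

On these four rows the ordering agrees with the README summary: 8bit has the lowest mean RTF, and fp16 is slower than fp32. The Vietnamese claim cannot be checked this way because the table has no Vietnamese rows.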
bigvgan.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:83ae363fd99e08258ad83e9c6c4c05ecb3da0fc77792fb4f48d3f53d7a9dffab
+ size 224443907
config.json ADDED
@@ -0,0 +1,125 @@
+ {
+   "gpt": {
+     "model_dim": 1280,
+     "heads": 20,
+     "layers": 24,
+     "max_mel_tokens": 1815,
+     "max_text_tokens": 600,
+     "number_text_tokens": 12000,
+     "number_mel_codes": 8194,
+     "start_mel_token": 8192,
+     "stop_mel_token": 8193,
+     "start_text_token": 0,
+     "stop_text_token": 1,
+     "use_mel_codes_as_input": true,
+     "mel_length_compression": 1024,
+     "condition_type": "conformer_perceiver",
+     "condition_num_latent": 32,
+     "max_conditioning_inputs": 1,
+     "condition_module": {
+       "input_size": 100,
+       "output_size": 512,
+       "linear_units": 2048,
+       "attention_heads": 8,
+       "num_blocks": 6,
+       "dropout_rate": 0.0,
+       "input_layer": "conv2d2",
+       "pos_enc_layer_type": "rel_pos",
+       "normalize_before": true,
+       "use_cnn_module": true,
+       "cnn_module_kernel": 15,
+       "perceiver_mult": 2
+     },
+     "emo_condition_module": {
+       "input_size": 100,
+       "output_size": 512,
+       "linear_units": 1024,
+       "attention_heads": 4,
+       "num_blocks": 4,
+       "dropout_rate": 0.0,
+       "input_layer": "conv2d2",
+       "pos_enc_layer_type": "rel_pos",
+       "normalize_before": true,
+       "use_cnn_module": true,
+       "cnn_module_kernel": 15,
+       "perceiver_mult": 2
+     }
+   },
+   "bigvgan": {
+     "resblock": "1",
+     "upsample_rates": [
+       4,
+       4,
+       4,
+       4,
+       2,
+       2
+     ],
+     "upsample_kernel_sizes": [
+       8,
+       8,
+       4,
+       4,
+       4,
+       4
+     ],
+     "upsample_initial_channel": 1536,
+     "resblock_kernel_sizes": [
+       3,
+       7,
+       11
+     ],
+     "resblock_dilation_sizes": [
+       [
+         1,
+         3,
+         5
+       ],
+       [
+         1,
+         3,
+         5
+       ],
+       [
+         1,
+         3,
+         5
+       ]
+     ],
+     "gpt_dim": 1024,
+     "num_mels": 100,
+     "speaker_embedding_dim": 512,
+     "cond_d_vector_in_each_upsampling_layer": true,
+     "activation": "snakebeta",
+     "snake_logscale": true,
+     "feat_upsample": false,
+     "use_tanh_at_final": true
+   },
+   "mel": {
+     "sample_rate": 22050,
+     "n_fft": 1024,
+     "hop_length": 256,
+     "win_length": 1024,
+     "n_mels": 80,
+     "mel_fmin": 0.0,
+     "mel_fmax": null,
+     "normalize": false
+   },
+   "bpe_model": "bpe.model",
+   "gpt_checkpoint": "gpt.pth",
+   "bigvgan_checkpoint": "",
+   "version": 2.0,
+   "sample_rate": 22050,
+   "s2mel": {
+     "sr": 22050,
+     "n_fft": 1024,
+     "hop_length": 256,
+     "win_length": 1024,
+     "n_mels": 80
+   },
+   "precision": "fp16",
+   "fp16_conversion": {
+     "floating_weights": "cast_to_float16",
+     "source": "mlx fp32/fp16 safetensors"
+   }
+ }
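Reviewer note: a few quantities implied by the `config.json` hunk above can be computed directly, which may help when comparing it against `config.yaml`. The sketch below is illustrative only. The `summarize` helper is hypothetical, the per-head dimension follows the usual transformer convention of `model_dim / heads`, and nothing here is claimed to mirror what the `mlx-indextts` loader actually validates. All values are copied from the hunk.

```python
from math import prod

# Values copied from the config.json hunk (only the fields used below).
CONFIG = {
    "gpt": {"model_dim": 1280, "heads": 20, "mel_length_compression": 1024},
    "bigvgan": {
        "upsample_rates": [4, 4, 4, 4, 2, 2],
        "upsample_kernel_sizes": [8, 8, 4, 4, 4, 4],
    },
    "mel": {"sample_rate": 22050, "hop_length": 256, "n_mels": 80},
    "s2mel": {"sr": 22050, "hop_length": 256, "n_mels": 80},
}


def summarize(cfg):
    """Derive a few sanity-check quantities from the config dictionary."""
    gpt, vocoder, mel, s2mel = cfg["gpt"], cfg["bigvgan"], cfg["mel"], cfg["s2mel"]
    if gpt["model_dim"] % gpt["heads"] != 0:
        raise ValueError("model_dim is not divisible by heads")
    if len(vocoder["upsample_rates"]) != len(vocoder["upsample_kernel_sizes"]):
        raise ValueError("upsample_rates and upsample_kernel_sizes differ in length")
    return {
        # Assumed convention: attention width per head.
        "head_dim": gpt["model_dim"] // gpt["heads"],
        # Product of the per-stage upsampling factors.
        "total_upsampling": prod(vocoder["upsample_rates"]),
        # Mel frames produced per second of audio at this hop length.
        "mel_frames_per_second": mel["sample_rate"] / mel["hop_length"],
        # Whether the "mel" and "s2mel" sections agree on shared settings.
        "mel_matches_s2mel": (
            mel["sample_rate"] == s2mel["sr"]
            and mel["hop_length"] == s2mel["hop_length"]
            and mel["n_mels"] == s2mel["n_mels"]
        ),
    }


if __name__ == "__main__":
    for key, value in summarize(CONFIG).items():
        print(f"{key}: {value}")
```

The product of `upsample_rates` comes out to 1024, the same number as `mel_length_compression`. Whether the runtime relies on that relationship is not stated in the commit.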
config.yaml ADDED
@@ -0,0 +1,120 @@
+ dataset:
+     bpe_model: bpe.model
+     sample_rate: 24000
+     squeeze: false
+     mel:
+         sample_rate: 24000
+         n_fft: 1024
+         hop_length: 256
+         win_length: 1024
+         n_mels: 100
+         mel_fmin: 0
+         normalize: false
+
+ gpt:
+     model_dim: 1280
+     max_mel_tokens: 1815
+     max_text_tokens: 600
+     heads: 20
+     use_mel_codes_as_input: true
+     mel_length_compression: 1024
+     layers: 24
+     number_text_tokens: 12000
+     number_mel_codes: 8194
+     start_mel_token: 8192
+     stop_mel_token: 8193
+     start_text_token: 0
+     stop_text_token: 1
+     train_solo_embeddings: false
+     condition_type: "conformer_perceiver"
+     condition_module:
+         output_size: 512
+         linear_units: 2048
+         attention_heads: 8
+         num_blocks: 6
+         input_layer: "conv2d2"
+         perceiver_mult: 2
+     emo_condition_module:
+         output_size: 512
+         linear_units: 1024
+         attention_heads: 4
+         num_blocks: 4
+         input_layer: "conv2d2"
+         perceiver_mult: 2
+
+ semantic_codec:
+     codebook_size: 8192
+     hidden_size: 1024
+     codebook_dim: 8
+     vocos_dim: 384
+     vocos_intermediate_dim: 2048
+     vocos_num_layers: 12
+
+ s2mel:
+     preprocess_params:
+         sr: 22050
+         spect_params:
+             n_fft: 1024
+             win_length: 1024
+             hop_length: 256
+             n_mels: 80
+             fmin: 0
+             fmax: "None"
+
+     dit_type: "DiT"
+     reg_loss_type: "l1"
+     style_encoder:
+         dim: 192
+     length_regulator:
+         channels: 512
+         is_discrete: false
+         in_channels: 1024
+         content_codebook_size: 2048
+         sampling_ratios: [1, 1, 1, 1]
+         vector_quantize: false
+         n_codebooks: 1
+         quantizer_dropout: 0.0
+         f0_condition: false
+         n_f0_bins: 512
+     DiT:
+         hidden_dim: 512
+         num_heads: 8
+         depth: 13
+         class_dropout_prob: 0.1
+         block_size: 8192
+         in_channels: 80
+         style_condition: true
+         final_layer_type: 'wavenet'
+         target: 'mel'
+         content_dim: 512
+         content_codebook_size: 1024
+         content_type: 'discrete'
+         f0_condition: false
+         n_f0_bins: 512
+         content_codebooks: 1
+         is_causal: false
+         long_skip_connection: true
+         zero_prompt_speech_token: false
+         time_as_token: false
+         style_as_token: false
+         uvit_skip_connection: true
+         add_resblock_in_transformer: false
+     wavenet:
+         hidden_dim: 512
+         num_layers: 8
+         kernel_size: 5
+         dilation_rate: 1
+         p_dropout: 0.2
+         style_condition: true
+
+ gpt_checkpoint: gpt.pth
+ w2v_stat: wav2vec2bert_stats.pt
+ s2mel_checkpoint: s2mel.pth
+ emo_matrix: feat2.pt
+ spk_matrix: feat1.pt
+ emo_num: [3, 17, 2, 8, 4, 5, 10, 24]
+ qwen_emo_path: qwen0.6bemo4-merge/
+ vocoder:
+     type: "bigvgan"
+     name: "nvidia/bigvgan_v2_22khz_80band_256x"
+ version: 2.0
feat1.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f219cb447d80216ba615666da2ff8d63ac544eee26657f3a7b278692bf7a67c4
+ size 57170
feat2.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9c4292e96dee535aea9a6206e9a0c856dd578dde9212acdb16dd3ada4d12bf80
+ size 374866
gpt.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cddff93214b1e15abc219d8810e34c04a7a9fd01838b4a795c6bd84d423ac513
+ size 1732036338
s2mel.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:9b57aa4a572cc827b2bb0920303f9a515e9be799583343f65706c3160f200f96
+ size 207320382
tokenizer.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b2a5ce8090d32da3642cc4f81fdc996376bc6dd3f4cd5e3d165f71120d9f2bc8
+ size 475997
vq2emb.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2aa0b789c0e3b7c55e4faade1ba3f4e494d8086420381f4ff0907c8b9b595ef6
+ size 149778
wav2vec2bert_stats.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c9c176c2b8850ab2e3ba828bbfa969deaf4566ce55db5f2687b8430b87526ad2
+ size 9343
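Reviewer note: each binary file in this commit is stored as a Git LFS pointer with three fields (`version`, `oid sha256:...`, `size`). After downloading, a file can be checked against its pointer. The sketch below is one possible way to do that with the Python standard library. The helper names `parse_lfs_pointer` and `verify_file` are hypothetical, and the pointer text and sizes are copied from the hunks in this commit.

```python
import hashlib
from pathlib import Path

# Pointer text copied from the wav2vec2bert_stats.pt hunk.
POINTER_TEXT = """version https://git-lfs.github.com/spec/v1
oid sha256:c9c176c2b8850ab2e3ba828bbfa969deaf4566ce55db5f2687b8430b87526ad2
size 9343
"""

# Sizes in bytes copied from the eight LFS pointers in this commit.
LFS_SIZES = {
    "bigvgan.safetensors": 224443907,
    "feat1.pt": 57170,
    "feat2.pt": 374866,
    "gpt.safetensors": 1732036338,
    "s2mel.safetensors": 207320382,
    "tokenizer.model": 475997,
    "vq2emb.safetensors": 149778,
    "wav2vec2bert_stats.pt": 9343,
}


def parse_lfs_pointer(text):
    """Split a pointer file into its version, sha256 digest, and byte size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algorithm}")
    return {"version": fields["version"], "sha256": digest, "size": int(fields["size"])}


def verify_file(path, pointer, chunk_size=1 << 20):
    """Return True if the file's size and SHA-256 digest match the pointer."""
    path = Path(path)
    if path.stat().st_size != pointer["size"]:
        return False
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read in chunks so multi-gigabyte weights are not loaded at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == pointer["sha256"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER_TEXT)
    print(pointer["size"], pointer["sha256"][:12])
    total = sum(LFS_SIZES.values())
    print(f"total LFS payload: {total} bytes ({total / 2**30:.2f} GiB)")
```

The eight LFS payloads sum to 2,164,867,781 bytes, about 2.02 GiB, which is consistent with the README's "Approx local size: 2.0GB" if that figure is read in binary units.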