Voxtral-4B-TTS-2603-ExecuTorch-MLX
Pre-exported ExecuTorch artifacts for Voxtral-4B-TTS-2603 with the MLX backend for Apple Silicon. The LM decoder and flow head use bf16 precision with 4-bit weight-only linear quantization and 8-bit embedding quantization. The codec decoder is exported unquantized and lowered natively to MLX.
This repository is the Apple Silicon companion to the CUDA artifact repo: younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA.
Overview
The pipeline has two stages: export (Python, once) and inference (C++ runner, repeated). This repo ships the export outputs so you can skip straight to inference on a locally built ExecuTorch MLX runner.
The model has three components:
- Mistral 4B LLM decoder: autoregressive text to hidden states
- Flow Matching Head: hidden states to 37 audio codebook tokens per frame
- Codec Decoder: codebook tokens to 24 kHz mono waveform
Files
| File | Size | What |
|---|---|---|
| `model.pte` | 2.20 GiB | LM decoder, token embedding, audio embedding, semantic head, and flow velocity methods lowered to MLX |
| `codec_decoder.pte` | 289 MiB | Native MLX codec decoder for waveform synthesis |
The tokenizer and voice embeddings are not included. Download them from the base model so they match the upstream Voxtral release.
Performance
Validated on Apple Silicon with seed=42 and the prompt "Hello, how are you today?".
| Config | Audio duration | Generate time | Generation RTF | Process wall time | Notes |
|---|---|---|---|---|---|
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2932 ms | 0.852326 | 4.20 s | refreshed after MLX indexing fix |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 3132 ms | 0.910465 | 5.19 s | first measured run |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2634 ms | 0.765698 | 3.15 s | warm run |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2607 ms | 0.757849 | 3.13 s | warm run |
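The Generation RTF column is consistent with generate time divided by audio duration, so values below 1 mean generation ran faster than real time. The sketch below recomputes the column from the rounded table values; the helper name `real_time_factor` is illustrative and not part of the runner.

```python
def real_time_factor(generate_ms: float, audio_s: float) -> float:
    """Generation time divided by audio duration; below 1.0 is faster than real time."""
    return (generate_ms / 1000.0) / audio_s


# (generate time in ms, audio duration in s) copied from the table above.
runs = [(2932, 3.44), (3132, 3.44), (2634, 3.44), (2607, 3.44)]

for generate_ms, audio_s in runs:
    rtf = real_time_factor(generate_ms, audio_s)
    print(f"{generate_ms} ms / {audio_s} s -> RTF {rtf:.6f}")
```

With these inputs the printed values match the table to six decimal places, which suggests the reported audio duration is 3.44 s exactly.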
Latest WAV quality check: peak 0.425764, clipped samples 0. Apple Speech
transcribed the original generated sample as "Hello how are you today".
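A check of this kind can be reproduced with the Python standard library. This is a minimal sketch, not the script used for the numbers above. It assumes a 16-bit PCM WAV, normalizes the peak by 32768, and counts a sample as clipped when it sits at an int16 limit. Both conventions are assumptions, since the card does not define them.

```python
import array
import sys
import wave


def wav_peak_and_clipped(path: str) -> tuple[float, int]:
    """Return (peak amplitude in [0, 1], number of samples at the int16 limits)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    if not samples:
        return 0.0, 0

    peak = max(abs(s) for s in samples) / 32768.0
    # Assumed definition of clipping: the sample saturates the int16 range.
    clipped = sum(1 for s in samples if s in (-32768, 32767))
    return peak, clipped
```

Calling `wav_peak_and_clipped("output.wav")` on a generated file returns the two numbers to compare against the figures reported here.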
Prerequisites
- macOS on Apple Silicon.
- ExecuTorch built from source with `EXECUTORCH_BUILD_MLX=ON`.
- Tokenizer and voice embeddings from mistralai/Voxtral-4B-TTS-2603.
git clone https://github.com/pytorch/executorch ~/executorch
cd ~/executorch
./install_executorch.sh
pip install -e . --no-build-isolation
make voxtral_tts-mlx
The native codec artifacts were validated against ExecuTorch source commit:
ba5b038400299a383dbe93ab394a30f42a953cc1
Download
pip install huggingface_hub
# ExecuTorch MLX artifacts.
hf download younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-MLX \
--local-dir voxtral_tts_mlx
# Tokenizer + voice embeddings from the base model.
hf download mistralai/Voxtral-4B-TTS-2603 \
tekken.json voice_embedding/* \
--local-dir voxtral_tts_base
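The same two downloads can be scripted from Python with `snapshot_download` from `huggingface_hub`. This is a sketch that mirrors the commands above; the constant and function names are illustrative, and the patterns passed to `allow_patterns` are assumed to select the same files as the CLI arguments.

```python
MLX_REPO = "younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-MLX"
BASE_REPO = "mistralai/Voxtral-4B-TTS-2603"
# Only the tokenizer and the voice embeddings are needed from the base model.
BASE_PATTERNS = ["tekken.json", "voice_embedding/*"]


def download_artifacts(
    artifacts_dir: str = "voxtral_tts_mlx",
    base_dir: str = "voxtral_tts_base",
) -> None:
    """Fetch the MLX artifacts and the matching tokenizer and voice embeddings."""
    # Imported here so the module loads even when huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    snapshot_download(repo_id=MLX_REPO, local_dir=artifacts_dir)
    snapshot_download(
        repo_id=BASE_REPO,
        local_dir=base_dir,
        allow_patterns=BASE_PATTERNS,
    )
```

Calling `download_artifacts()` produces the same directory layout that the Run section expects.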
Run
unset CPATH
cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
--model voxtral_tts_mlx/model.pte \
--codec voxtral_tts_mlx/codec_decoder.pte \
--tokenizer voxtral_tts_base/tekken.json \
--voice voxtral_tts_base/voice_embedding/neutral_female.pt \
--text "Hello, how are you today?" \
--output output.wav \
--seed 42 \
--max_new_tokens 200
Output is 24 kHz mono 16-bit PCM. Listen with:
ffplay output.wav
Streaming
Add --streaming to emit codec output in chunks instead of one batch at the
end. Pair it with --speaker to pipe raw f32le PCM to stdout for live
playback:
cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
--model voxtral_tts_mlx/model.pte \
--codec voxtral_tts_mlx/codec_decoder.pte \
--tokenizer voxtral_tts_base/tekken.json \
--voice voxtral_tts_base/voice_embedding/neutral_female.pt \
--text "Introducing real-time Voxtral TTS streaming on Apple Silicon with the ExecuTorch MLX backend." \
--seed 42 \
--max_new_tokens 200 \
--streaming \
--speaker \
| ffplay -f f32le -sample_rate 24000 -ch_layout mono -nodisp -autoexit -
To play through aplay instead, replace the ffplay stage of the pipeline with `aplay -f FLOAT_LE -r 24000 -c 1`.
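The raw stream can also be captured and saved rather than played. The sketch below converts little-endian float32 PCM into a 16-bit WAV file using only the standard library. It assumes the 24 kHz mono f32le format given in the ffplay command, and it reads the whole stream before writing, so it is a batch capture rather than live processing. The function names are illustrative.

```python
import array
import sys
import wave

SAMPLE_RATE = 24000  # matches -sample_rate in the ffplay command above


def f32le_to_int16(raw: bytes) -> bytes:
    """Convert little-endian float32 PCM in [-1, 1] to little-endian int16 PCM."""
    usable = len(raw) - (len(raw) % 4)  # drop a trailing partial sample
    floats = array.array("f")
    floats.frombytes(raw[:usable])
    if sys.byteorder == "big":
        floats.byteswap()
    # Clamp to [-1, 1] before scaling so out-of-range values cannot overflow.
    ints = array.array("h", (int(max(-1.0, min(1.0, x)) * 32767) for x in floats))
    if sys.byteorder == "big":
        ints.byteswap()
    return ints.tobytes()


def save_stream(raw: bytes, path: str) -> None:
    """Write raw f32le mono PCM to a 16-bit WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(SAMPLE_RATE)
        wav.writeframes(f32le_to_int16(raw))


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: <runner command with --streaming --speaker> | python capture.py out.wav
    save_stream(sys.stdin.buffer.read(), sys.argv[1])
```

The script name `capture.py` in the usage comment is a placeholder; pipe the runner's stdout into it in place of the ffplay stage.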
Re-export
python examples/models/voxtral_tts/export_voxtral_tts.py \
--model-path ~/models/Voxtral-4B-TTS-2603 \
--backend mlx \
--dtype bf16 \
--qlinear 4w \
--qembedding 8w \
--output-dir ./voxtral_tts_exports_mlx_4w
--qembedding 8w auto-selects --qembedding-group-size=128. --qlinear-codec
is not yet validated for MLX, so this export keeps the codec unquantized.
Checksums
904131ac1a1e3552ea4ada566c19eb57d654e662f93f906456aa1f8633825688 model.pte
162178ce94732db05bb74d7240a97f2c5a898b8819a29b5d59ebf076aeda8891 codec_decoder.pte
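The digests above are 64 hexadecimal characters long, which matches SHA-256. Treating them as SHA-256 is an assumption, since the card does not name the algorithm. Under that assumption, downloaded files can be checked with a short script; the names `EXPECTED_SHA256`, `sha256_of`, and `verify` are illustrative.

```python
import hashlib
from pathlib import Path

# Digests copied from the list above; assumed to be SHA-256.
EXPECTED_SHA256 = {
    "model.pte": "904131ac1a1e3552ea4ada566c19eb57d654e662f93f906456aa1f8633825688",
    "codec_decoder.pte": "162178ce94732db05bb74d7240a97f2c5a898b8819a29b5d59ebf076aeda8891",
}


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so multi-gigabyte artifacts never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(directory: str = "voxtral_tts_mlx") -> dict[str, bool]:
    """Map each artifact name to True if it exists and its digest matches."""
    results = {}
    for name, expected in EXPECTED_SHA256.items():
        path = Path(directory) / name
        results[name] = path.is_file() and sha256_of(path) == expected
    return results
```

`verify()` returns `False` for a file that is missing or whose digest differs, so a fully valid download yields `True` for both entries.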
Model tree for younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-MLX
Base model
mistralai/Ministral-3-3B-Base-2512