Voxtral-4B-TTS-2603-ExecuTorch-MLX

Pre-exported ExecuTorch artifacts for Voxtral-4B-TTS-2603 with the MLX backend for Apple Silicon. The LM decoder and flow head use bf16 precision with 4-bit weight-only linear quantization and 8-bit embedding quantization. The codec decoder is exported unquantized and lowered natively to MLX.

This repository is the Apple Silicon companion to the CUDA artifact repo: younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-CUDA.

Overview

The pipeline has two stages: export (Python, once) and inference (C++ runner, repeated). This repo ships the export outputs so you can skip straight to inference on a locally built ExecuTorch MLX runner.

The model has three components:

  1. Mistral 4B LLM decoder: autoregressive text to hidden states
  2. Flow Matching Head: hidden states to 37 audio codebook tokens per frame
  3. Codec Decoder: codebook tokens to 24 kHz mono waveform

Files

| File | Size | What |
| --- | --- | --- |
| model.pte | 2.20 GiB | LM decoder, token embedding, audio embedding, semantic head, and flow velocity methods lowered to MLX |
| codec_decoder.pte | 289 MiB | Native MLX codec decoder for waveform synthesis |

The tokenizer and voice embeddings are not included. Download them from the base model so they match the upstream Voxtral release.

Performance

Validated on Apple Silicon with seed=42 and prompt "Hello, how are you today?".

| Config | Audio | Generate time | Generation RTF | Process wall | Notes |
| --- | --- | --- | --- | --- | --- |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2932 ms | 0.852326 | 4.20 s | refreshed after MLX indexing fix |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 3132 ms | 0.910465 | 5.19 s | first measured run |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2634 ms | 0.765698 | 3.15 s | warm run |
| MLX bf16 + 4w linear + 8w embedding | 3.44 s | 2607 ms | 0.757849 | 3.13 s | warm run |
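
The Generation RTF values are consistent with generate time divided by audio duration, so values below 1.0 mean faster than real time. That reading is inferred from the numbers in the table, not from runner documentation, and the helper name rtf below is purely illustrative. A minimal sketch for recomputing the column from your own runs:

```shell
# Real-time factor = generation time / audio duration.
# Assumption: this matches how the "Generation RTF" column was derived.
# Arguments: generate time in milliseconds, audio duration in seconds.
rtf() {
    awk -v ms="$1" -v audio_s="$2" \
        'BEGIN { printf "%.6f\n", (ms / 1000) / audio_s }'
}

# Example with the first table row: 2932 ms for 3.44 s of audio.
rtf 2932 3.44
```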

Latest WAV quality check: peak 0.425764, clipped samples 0. Apple Speech transcribed the original generated sample as "Hello how are you today".

Prerequisites

  • macOS on Apple Silicon.
  • ExecuTorch built from source with EXECUTORCH_BUILD_MLX=ON.
  • Tokenizer and voice embeddings from mistralai/Voxtral-4B-TTS-2603.
git clone https://github.com/pytorch/executorch ~/executorch
cd ~/executorch

./install_executorch.sh
pip install -e . --no-build-isolation
make voxtral_tts-mlx

The native codec artifacts were validated against ExecuTorch source commit:

ba5b038400299a383dbe93ab394a30f42a953cc1

Download

pip install huggingface_hub

# ExecuTorch MLX artifacts.
hf download younghan-meta/Voxtral-4B-TTS-2603-ExecuTorch-MLX \
    --local-dir voxtral_tts_mlx

# Tokenizer + voice embeddings from the base model.
hf download mistralai/Voxtral-4B-TTS-2603 \
    tekken.json "voice_embedding/*" \
    --local-dir voxtral_tts_base
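
Before building or running anything, it can help to confirm that both downloads produced the files the Run command expects. The sketch below checks only the four paths used in this README; check_artifacts is a hypothetical helper name, and other voice files may exist that it does not look for.

```shell
# Report any file that the Run command in this README would fail to find.
# Arguments: MLX artifact directory, base-model directory.
# Returns 0 when all four files exist, 1 otherwise.
check_artifacts() {
    mlx_dir="$1"
    base_dir="$2"
    missing=0
    for f in \
        "$mlx_dir/model.pte" \
        "$mlx_dir/codec_decoder.pte" \
        "$base_dir/tekken.json" \
        "$base_dir/voice_embedding/neutral_female.pt"
    do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, with the directory names from the download commands above:
#   check_artifacts voxtral_tts_mlx voxtral_tts_base && echo "all artifacts present"
```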

Run

unset CPATH

cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
    --model voxtral_tts_mlx/model.pte \
    --codec voxtral_tts_mlx/codec_decoder.pte \
    --tokenizer voxtral_tts_base/tekken.json \
    --voice voxtral_tts_base/voice_embedding/neutral_female.pt \
    --text "Hello, how are you today?" \
    --output output.wav \
    --seed 42 \
    --max_new_tokens 200

Output is 24 kHz mono 16-bit PCM. Listen with:

ffplay output.wav
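
Because the output format is fixed (24 kHz, mono, 16-bit), the audio duration can be estimated from the file size alone, which is handy for comparing against the Audio column in the Performance table. This sketch assumes a canonical 44-byte WAV header; if the runner writes additional chunks the estimate will be slightly high. wav_seconds is a hypothetical helper name.

```shell
# Estimate the duration of a 24 kHz mono 16-bit PCM WAV from its byte size.
# Assumption: canonical 44-byte header, then 2 bytes per sample.
wav_seconds() {
    awk -v bytes="$1" 'BEGIN { printf "%.2f\n", (bytes - 44) / (24000 * 2) }'
}

# Usage:
#   wav_seconds "$(wc -c < output.wav)"
```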

Streaming

Add --streaming to emit codec output in chunks instead of one batch at the end. Pair it with --speaker to pipe raw f32le PCM to stdout for live playback:

cmake-out/examples/models/voxtral_tts/voxtral_tts_runner \
    --model voxtral_tts_mlx/model.pte \
    --codec voxtral_tts_mlx/codec_decoder.pte \
    --tokenizer voxtral_tts_base/tekken.json \
    --voice voxtral_tts_base/voice_embedding/neutral_female.pt \
    --text "Introducing real-time Voxtral TTS streaming on Apple Silicon with the ExecuTorch MLX backend." \
    --seed 42 \
    --max_new_tokens 200 \
    --streaming \
    --speaker \
  | ffplay -f f32le -sample_rate 24000 -ch_layout mono -nodisp -autoexit -

To play through aplay instead, replace the ffplay stage of the pipe: ... | aplay -f FLOAT_LE -r 24000 -c 1

Re-export

python examples/models/voxtral_tts/export_voxtral_tts.py \
    --model-path ~/models/Voxtral-4B-TTS-2603 \
    --backend mlx \
    --dtype bf16 \
    --qlinear 4w \
    --qembedding 8w \
    --output-dir ./voxtral_tts_exports_mlx_4w

--qembedding 8w auto-selects --qembedding-group-size=128. --qlinear-codec is not yet validated for MLX, so this export keeps the codec unquantized.

Checksums

904131ac1a1e3552ea4ada566c19eb57d654e662f93f906456aa1f8633825688  model.pte
162178ce94732db05bb74d7240a97f2c5a898b8819a29b5d59ebf076aeda8891  codec_decoder.pte