---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
language:
- en
- es
- de
- fr
- it
- vi
- zh
- hi
- ja
tags:
- coreml
- audio
- speech
- tts
- text-to-speech
- multilingual
- autoregressive
- nano-codec
- magpie-tts
- quantized
- int8
base_model: nvidia/magpie_tts_multilingual_357m
library_name: coreml
pipeline_tag: text-to-speech
---

# Magpie-TTS-Multilingual-357M-CoreML-8bit

- [speech-swift](https://github.com/soniqo/speech-swift): Apple SDK
- [soniqo.audio](https://soniqo.audio): website
- [blog](https://soniqo.audio/blog): blog

Core ML port of [NVIDIA Magpie-TTS Multilingual 357M](https://huggingface.co/nvidia/magpie_tts_multilingual_357m), an **autoregressive multi-codebook TTS** model over the [Nano-Codec 22 kHz / 1.89 kbps / 21.5 fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps) vocoder, quantized to **INT8** weight-only for Apple Silicon.

Core ML INT8 bundle for iOS / macOS. Four compiled .mlmodelc packages with scatter-based KV cache (fully static graph, ANE-friendly).

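
The idea behind a scatter-based KV cache can be illustrated with a small pure-Python sketch. It is not the exported Core ML graph: the capacity `MAX_LEN`, the dimension `DIM`, and the helpers `empty_cache`, `scatter_write`, and `attend` are all hypothetical names chosen for illustration. The point is only that the cache is preallocated at a fixed shape and each step overwrites one row through a one-hot blend, so no tensor ever grows and every step runs the same operations.

```python
import math

MAX_LEN = 8  # hypothetical fixed cache capacity (static shape)
DIM = 4      # hypothetical per-head dimension


def empty_cache():
    """Preallocate a fixed-shape cache; its shape never changes afterwards."""
    return [[0.0] * DIM for _ in range(MAX_LEN)]


def scatter_write(cache, pos, vec):
    """Write `vec` into row `pos` with a one-hot blend instead of an append.

    Every row is touched at every step, so the update has a static shape.
    """
    one_hot = [1.0 if i == pos else 0.0 for i in range(MAX_LEN)]
    return [
        [m * v + (1.0 - m) * old for v, old in zip(vec, row)]
        for m, row in zip(one_hot, cache)
    ]


def attend(query, k_cache, v_cache, pos):
    """Single-query attention over the whole cache, masking rows after `pos`."""
    scale = 1.0 / math.sqrt(DIM)
    scores = [
        sum(q * k for q, k in zip(query, row)) * scale if i <= pos else -math.inf
        for i, row in enumerate(k_cache)
    ]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    return [sum(w * row[d] for w, row in zip(weights, v_cache)) for d in range(DIM)]


# Usage: three decode steps, then attend over the three filled rows.
k_cache, v_cache = empty_cache(), empty_cache()
for step in range(3):
    k_cache = scatter_write(k_cache, step, [float(step + 1)] * DIM)
    v_cache = scatter_write(v_cache, step, [float(10 * (step + 1))] * DIM)

# A zero query weights the valid rows equally, so each output entry is the
# mean of the three stored values (about 20.0).
out = attend([0.0] * DIM, k_cache, v_cache, pos=2)
```

Rows beyond the current position stay in the buffer but are masked out of the softmax, which is what lets the cache keep one shape for the whole utterance.
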
## Model

| | |
|---|---|
| Total parameters | 357 M (text encoder 99 M + decoder 90 M + LocalTransformer 1 M + NanoCodec 62 M + audio embeddings) |
| Architecture | Causal Transformer encoder (6L, d=768) + Causal Transformer decoder (12L, d=768) + LocalTransformer codebook AR (1L, d=256) + Causal HiFi-GAN decoder |
| Audio | 8 codebooks × 2024 codes, 22.05 kHz mono, 21.5 fps |
| Languages | EN, ES, DE, FR, IT, VI, ZH, HI, JA |
| Speakers | 5 baked (John, Sofia, Aria, Jason, Leo) |
| Bundle size | 342 MB on disk |
| Layout | 4-bundle Core ML (text_encoder / decoder_prefill / decoder_step / nanocodec_decoder) |

## Files

| File | Size | Description |
|---|---|---|
| `text_encoder.mlmodelc/` | 97 MB | text encoder (INT8) |
| `decoder_prefill.mlmodelc/` | 87 MB | decoder prefill (INT8) |
| `decoder_step.mlmodelc/` | 97 MB | decoder step (INT8) |
| `nanocodec_decoder.mlmodelc/` | 61 MB | nanocodec decoder (FP16) |
| `manifest.json` | <1 KB | SHA256 + sizes manifest |

The 4-bundle layout splits the model into:

- **text_encoder**: runs once per utterance over the phoneme sequence
- **decoder_prefill**: batch-prefills the 110-step baked speaker context into the KV cache (~10× faster than a sequential cold start)
- **decoder_step**: runs a single AR step to produce the next audio frame; shares weights with decoder_prefill
- **nanocodec_decoder**: converts codes to a 22.05 kHz waveform (always FP16; per FluidInference's data, quantizing the codec yields no runtime savings)

## Round-trip validation

End-to-end TTS → faster-whisper large-v3 ASR on one held-out sentence per language (Character Error Rate):

| Language | en | es | de | fr | it | vi | zh | hi |
|---|---|---|---|---|---|---|---|---|
| CER | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | <8% (tone) | <2% (1 added interjection) | mixed-script Whisper artifact |

## Usage

```python
from pathlib import Path

from huggingface_hub import snapshot_download

# 1. Tokenize text in your app (Swift); see speech-swift's KokoroTTS
#    pattern. For Japanese, use Apple's CFStringTokenizer + katakana → IPA.
# 2. Load the 4 sub-models and run the AR loop.
bundle = Path(snapshot_download("aufklarer/Magpie-TTS-Multilingual-357M-CoreML-8bit"))

# Production usage: see https://github.com/soniqo/speech-swift.
```

The production Swift integration handles tokenization, the AR loop, KV-cache management, and audio rendering. This Hugging Face bundle exists for researchers and SDK developers building directly on the Core ML models.

## Source

- Upstream weights: [nvidia/magpie_tts_multilingual_357m](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) (NVIDIA Open Model License)
- Codec: [nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps)
- Paper: [NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference](https://arxiv.org/abs/2508.05835v1)

## License

**NVIDIA Open Model License**, inherited from upstream Magpie-TTS Multilingual. Suitable for commercial use; please review the license text linked above.