---
license: other
license_name: nvidia-open-model-license
license_link: https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
language:
  - en
  - es
  - de
  - fr
  - it
  - vi
  - zh
  - hi
  - ja
tags:
  - coreml
  - audio
  - speech
  - tts
  - text-to-speech
  - multilingual
  - autoregressive
  - nano-codec
  - magpie-tts
  - quantized
  - int8
base_model: nvidia/magpie_tts_multilingual_357m
library_name: coreml
pipeline_tag: text-to-speech
---

# Magpie-TTS-Multilingual-357M-CoreML-8bit

- [speech-swift](https://github.com/soniqo/speech-swift) — Apple SDK
- [soniqo.audio](https://soniqo.audio) — website
- [blog](https://soniqo.audio/blog) — blog

Core ML port of [NVIDIA Magpie-TTS Multilingual 357M](https://huggingface.co/nvidia/magpie_tts_multilingual_357m),
an **autoregressive multi-codebook TTS** model over the [Nano-Codec
22 kHz / 1.89 kbps / 21.5 fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps)
vocoder, quantized to **INT8** weight-only for Apple Silicon.

The bundle targets iOS and macOS and ships four compiled `.mlmodelc` packages. The decoder uses a scatter-based KV cache, which keeps the graph fully static and ANE-friendly.
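
To illustrate what a scatter-based KV cache means in general, here is a minimal pure-Python sketch. It is not the exported graph: the class name `StaticKVCache`, its methods, and the toy dimensions are all hypothetical. The point is only that the buffers are allocated once at a fixed shape and each step overwrites one row, with a mask marking which rows are valid, so tensor shapes never change between steps.

```python
from typing import List, Tuple


class StaticKVCache:
    """Fixed-shape cache: rows are overwritten in place, never appended."""

    def __init__(self, max_len: int, dim: int) -> None:
        # Buffers are allocated once; their shape never changes afterwards.
        self.keys: List[List[float]] = [[0.0] * dim for _ in range(max_len)]
        self.values: List[List[float]] = [[0.0] * dim for _ in range(max_len)]
        self.mask: List[int] = [0] * max_len  # 1 = slot holds real data

    def scatter(self, position: int, key: List[float], value: List[float]) -> None:
        """Write one key/value row at `position` and mark it as valid."""
        self.keys[position] = list(key)
        self.values[position] = list(value)
        self.mask[position] = 1

    def shape(self) -> Tuple[int, int]:
        return (len(self.keys), len(self.keys[0]))


# Toy usage: two steps written into a four-slot cache.
cache = StaticKVCache(max_len=4, dim=2)
cache.scatter(0, [1.0, 2.0], [3.0, 4.0])
cache.scatter(1, [5.0, 6.0], [7.0, 8.0])
```

A cache that grows by concatenation would change shape on every step, which is what a static graph avoids.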

## Model

| | |
|---|---|
| Total parameters | 357 M (text encoder 99 M + decoder 90 M + LocalTransformer 1 M + NanoCodec 62 M + audio embeddings) |
| Architecture | Causal Transformer encoder (6L, d=768) + Causal Transformer decoder (12L, d=768) + LocalTransformer codebook AR (1L, d=256) + Causal HiFi-GAN decoder |
| Audio | 8 codebooks × 2024 codes, 22.05 kHz mono, 21.5 fps |
| Languages | EN, ES, DE, FR, IT, VI, ZH, HI, JA |
| Speakers | 5 baked (John, Sofia, Aria, Jason, Leo) |
| Bundle size | 342 MB on disk |
| Layout | 4-bundle Core ML (text_encoder / decoder_prefill / decoder_step / nanocodec_decoder) |
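
The audio figures in the table are consistent with the codec's nominal 1.89 kbps rate. The short calculation below is a sanity check rather than an official derivation: it assumes all 2024 codes per codebook carry information and treats 21.5 fps as an exact frame rate. The helper `frames_for` is a hypothetical name used here to estimate how many decoder steps a clip needs.

```python
import math

CODEBOOKS = 8           # from the table above
CODES_PER_BOOK = 2024   # from the table above
FRAME_RATE_HZ = 21.5    # nominal frame rate from the table above
SAMPLE_RATE_HZ = 22050  # 22.05 kHz mono

# Each frame carries one code per codebook.
bits_per_frame = CODEBOOKS * math.log2(CODES_PER_BOOK)
bitrate_bps = bits_per_frame * FRAME_RATE_HZ

# Approximate number of waveform samples produced per codec frame.
samples_per_frame = SAMPLE_RATE_HZ / FRAME_RATE_HZ


def frames_for(seconds: float) -> int:
    """Number of autoregressive decoder steps needed for `seconds` of audio."""
    return math.ceil(seconds * FRAME_RATE_HZ)


print(f"{bitrate_bps / 1000:.2f} kbps, about {samples_per_frame:.0f} samples per frame")
```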

## Files

| File | Size | Description |
|---|---|---|
| `text_encoder.mlmodelc/` | 97 MB | text encoder (INT8) |
| `decoder_prefill.mlmodelc/` | 87 MB | decoder prefill (INT8) |
| `decoder_step.mlmodelc/` | 97 MB | decoder step (INT8) |
| `nanocodec_decoder.mlmodelc/` | 61 MB | nanocodec decoder (FP16) |
| `manifest.json` | <1 KB | SHA256 + sizes manifest |
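
A manifest of digests and sizes can be used to check a download before loading the models. The sketch below assumes a hypothetical layout in which the manifest maps each relative file path to a `{"sha256": ..., "size": ...}` entry. The actual schema of `manifest.json` is not documented here, so adapt the key names to what the file really contains. `sha256_of` and `verify_bundle` are illustrative helper names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight files are not read at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(root: Path, entries: Dict[str, dict]) -> List[str]:
    """Return relative paths that are missing or whose size or hash differ.

    `entries` uses the ASSUMED layout {rel_path: {"sha256": str, "size": int}}.
    """
    bad = []
    for rel_path, expected in entries.items():
        target = root / rel_path
        if not target.is_file():
            bad.append(rel_path)
        elif target.stat().st_size != expected["size"]:
            bad.append(rel_path)  # cheap size check before hashing
        elif sha256_of(target) != expected["sha256"]:
            bad.append(rel_path)
    return bad
```

An empty return value means every listed file matched. Checking the size first avoids hashing files that are obviously truncated.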

The 4-bundle layout splits the model into:

- **text_encoder** — runs once per utterance over the phoneme sequence
- **decoder_prefill** — batch-prefills the 110-step baked speaker context into the KV cache (~10× faster than a sequential cold start)
- **decoder_step** — single AR step over the next audio frame; shares weights with decoder_prefill
- **nanocodec_decoder** — codes → 22.05 kHz waveform (always FP16; per FluidInference's data, quantizing the codec yields no runtime savings)
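
The way the four parts fit together can be sketched as a control-flow outline. This is not the speech-swift implementation: every callable below (`encode_text`, `prefill`, `step`, `decode_audio`, `is_end`) is a hypothetical stand-in for one compiled model or a stopping rule, and the assumption that prefill also yields the first frame is made only to keep the loop short.

```python
from typing import Any, Callable, List, Sequence, Tuple

Frame = List[int]  # one code per codebook (8 codebooks in this model)


def synthesize(
    phonemes: Sequence[int],
    encode_text: Callable[[Sequence[int]], Any],
    prefill: Callable[[Any], Tuple[Any, Frame]],
    step: Callable[[Any, Any, Frame], Tuple[Any, Frame]],
    decode_audio: Callable[[List[Frame]], Any],
    is_end: Callable[[Frame], bool],
    max_frames: int = 500,
) -> Any:
    """Outline of the four-stage pipeline using caller-supplied stand-ins."""
    text_states = encode_text(phonemes)   # text_encoder: once per utterance
    cache, frame = prefill(text_states)   # decoder_prefill: speaker context into the cache
    frames: List[Frame] = []
    while len(frames) < max_frames and not is_end(frame):
        frames.append(frame)
        cache, frame = step(text_states, cache, frame)  # decoder_step: one AR step
    return decode_audio(frames)           # nanocodec_decoder: codes to waveform
```

The `max_frames` cap is a safety limit for the sketch, so generation stops even if the end condition never fires.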

## Round-trip validation

End-to-end TTS → faster-whisper large-v3 ASR on one held-out sentence per language, scored as Character Error Rate (CER). Japanese is not included in this table:

| Language | en | es | de | fr | it | vi | zh | hi |
|---|---|---|---|---|---|---|---|---|
| CER | 0.00% | 0.00% | 0.00% | 0.00% | 0.00% | <8% (tone) | <2% (1 added interjection) | mixed-script Whisper artifact |
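
For reference, CER is conventionally the character-level edit distance between the reference text and the ASR transcript, divided by the reference length. The sketch below shows that standard definition. The text normalization used for the numbers above (casing, punctuation, script handling) is not specified here, so this function applies none.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over characters, keeping one row of the DP table."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edits needed to turn `hyp` into `ref`, per reference character."""
    if not ref:
        raise ValueError("reference text must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, a single extra word in the transcript raises CER even when every reference character is recognized correctly.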

## Usage

```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download

# 1. Tokenize text in your app (Swift); see speech-swift's KokoroTTS
#    pattern. For Japanese, use Apple's CFStringTokenizer + katakana → IPA.
# 2. Load the 4 sub-models and run the AR loop.

# Download the bundle (cached locally after the first call).
bundle = Path(snapshot_download("aufklarer/Magpie-TTS-Multilingual-357M-CoreML-8bit"))

# manifest.json holds the SHA256 digests and sizes of the bundle files.
manifest = json.loads((bundle / "manifest.json").read_text())

# Production usage: see https://github.com/soniqo/speech-swift.
```

The production Swift integration handles tokenization, the AR loop, KV-cache
management, and audio rendering. This Hugging Face bundle exists for
researchers and SDK developers building directly on the Core ML models.

## Source

- Upstream weights: [nvidia/magpie_tts_multilingual_357m](https://huggingface.co/nvidia/magpie_tts_multilingual_357m) (NVIDIA Open Model License)
- Codec: [nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps](https://huggingface.co/nvidia/nemo-nano-codec-22khz-1.89kbps-21.5fps)
- Paper: [NanoCodec: Towards High-Quality Ultra Fast Speech LLM Inference](https://arxiv.org/abs/2508.05835v1)

## License

**NVIDIA Open Model License** — inherited from upstream Magpie-TTS Multilingual.
Suitable for commercial use; please review the license text linked above.