Chatterbox Multilingual - ONNX v2
ONNX re-export of ResembleAI/chatterbox multilingual TTS, with three functional upgrades over the existing community export at onnx-community/chatterbox-multilingual-ONNX:
- Classifier-free guidance (CFG) baked into the language-model graph. A single cfg_weight scalar input combines the conditional and unconditional logits inside the graph, so callers don't need to run the LM twice. Matches the behaviour of the PyTorch model.
- Alignment attention exposed. Layers 9, 12, and 13 of the T3 Llama backbone emit their attention weights as an output, enabling an ONNX-driven inference loop to run AlignmentStreamAnalyzer and force EOS on short utterances. Fixes the trailing-speech hallucinations reported in resemble-ai/chatterbox#97.
- Scatter-free graphs. ScatterND ops are replaced with Where / Gather / Concat patterns throughout, allowing ONNX Runtime to capture the decode loop as a CUDA graph.
Plus fp16 LM weights, fp32 numerically-sensitive islands (softmax / residual sums), and a parameterised CFM step count in the vocoder (N=4 / 6 / 10).
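As a rough illustration of the CFG upgrade listed above, the sketch below blends conditional and unconditional logits with a single scalar weight. It assumes the common formulation cond + w * (cond - uncond); the exact expression inside the exported graph is not spelled out in this card, and combine_cfg_logits is a hypothetical helper name, not part of the export.

```python
import numpy as np

def combine_cfg_logits(cond, uncond, cfg_weight):
    """Blend conditional and unconditional logits with a CFG weight.

    Assumed formulation: cond + cfg_weight * (cond - uncond).
    """
    cond = np.asarray(cond, dtype=np.float32)
    uncond = np.asarray(uncond, dtype=np.float32)
    # cfg_weight = 0 returns the conditional logits unchanged;
    # larger values push further away from the unconditional prediction.
    return cond + cfg_weight * (cond - uncond)

cond = np.array([2.0, 0.5, -1.0])
uncond = np.array([1.0, 1.0, -1.0])
guided = combine_cfg_logits(cond, uncond, cfg_weight=0.5)
print(guided)
```

With the combination done inside the graph, a caller only has to feed the batch of two (conditional and unconditional) once per step and pass the weight as an input.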
Files
Each .onnx graph has an accompanying .onnx_data sidecar (ONNX external
data format). Keep them together when downloading.
| File | Purpose |
|---|---|
| embed_tokens_v2.onnx | Token + position + exaggeration → embeddings (batch=2 CFG-ready, scatter-free) |
| language_model_v2.onnx | T3 Llama backbone with CFG + alignment attention, fp16 (~1 GB) |
| conditional_decoder_n4.onnx | S3Gen vocoder, 4 CFM Euler steps (fastest) |
| conditional_decoder_n6.onnx | S3Gen vocoder, 6 CFM Euler steps (balanced, default) |
| conditional_decoder_n10.onnx | S3Gen vocoder, 10 CFM Euler steps (highest quality) |
The speech_encoder.onnx from
onnx-community/chatterbox-multilingual-ONNX
is reused as-is (single-shot on the reference audio, not on the hot path).
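To give a feel for why the vocoder ships in 4 / 6 / 10 step variants, here is a toy sketch of fixed-step Euler integration. It is not the S3Gen solver: the velocity field is a hypothetical stand-in (dx/dt = -x) for the learned flow-matching network, and the uniform time grid is an assumption. It only shows that more Euler steps track the true trajectory more closely at proportionally higher cost.

```python
import math

def euler_integrate(velocity, x0, n_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * velocity(x, t)  # one network call per step in a real vocoder
    return x

def toy_velocity(x, t):
    # Hypothetical velocity field with a known exact solution x(1) = x0 * exp(-1).
    return -x

exact = math.exp(-1.0)
errors = {n: abs(euler_integrate(toy_velocity, 1.0, n) - exact) for n in (4, 6, 10)}
for n, err in errors.items():
    print(f"N={n}: abs error {err:.4f}")
```

The error shrinks as N grows, which mirrors the fastest / balanced / highest-quality trade-off in the table above.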
Inference
Full reference code: the chatterbox_onnx_conversion_scripts
repo includes chatterbox_multi_inference_script.run_inference(...) that
drives the four graphs end-to-end. Minimal sketch:
import onnxruntime as ort
from huggingface_hub import hf_hub_download

repo = "hugbos/chatterbox-multilingual-ONNX-v2"
paths = {}
for name in ("embed_tokens_v2", "language_model_v2", "conditional_decoder_n6"):
    # hf_hub_download returns the local cache path of the downloaded file.
    paths[name] = hf_hub_download(repo_id=repo, filename=f"{name}.onnx")
    # Fetch the external-data sidecar as well so the graph can load its weights.
    hf_hub_download(repo_id=repo, filename=f"{name}.onnx_data")

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
lm = ort.InferenceSession(paths["language_model_v2"], providers=providers)
# ... drive LM with (inputs_embeds, attention_mask, cfg_weight, past_key_values.*)
# ... consume (logits, attn_layers, present.*) per step
See the reference inference script for the full driver including CFG batching, alignment analyzer, and vocoder chunking.
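The alignment analyzer mentioned above can be pictured with a deliberately simplified sketch. This is not the upstream AlignmentStreamAnalyzer: ToyAlignmentMonitor, its tail and patience parameters, and the argmax heuristic are all hypothetical, chosen only to show the idea of forcing EOS once attention has settled on the end of the text.

```python
class ToyAlignmentMonitor:
    """Simplified stand-in for an alignment analyzer (not the upstream class)."""

    def __init__(self, num_text_tokens, tail=2, patience=3):
        self.num_text_tokens = num_text_tokens
        self.tail = tail          # how many final text tokens count as "the end"
        self.patience = patience  # consecutive steps required before forcing EOS
        self.steps_at_end = 0

    def should_force_eos(self, attn_over_text):
        """attn_over_text: attention of the newest speech token over the text tokens."""
        focus = max(range(len(attn_over_text)), key=attn_over_text.__getitem__)
        if focus >= self.num_text_tokens - self.tail:
            self.steps_at_end += 1
        else:
            self.steps_at_end = 0  # attention moved back into the text; reset
        return self.steps_at_end >= self.patience

# Simulate attention sweeping across 5 text tokens, then lingering on the last one.
monitor = ToyAlignmentMonitor(num_text_tokens=5)
decisions = []
for pos in [0, 1, 2, 3, 4, 4, 4, 4]:
    attn = [0.05] * 5
    attn[pos] = 0.8
    decisions.append(monitor.should_force_eos(attn))
print(decisions)
```

In a real decode loop the per-step attention would come from the attn_layers output, and a positive decision would override the sampled token with the stop token.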
Benchmarks
NVIDIA DGX Spark (GB10, aarch64, CUDA 13, sm_121), unified memory,
custom-built onnxruntime-gpu 1.24, warm measurements:
| Backend | Latency (mean, full WAV) | Time-to-first-audio |
|---|---|---|
| PyTorch BF16 (upstream ChatterboxMultilingualTTS) | 3.00 s | 3.00 s |
| ONNX v2 (this export) | 1.68 s | 1.68 s (one-shot) |
| ONNX v2 + chunked streaming | 2.05 s | 610 ms |
Streaming numbers above are the mean over 6 languages (en / nl / de / fr / es / it), with TTS-to-STT roundtrip similarity ≥ 80 % (same bar as non-streaming).
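For readers who want to reproduce the two columns on their own hardware, the following sketch times any iterable of audio chunks. It is not the harness used for the numbers above; measure_streaming and fake_synthesizer are hypothetical names, and the fake synthesizer merely sleeps to stand in for a chunked TTS pipeline.

```python
import time

def measure_streaming(chunks):
    """Return (time_to_first_chunk, total_time, n_chunks), times in seconds."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in chunks:
        if first is None:
            # Time-to-first-audio: delay until the first chunk is available.
            first = time.perf_counter() - start
        count += 1
    total = time.perf_counter() - start
    return first, total, count

def fake_synthesizer(n_chunks=3, delay_s=0.01):
    # Hypothetical stand-in for a chunked TTS pipeline.
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * 320

ttfa, total, n = measure_streaming(fake_synthesizer())
print(f"first chunk after {ttfa * 1000:.1f} ms, {n} chunks in {total * 1000:.1f} ms")
```

For a one-shot backend the iterable yields a single chunk, so both measurements coincide, as in the first two table rows.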
Limitations and known behaviour
- Large: language_model_v2.onnx + sidecar totals ~1 GB fp16. Cold InferenceSession creation is 5-15 s on NVMe.
- CUDA graph capture requires per-shape warmup; see the reference inference script for the prewarm pattern.
- Vocoder variants are separate files, not swappable at runtime; pick one per deployment.
- The alignment analyzer is conservative: it forces EOS once attention firmly aligns past the end of text. It does not help with model-internal hallucinations that occur before end-of-input.
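As a starting point for the CUDA graph item above, this sketch builds a provider list that asks ONNX Runtime's CUDA execution provider to capture CUDA graphs through its enable_cuda_graph option. Whether the reference script configures sessions this way is an assumption, and cuda_graph_providers is a hypothetical helper. ONNX Runtime additionally expects fixed input shapes and I/O binding for graph capture, which is where the per-shape warmup comes in.

```python
def cuda_graph_providers(device_id=0):
    """Provider list asking ONNX Runtime's CUDA EP to capture CUDA graphs."""
    cuda_options = {
        "device_id": str(device_id),
        "enable_cuda_graph": "1",  # CUDA EP provider option
    }
    return [("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"]

providers = cuda_graph_providers()
print(providers)

# Usage (needs a CUDA build of onnxruntime and the downloaded model):
# import onnxruntime as ort
# session = ort.InferenceSession("language_model_v2.onnx", providers=providers)
```

The same provider list can be passed when creating the vocoder session for whichever conditional_decoder variant the deployment has picked.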
Model architecture
For the architectural details (T3 Llama 520M backbone, S3Gen flow-matching vocoder, HiFi-GAN decoder), see the upstream model card at ResembleAI/chatterbox.
Source of the export code
All export scripts are in chatterbox_onnx_conversion_scripts,
forked from VladOS95-cyber/onnx_conversion_scripts
with the above improvements added. You can regenerate these ONNX files
locally with chatterbox_to_onnx_conversion_script.export_model_to_onnx(multilingual=True, ...).
License
MIT, same as upstream. See the LICENSE of the base model.
Citation
If you use this export, please cite both the upstream Chatterbox authors and the source repository:
@misc{chatterbox,
  author = {Resemble AI},
  title = {Chatterbox: multilingual expressive TTS},
  year = {2025},
  url = {https://github.com/resemble-ai/chatterbox}
}
Acknowledgements
- Resemble AI for the original Chatterbox model.
- VladOS95-cyber for the original ONNX conversion scripts that this export is forked from.
- onnx-community for the community ONNX port that this work builds on.