Chatterbox Multilingual — ONNX v2

ONNX re-export of ResembleAI/chatterbox multilingual TTS, with three functional upgrades over the existing community export at onnx-community/chatterbox-multilingual-ONNX:

Classifier-free guidance (CFG) baked into the language-model graph. A single cfg_weight scalar input combines cond and uncond logits inside the graph; callers don't need to run the LM twice. Matches the behaviour of the PyTorch model.
Alignment attention exposed. Layers 9, 12, and 13 of the T3 Llama backbone emit their attention weights as an output, enabling an ONNX-driven inference loop to run AlignmentStreamAnalyzer and force EOS on short utterances. Fixes the trailing-speech hallucinations reported in resemble-ai/chatterbox#97.
Scatter-free graphs. ScatterND ops are replaced with Where/Gather/Concat patterns throughout, allowing ONNX Runtime to capture the decode loop as a CUDA graph.
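
To make the first upgrade concrete, the snippet below sketches in NumPy how a single cfg_weight scalar can combine cond and uncond logits. It assumes the common formulation cond + w * (cond - uncond) and a batch=2 layout with the conditional row first; both are assumptions for illustration, and the exact expression inside the exported graph is not stated here.

```python
import numpy as np


def combine_cfg(logits, cfg_weight):
    """Combine a (2, vocab) logits batch into one (vocab,) row.

    Assumption: row 0 is the conditional pass, row 1 the unconditional pass,
    and guidance is cond + w * (cond - uncond).
    """
    logits = np.asarray(logits, dtype=np.float32)
    cond, uncond = logits[0], logits[1]
    return cond + cfg_weight * (cond - uncond)


# Toy values: with cfg_weight=0 the result is just the conditional logits.
batch = np.array([[1.0, 2.0], [0.0, 0.0]])
no_guidance = combine_cfg(batch, 0.0)
guided = combine_cfg(batch, 0.5)
```

With the combination done inside the graph, the caller only sees the already-guided logits, which is why no second LM call is needed.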

The export also uses fp16 LM weights, keeps numerically sensitive islands (softmax / residual sums) in fp32, and parameterises the CFM step count in the vocoder (N=4 / 6 / 10).
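
As an illustration of what an fp32 island means, here is a minimal NumPy sketch of a softmax that takes fp16 input, does the exponentiation and the sum in fp32, and casts back to fp16. The function name is hypothetical and this is not the exported graph's code, only the general pattern.

```python
import numpy as np


def softmax_fp32_island(logits_fp16):
    """Softmax over the last axis with fp32 internals and fp16 in/out."""
    x = logits_fp16.astype(np.float32)          # enter the fp32 island
    x = x - x.max(axis=-1, keepdims=True)       # stabilise before exp
    e = np.exp(x)
    probs = e / e.sum(axis=-1, keepdims=True)   # the sum is accumulated in fp32
    return probs.astype(np.float16)             # rejoin the fp16 graph


rng = np.random.default_rng(0)
example_logits = rng.standard_normal(4096).astype(np.float16)
example_probs = softmax_fp32_island(example_logits)
```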

Files

Each .onnx graph has an accompanying .onnx_data sidecar (ONNX external data format). Keep them together when downloading.

File	Purpose
`embed_tokens_v2.onnx`	Token + position + exaggeration → embeddings (batch=2 CFG-ready, scatter-free)
`language_model_v2.onnx`	T3 Llama backbone with CFG + alignment attention, fp16 (~1 GB)
`conditional_decoder_n4.onnx`	S3Gen vocoder, 4 CFM Euler steps (fastest)
`conditional_decoder_n6.onnx`	S3Gen vocoder, 6 CFM Euler steps (balanced, default)
`conditional_decoder_n10.onnx`	S3Gen vocoder, 10 CFM Euler steps (highest quality)
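
Because each step count ships as its own file, choosing a vocoder variant amounts to choosing a filename pair (graph plus sidecar). A small sketch, with a hypothetical helper name:

```python
CFM_STEP_VARIANTS = (4, 6, 10)  # step counts listed in the table above


def decoder_files(cfm_steps=6):
    """Return (graph, sidecar) filenames for one vocoder variant."""
    if cfm_steps not in CFM_STEP_VARIANTS:
        raise ValueError(
            f"no exported variant for {cfm_steps} steps; "
            f"choose one of {CFM_STEP_VARIANTS}"
        )
    stem = f"conditional_decoder_n{cfm_steps}"
    return f"{stem}.onnx", f"{stem}.onnx_data"


default_graph, default_sidecar = decoder_files()
```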

The speech_encoder.onnx from onnx-community/chatterbox-multilingual-ONNX is reused as-is (single-shot on the reference audio, not on the hot path).

Inference

Full reference code: the chatterbox_onnx_conversion_scripts repo includes chatterbox_multi_inference_script.run_inference(...) that drives the four graphs end-to-end. Minimal sketch:

import onnxruntime as ort
from huggingface_hub import hf_hub_download

repo = "hugbos/chatterbox-multilingual-ONNX-v2"
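paths = {}
for name in ("embed_tokens_v2", "language_model_v2", "conditional_decoder_n6"):
    # hf_hub_download returns the local cache path; keep it for session creation.
    paths[name] = hf_hub_download(repo_id=repo, filename=f"{name}.onnx")
    # Fetch the external-data sidecar too; it must sit next to the graph.
    hf_hub_download(repo_id=repo, filename=f"{name}.onnx_data")

providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
lm = ort.InferenceSession(paths["language_model_v2"], providers=providers)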
# ... drive LM with (inputs_embeds, attention_mask, cfg_weight, past_key_values.*)
# ... consume (logits, attn_layers, present.*) per step

See the reference inference script for the full driver including CFG batching, alignment analyzer, and vocoder chunking.
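
One piece of the per-step loop that can be shown in isolation is feeding the present.* outputs back in as past_key_values.* inputs. The helper below is a sketch only: it assumes the suffix after each prefix is identical on both sides (for example present.0.key and past_key_values.0.key), which should be checked against the actual graph with lm.get_inputs() and lm.get_outputs().

```python
def next_step_cache(outputs_by_name):
    """Rename `present.*` outputs to `past_key_values.*` inputs for the next step.

    Assumption: suffixes match one-to-one between the two prefixes.
    """
    out_prefix, in_prefix = "present.", "past_key_values."
    return {
        in_prefix + name[len(out_prefix):]: value
        for name, value in outputs_by_name.items()
        if name.startswith(out_prefix)
    }


# With ONNX Runtime the dict can be built from a run, e.g.:
#   names = [o.name for o in lm.get_outputs()]
#   outputs_by_name = dict(zip(names, lm.run(None, feeds)))
step_outputs = {"logits": "L", "present.0.key": "K0", "present.0.value": "V0"}
cache_feeds = next_step_cache(step_outputs)
```

The non-cache outputs (logits, attn_layers) are dropped by the helper and consumed separately by the sampler and the alignment analyzer.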

Benchmarks

NVIDIA DGX Spark (GB10, aarch64, CUDA 13, sm_121), unified memory, custom-built onnxruntime-gpu 1.24, warm measurements:

Backend	Latency (mean, full WAV)	Time-to-first-audio
PyTorch BF16 (upstream `ChatterboxMultilingualTTS`)	3.00 s	3.00 s
ONNX v2 (this export)	1.68 s	1.68 s (one-shot)
ONNX v2 + chunked streaming	2.05 s	610 ms

Streaming numbers above are means over 6 languages (en / nl / de / fr / es / it), with TTS→STT round-trip similarity of at least 80 % (the same bar as for non-streaming).
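
For readers who prefer ratios, the table works out as follows relative to the PyTorch BF16 baseline. The snippet only restates the numbers above; nothing is measured here.

```python
BASELINE_S = 3.00  # PyTorch BF16: latency and time-to-first-audio

# backend -> (full-WAV latency in s, time-to-first-audio in s)
RESULTS = {
    "ONNX v2 (one-shot)": (1.68, 1.68),
    "ONNX v2 + chunked streaming": (2.05, 0.610),
}


def speedup(baseline_s, value_s):
    """How many times faster value_s is than baseline_s."""
    return baseline_s / value_s


for backend, (latency_s, ttfa_s) in RESULTS.items():
    print(
        f"{backend}: latency {speedup(BASELINE_S, latency_s):.2f}x, "
        f"first audio {speedup(BASELINE_S, ttfa_s):.2f}x"
    )
# ONNX v2 (one-shot): latency 1.79x, first audio 1.79x
# ONNX v2 + chunked streaming: latency 1.46x, first audio 4.92x
```

Chunked streaming therefore trades some total latency for a much earlier first audio chunk.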

Limitations and known behaviour

Large: language_model_v2.onnx plus its sidecar totals ~1 GB in fp16. Cold InferenceSession creation takes 5–15 s on NVMe.
CUDA graph capture requires per-shape warmup; see the reference inference script for the prewarm pattern.
Vocoder variants are separate files, not swappable at runtime — pick one per deployment.
The alignment analyzer is conservative: it forces EOS once attention firmly aligns past the end of text. It does not help with model-internal hallucinations that occur before end-of-input.
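
To illustrate the kind of check the last point describes, here is a deliberately simplified stand-in. It is not the upstream AlignmentStreamAnalyzer logic: the class name, the patience and tail parameters, and the argmax rule are all hypothetical. It assumes the caller has already reduced the exposed attention weights to one 1-D row over the text tokens for each newly generated speech frame.

```python
import numpy as np


class EndOfTextDetector:
    """Signal EOS once attention stays on the final text tokens."""

    def __init__(self, num_text_tokens, patience=3, tail=1):
        self.num_text_tokens = num_text_tokens
        self.patience = patience      # consecutive frames required
        self.tail = tail              # how many trailing tokens count as "the end"
        self.frames_at_end = 0

    def step(self, attn_row):
        """attn_row: attention over text tokens for the newest speech frame."""
        focus = int(np.argmax(attn_row))
        if focus >= self.num_text_tokens - self.tail:
            self.frames_at_end += 1
        else:
            self.frames_at_end = 0    # attention moved back into the text
        return self.frames_at_end >= self.patience


detector = EndOfTextDetector(num_text_tokens=5, patience=2)
decisions = [
    detector.step(np.array([0.9, 0.1, 0.0, 0.0, 0.0])),  # still at the start
    detector.step(np.array([0.0, 0.0, 0.1, 0.1, 0.8])),  # first frame at the end
    detector.step(np.array([0.0, 0.0, 0.0, 0.1, 0.9])),  # second frame at the end
]
```

A check like this only fires after the text has been consumed, which matches the limitation above: it cannot catch hallucinations that happen earlier in the utterance.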

Model architecture

For the architectural details (T3 Llama 520M backbone, S3Gen flow-matching vocoder, HiFi-GAN decoder), see the upstream model card at ResembleAI/chatterbox.

Source of the export code

All export scripts are in chatterbox_onnx_conversion_scripts, forked from VladOS95-cyber/onnx_conversion_scripts with the above improvements added. You can regenerate these ONNX files locally with chatterbox_to_onnx_conversion_script.export_model_to_onnx(multilingual=True, ...).

License

MIT, same as upstream. See the LICENSE of the base model.

Citation

If you use this export, please cite both the upstream Chatterbox authors and the source repository:

@misc{chatterbox,
  author = {Resemble AI},
  title  = {Chatterbox: multilingual expressive TTS},
  year   = {2025},
  url    = {https://github.com/resemble-ai/chatterbox}
}

Acknowledgements

Resemble AI for the original Chatterbox model.
VladOS95-cyber for the original ONNX conversion scripts that this export is forked from.
onnx-community for the community ONNX port that this work builds on.
