---
language:
  - ru
  - zh
  - en
  - de
  - es
  - fr
  - ja
  - it
  - pt
  - ko
tags:
  - text-to-speech
  - TTS
  - ONNX
  - qwen3-tts
  - voice-clone
  - streaming
  - qwen3
  - vq
  - rvq
  - ecapa-tdnn
  - multilingual
pipeline_tag: text-to-speech
license: apache-2.0
base_model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
---

# Qwen3-TTS-Streaming ONNX Inference

Pure ONNX Runtime inference pipeline for [Qwen3-TTS-12Hz-0.6B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base), enabling **real-time streaming text-to-speech** with no PyTorch or Transformers dependency at runtime.

## Updates

- As of 2026/04/27, multiple rounds of text can be synthesized with continuous streaming, in addition to the streaming within each round.
- As of 2026/05/04, the pipeline is also independent of the `transformers` library, thanks to a standalone `Qwen3TTSTextProcessor` implementation that mimics the original.
- As of 2026/05/06, this system is integrated into our [streaming-speech-translation](https://huggingface.co/pltobing/streaming-speech-translation) pipeline. This update also revises the per-round reset thresholds: the codec threshold is now 125, slightly longer than the talker threshold of 50, whereas the codec previously used the same threshold as the talker.

## Overview

This repository provides:

- **`qwen3_tts_inferencer_onnx.py`** — Core streaming TTS engine that orchestrates six ONNX models (talker LLM, local talker transformer, codec decoder, speaker encoder, talker codec embedding, text embedding projection) using only NumPy and ONNX Runtime.
- **`test_qwen3-tts-streaming_onnx.py`** — End-to-end test script that simulates LLM streaming text and produces a WAV file.

## Architecture

```
Reference Audio ──► Speaker Encoder ──► Speaker Embedding Vector (voice clone context)
                                           │
                                           ▼
            Text Deltas ──► Talker LLM (Qwen3-0.6B) ──► [Hidden States, VQ Token]
                                                          │
                                                          ▼
                                                Local Transformer ──► 15-codebook RVQ Tokens
                                                                            │
                                                                            ▼
                                                          VQ Token ──►  [4-Frame Chunks] ──► Codec Decoder ──► 24 kHz Waveform Chunks (320 ms)
```

| Component | ONNX Model | Description |
|-----------|------------|-------------|
| Talker LLM | `talker_model_*.onnx` | Qwen3-based talker LM that maps interleaved text and audio token embeddings to hidden states and VQ tokens. Maintains a growing KV-cache across the entire generation. |
| Local Talker | `talker_local_model_*.onnx` | Depth-wise decoder that generates 15 RVQ codebook entries per frame from the talker hidden states and VQ token. Creates a fresh KV-cache for each frame and discards it afterwards. |
| LM Head of Local Talker | `talker_local_lm_head.onnx` | Projection head for each of the 15 codebook outputs of the local talker transformer. |
| Codec Decoder | `codec_decoder_model.onnx` | Decodes VQ+RVQ audio codes back to a 24 kHz waveform. Maintains KV-caches and convolutional caches for streaming decode. |
| Speaker Encoder | `speaker_encoder_model.onnx` | ECAPA-TDNN-based speaker encoder. Produces a 1024-dim speaker embedding vector for voice identity cloning. |
| Talker Codec Embed | `talker_codec_embed_model.onnx` | VQ token embedding for the talker model, with a vocabulary of 2048 tokens. |
| Text Embed Projection | `text_embed_proj_model.onnx` | Text embedding and projection for the talker model. The text embedding has a vocabulary of 151,936 tokens. |
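
The chunk figures in the diagram imply a fixed frame timing. The following sketch works through that arithmetic using only the numbers stated above (4 frames per chunk, 320 ms per chunk, 24 kHz output); the count of 16 codes per frame is an assumption that reads "VQ+RVQ" as one VQ token plus 15 RVQ tokens.

```python
# Figures taken from the architecture diagram above.
SAMPLE_RATE_HZ = 24_000       # codec decoder output rate
FRAMES_PER_CHUNK = 4          # frames decoded together
CHUNK_DURATION_MS = 320       # duration of one waveform chunk

# Assumption: one VQ token plus 15 RVQ tokens per frame.
CODES_PER_FRAME = 1 + 15

frame_duration_ms = CHUNK_DURATION_MS // FRAMES_PER_CHUNK        # 80 ms per frame
frame_rate_hz = 1000 / frame_duration_ms                         # 12.5 frames per second
samples_per_frame = SAMPLE_RATE_HZ * frame_duration_ms // 1000   # 1920 samples
samples_per_chunk = samples_per_frame * FRAMES_PER_CHUNK         # 7680 samples
codes_per_second = frame_rate_hz * CODES_PER_FRAME               # 200 codes per second

print(f"frame: {frame_duration_ms} ms ({frame_rate_hz} Hz), {samples_per_frame} samples")
print(f"chunk: {CHUNK_DURATION_MS} ms, {samples_per_chunk} samples")
print(f"codes per second: {codes_per_second}")
```

Under these figures, the minimum audio latency added by chunking alone is one chunk, i.e. 320 ms, independent of model compute time.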

## Requirements

```
librosa
numpy
onnxruntime-gpu
python-box
soundfile
```

Example installation with a conda environment:

```bash
conda create --name qwen3-tts-streaming-onnx-1 python=3.12
conda activate qwen3-tts-streaming-onnx-1
pip install -r requirements.txt
```

## Directory Structure

```
.
├── test_qwen3-tts-streaming_onnx.py        # End-to-end test script
├── README.md
├── requirements.txt
├── qwen3-tts_onnx/  # FP32
│   ├── talker_model_prefill.onnx
│   ├── talker_model_step.onnx
│   ├── talker_local_model_prefill.onnx
│   ├── talker_local_model_step.onnx
│   ├── talker_local_lm_head.onnx
│   ├── codec_decoder_model.onnx
│   ├── speaker_encoder_model.onnx
│   ├── talker_codec_embed_model.onnx
│   └── text_embed_proj_model.onnx
├── configs/
│   ├── config.json                         # Talker, Local Talker, Speaker Encoder config
│   ├── speech_tokenizer_config.json        # Codec config
│   ├── preprocessor_config.json            # Text Processor configs
│   ├── tokenizer_config.json
│   ├── vocab.json
│   └── merges.txt
├── src/
│   ├── inference/
│   │   └── qwen3_tts_inferencer_onnx.py    # Core ONNX inference engine 
│   └── utils/
│       └── audio_utils.py
├── logs/
│   └── <log_synth>.txt
├── audio_ref/
│   └── <reference_speaker>.[wav|mp3|flac]
└── audio_synth/
    └── <synthesized_example>.wav
```

## Usage

### Basic streaming TTS usage

```bash
python -u test_qwen3-tts-streaming_onnx.py >& logs/log_test-streaming-onnx-1.txt
# Audio is saved automatically to audio_synth/ using the default parameters, text, and language.
```

### Usage with parameters

```bash
python test_qwen3-tts-streaming_onnx.py \
    --onnx_dir qwen3-tts_onnx/ \
    --model_config_path configs/config.json \
    --codec_config_path configs/speech_tokenizer_config.json \
    --preprocessor_config_dir configs/ \
    --temperature 0.75 \
    --top_p 0.85 \
    --top_k 50 \
    --repetition_penalty 9.5 \
    --repetition_window 75 \
    --num_threads 4 \
    --audio_ref_path audio_ref/speaker.[wav|flac|mp3] \
    --out_wav output.wav \
    --text "Text to be synthesized" "Yet another text here" "And another" \
    --language "english"
```
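
For readers unfamiliar with the sampling flags, here is a minimal NumPy sketch of how temperature, top-k, top-p, and a windowed repetition penalty are commonly combined. It is an illustration only: the function name `sample_token` is hypothetical, the penalty uses the common divide-or-multiply rule, and the order of operations in this repository's inferencer may differ.

```python
import numpy as np


def sample_token(logits, history, rng, temperature=0.75, top_p=0.85, top_k=50,
                 repetition_penalty=9.5, repetition_window=75):
    """Sample one token id from a 1-D logits vector (illustrative, not the repo's code)."""
    logits = np.asarray(logits, dtype=np.float64).copy()

    # Penalize tokens that appear in the most recent `repetition_window` steps.
    window = history[-repetition_window:] if repetition_window > 0 else []
    recent = np.unique(np.asarray(window, dtype=np.int64))
    if recent.size and repetition_penalty != 1.0:
        seen = logits[recent]
        logits[recent] = np.where(seen > 0, seen / repetition_penalty,
                                  seen * repetition_penalty)

    # Temperature below 1 sharpens the distribution.
    logits /= temperature

    # Top-k: mask everything below the k-th largest logit.
    if 0 < top_k < logits.size:
        kth = np.partition(logits, -top_k)[-top_k]
        logits[logits < kth] = -np.inf

    # Softmax over the remaining logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()

    # Top-p: keep the smallest set of tokens whose cumulative probability reaches top_p.
    order = np.argsort(-probs)
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    filtered /= filtered.sum()

    return int(rng.choice(probs.size, p=filtered))


rng = np.random.default_rng(0)
token = sample_token([5.0, 1.0, 0.0, -1.0], history=[], rng=rng)
```

With `top_k=1` the sketch reduces to greedy decoding, which is a convenient way to check the repetition penalty in isolation.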

### Available Languages

- You can pass either the language name or its code:

```
"chinese", "zh", "english", "en", "german", "de", "italian", "it", "portuguese", "pt",
"spanish", "es", "japanese", "ja", "korean", "ko", "french", "fr", "russian", "ru"
```
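
Since both names and codes are accepted, a caller may want to normalize user input before passing it on. The helper below is a hypothetical sketch (it is not part of this repository); it only assumes the name and code pairs listed above.

```python
# Name-to-code pairs copied from the list above.
LANGUAGE_CODES = {
    "chinese": "zh", "english": "en", "german": "de", "italian": "it",
    "portuguese": "pt", "spanish": "es", "japanese": "ja", "korean": "ko",
    "french": "fr", "russian": "ru",
}


def normalize_language(value: str) -> str:
    """Return the lowercase language name for a name or a two-letter code."""
    key = value.strip().lower()
    if key in LANGUAGE_CODES:
        return key
    for name, code in LANGUAGE_CODES.items():
        if key == code:
            return name
    raise ValueError(f"Unsupported language: {value!r}")


print(normalize_language("EN"))
print(normalize_language("russian"))
```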

### Programmatic Usage

```python
from src.inference import Qwen3TTSInferencerONNX

# Create inferencer
inferencer = Qwen3TTSInferencerONNX(
    talker_prefill, talker_step, talker_local_prefill, talker_local_step,
    talker_local_lm_head, codec_decoder, codec_decoder_dynamic,
    speaker_encoder, talker_codec_embed, text_embed_proj,
    preprocessor_config_dir, model_config, codec_config,
    audio_ref_path, language,
)
inferencer.reset_turn(reset_cache=True, force_reset_codec_cache=True)

def decode_frames(audio_frames):
    """Push each frame of audio tokens to the codec decoder and yield waveform chunks."""
    for audio_tokens in audio_frames:
        inferencer.push_tokens(audio_tokens)
        yield from inferencer.audio_chunks()


def stream_tts(text_deltas):
    """Yield waveform chunks while text deltas are still arriving."""
    # Stream text and collect audio
    for delta in text_deltas:
        yield from decode_frames(inferencer.push_text(delta))
    # End of text and collect audio
    yield from decode_frames(inferencer.end_text())
    # Drain remaining audio (the text input is filled with the pad token)
    yield from decode_frames(inferencer.drain())
    # Flush any remaining audio tokens
    yield from inferencer.flush()


# Consume the stream, e.g. play or save each chunk as it arrives
for wav in stream_tts(your_llm_stream()):
    ...
```
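
One way to consume the waveform chunks from the loop above is to append them to a WAV file as they arrive. The sketch below uses only the standard library and makes two assumptions that are not confirmed by this README: each chunk is a 1-D sequence of float samples in the range [-1, 1], and the output is mono at 24 kHz. The name `write_wav_stream` is hypothetical.

```python
import array
import sys
import wave


def write_wav_stream(path, chunks, sample_rate=24_000):
    """Write waveform chunks to a 16-bit mono WAV file and return the sample count."""
    total = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit PCM
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Clip each sample to [-1, 1] and scale to the int16 range.
            pcm = array.array(
                "h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in chunk)
            )
            if sys.byteorder == "big":
                pcm.byteswap()  # WAV data is little-endian
            wav_file.writeframes(pcm.tobytes())
            total += len(pcm)
    return total
```

Because the function accepts any iterable, a generator of chunks can be passed directly, so audio is written while synthesis is still running.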

### Command-Line Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--onnx_dir` | str | `qwen3-tts_onnx/` | Directory containing all ONNX models |
| `--preprocessor_config_dir` | str | `configs/` | Directory containing the configuration files for the Qwen3 text tokenizer |
| `--model_config_path` | str | `configs/config.json` | Path to the original model configuration file of Qwen3-TTS-12Hz-0.6B-Base |
| `--codec_config_path` | str | `configs/speech_tokenizer_config.json` | Path to the original codec configuration file of Qwen3-TTS-12Hz-0.6B-Base |
| `--temperature` | float | `0.75` | Sampling temperature |
| `--top_p` | float | `0.85` | Nucleus sampling threshold |
| `--top_k` | int | `50` | Top-k sampling cutoff |
| `--repetition_penalty` | float | `9.5` | Repetition penalty coefficient |
| `--repetition_window` | int | `75` | Window for repetition penalty |
| `--delta_chunk_chars` | int | `1` | Characters per simulated LLM delta |
| `--delta_delay_s` | float | `0.0` | Delay between simulated deltas (seconds) |
| `--num_threads` | int | `4` | Value of `intra_op_num_threads` in the ONNX Runtime session options |
| `--prompt_wav` | str | `audio_ref/male_stewie.mp3` | Reference speaker audio for voice cloning |
| `--out_wav` | str | `out_streaming.wav` | Output WAV file path |
| `--text` | str | *(Russian text)* | Text to synthesize |
| `--language` | str | `russian` | Language of the text to synthesize |
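
The `--delta_chunk_chars` and `--delta_delay_s` options describe a simulated LLM text stream. A minimal sketch of such a simulator is shown below; the function name `simulate_llm_deltas` is hypothetical and the test script's actual implementation may differ.

```python
import time
from typing import Iterator


def simulate_llm_deltas(text: str, chunk_chars: int = 1,
                        delay_s: float = 0.0) -> Iterator[str]:
    """Yield `text` in pieces of `chunk_chars` characters, pausing `delay_s` between pieces."""
    if chunk_chars < 1:
        raise ValueError("chunk_chars must be at least 1")
    for start in range(0, len(text), chunk_chars):
        if start and delay_s > 0:
            time.sleep(delay_s)  # emulate the gap between LLM deltas
        yield text[start:start + chunk_chars]


for delta in simulate_llm_deltas("Hello, world", chunk_chars=5):
    print(repr(delta))
```

Larger `chunk_chars` values approximate an LLM that emits whole words or phrases per delta, while a non-zero `delay_s` makes it possible to observe how the pipeline behaves when text arrives more slowly than audio is produced.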

#### By: [Patrick Lumbantobing](https://www.linkedin.com/in/patrick-lumban-tobing)

#### Copyright@[VertoX-AI](https://www.linkedin.com/company/vertoxai/)

### Citation

If you use this system in your research, please cite:

```bibtex
@misc{vertoxai2026qwen3ttsstreamingonnxcudagraph,
  title={Qwen3-TTS-Streaming-ONNX — VertoX-AI},
  author={Tobing, P. L. and {VertoX-AI}},
  year={2026},
  publisher={HuggingFace},
}
```

## License

This project is licensed under the Apache-2.0 license, the same license as the original Qwen3-TTS.

```
Created by: Patrick Lumbantobing, VertoX-AI
Copyright (c) 2026 VertoX-AI. All rights reserved.

This work is licensed under the Apache License, Version 2.0.
To view a copy of this license, visit [LICENSE](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md).
```

---

## Acknowledgements

- [Qwen3-TTS-12Hz-0.6B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base) for the original Qwen3-TTS model.
- [Qwen3-TTS Technical Report](https://arxiv.org/abs/2601.15621) (Hu et al., 2026).
- [MOSS-TTS-Realtime](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime) as the reference for the streaming engine.
- [ONNX Runtime](https://onnxruntime.ai/) for high-performance cross-platform inference.