---
language:
- ru
- zh
- en
- de
- es
- fr
- ja
- it
- pt
- ko
tags:
- text-to-speech
- TTS
- ONNX
- qwen3-tts
- voice-clone
- streaming
- qwen3
- vq
- rvq
- ecapa-tdnn
- multilingual
pipeline_tag: text-to-speech
license: apache-2.0
base_model: Qwen/Qwen3-TTS-12Hz-0.6B-Base
---

# Qwen3-TTS-Streaming ONNX Inference

Pure ONNX Runtime inference pipeline for [Qwen3-TTS-12Hz-0.6B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base), enabling **real-time streaming text-to-speech** with no PyTorch or Transformers dependency at runtime.

## Updates

- As of 2026/04/27, multiple rounds of text can be synthesized with continuous streaming, in addition to the streaming processing within each round.
- As of 2026/05/04, the pipeline is also independent of the `transformers` library: a standalone `Qwen3TTSTextProcessor` implementation mimics the original.
- As of 2026/05/06, this system is integrated into our [streaming-speech-translation](https://huggingface.co/pltobing/streaming-speech-translation) pipeline. This update also revises the per-round codec reset threshold, which is now slightly longer than the talker threshold (125 and 50, respectively); previously it followed the talker.

## Overview

This repository provides:

- **`qwen3_tts_inferencer_onnx.py`**: Core streaming TTS engine that orchestrates six ONNX models (talker LLM, local talker transformer, codec decoder, speaker encoder, talker codec embedding, text embedding projection) using only NumPy and ONNX Runtime.
- **`test_qwen3-tts-streaming_onnx.py`**: End-to-end test script that simulates LLM streaming text and produces a WAV file.
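
As a rough illustration of what running on ONNX Runtime alone involves, the sketch below maps the model files named in this README's directory listing to paths and opens one session per file. It is not the repository's loader: `model_paths` and `load_sessions` are hypothetical helper names, and the CPU execution provider is an assumption (the requirements list `onnxruntime-gpu`, so a GPU provider may be what the engine actually uses).

```python
from pathlib import Path

# File names as they appear in this README's directory listing.
MODEL_FILES = (
    "talker_model_prefill.onnx",
    "talker_model_step.onnx",
    "talker_local_model_prefill.onnx",
    "talker_local_model_step.onnx",
    "talker_local_lm_head.onnx",
    "codec_decoder_model.onnx",
    "speaker_encoder_model.onnx",
    "talker_codec_embed_model.onnx",
    "text_embed_proj_model.onnx",
)


def model_paths(onnx_dir):
    """Map each model's file stem to its path inside onnx_dir (hypothetical helper)."""
    root = Path(onnx_dir)
    return {Path(name).stem: root / name for name in MODEL_FILES}


def load_sessions(onnx_dir, num_threads=4, providers=("CPUExecutionProvider",)):
    """Open one ONNX Runtime session per model file (hypothetical helper)."""
    # Imported lazily so that model_paths() can be used without onnxruntime installed.
    import onnxruntime as ort

    opts = ort.SessionOptions()
    opts.intra_op_num_threads = num_threads  # same setting that --num_threads controls
    return {
        key: ort.InferenceSession(str(path), sess_options=opts, providers=list(providers))
        for key, path in model_paths(onnx_dir).items()
    }
```

Calling `load_sessions("qwen3-tts_onnx/")` would return a dictionary keyed by file stem, for example `"codec_decoder_model"`; how the real engine names and wires these sessions together may differ.
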
## Architecture

```
Reference Audio ──► Speaker Encoder ──► Speaker Embedding Vector (voice clone context)
                                              │
                                              ▼
Text Deltas ──► Talker LLM (Qwen3-0.6B) ──► [Hidden States, VQ Token]
                                              │
                                              ▼
                          Local Transformer ──► 15-codebook RVQ Tokens
                                              │
                                              ▼
VQ Token ──► [4 Frames Chunks] ──► Codec Decoder ──► 24 kHz Waveform Chunks (320 ms)
```

| Component | ONNX Model | Description |
|-----------|------------|-------------|
| Talker LLM | `talker_model_*.onnx` | Qwen3-based talker LM that maps interleaved text and audio token embeddings to hidden states and VQ tokens. Maintains a growing KV-cache across the entire generation. |
| Local Talker | `talker_local_model_*.onnx` | Depth-wise decoder that generates 15 RVQ codebook entries per frame from the talker hidden states and VQ token. Creates and discards a fresh KV-cache per frame. |
| LM Head of Local Talker | `talker_local_lm_head.onnx` | Projection head for each of the 15 codebook outputs of the local talker transformer. |
| Codec Decoder | `codec_decoder_model.onnx` | Decodes VQ+RVQ audio codes back to a 24 kHz waveform. Maintains KV-caches and convolutional caches for streaming decode. |
| Speaker Encoder | `speaker_encoder_model.onnx` | ECAPA-TDNN-based speaker encoder. Produces a 1024-dim speaker embedding vector for voice identity cloning. |
| Talker Codec Embed | `talker_codec_embed_model.onnx` | VQ embedding for the talker model, with a vocabulary of 2048 tokens. |
| Text Embed Projection | `text_embed_proj_model.onnx` | Text embedding and projection for the talker model. The text embedding has a vocabulary of 151,936 tokens. |

## Requirements

```
librosa
numpy
onnxruntime-gpu
python-box
soundfile
```

Example installation with a conda environment:

```bash
conda create --name qwen3-tts-streaming-onnx-1 python=3.12
conda activate qwen3-tts-streaming-onnx-1
pip install -r requirements.txt
```

## Directory Structure

```
.
├── test_qwen3-tts-streaming_onnx.py      # End-to-end test script
├── README.md
├── requirements.txt
├── qwen3-tts_onnx/                       # FP32
│   ├── talker_model_prefill.onnx
│   ├── talker_model_step.onnx
│   ├── talker_local_model_prefill.onnx
│   ├── talker_local_model_step.onnx
│   ├── talker_local_lm_head.onnx
│   ├── codec_decoder_model.onnx
│   ├── speaker_encoder_model.onnx
│   ├── talker_codec_embed_model.onnx
│   └── text_embed_proj_model.onnx
├── configs/
│   ├── config.json                       # Talker, Local Talker, Speaker Encoder config
│   ├── speech_tokenizer_config.json      # Codec config
│   ├── preprocessor_config.json          # Text Processor configs
│   ├── tokenizer_config.json
│   ├── vocab.json
│   └── merges.txt
├── src/
│   ├── inference/
│   │   └── qwen3_tts_inferencer_onnx.py  # Core ONNX inference engine
│   └── utils/
│       └── audio_utils.py
├── logs/
│   └── .txt
├── audio_ref/
│   └── .[wav|mp3|flac]
└── audio_synth/
    └── .wav
```

## Usage

### Basic streaming TTS usage

```bash
python -u test_qwen3-tts-streaming_onnx.py >& logs/log_test-streaming-onnx-1.txt
# audio automatically saved in audio_synth/ with default parameters, text, language.
```

### Usage with parameters

```bash
python test_qwen3-tts-streaming_onnx.py \
    --onnx_dir qwen3-tts_onnx/ \
    --model_config_path configs/config.json \
    --codec_config_path configs/speech_tokenizer_config.json \
    --preprocessor_config_dir configs/ \
    --temperature 0.75 \
    --top_p 0.85 \
    --top_k 50 \
    --repetition_penalty 9.5 \
    --repetition_window 75 \
    --num_threads 4 \
    --audio_ref_path audio_ref/speaker.[wav|flac|mp3] \
    --out_wav output.wav \
    --text "Text to be synthesized" "Yet another text here" "And another" \
    --language "english"
```

### Available Languages

- Use either the language name or its code:

```
"chinese", "zh",
"english", "en",
"german", "de",
"italian", "it",
"portuguese", "pt",
"spanish", "es",
"japanese", "ja",
"korean", "ko",
"french", "fr",
"russian", "ru"
```

### Programmatic Usage

```python
from src.inference import Qwen3TTSInferencerONNX

# Create inferencer
inferencer = Qwen3TTSInferencerONNX(
    talker_prefill,
    talker_step,
    talker_local_prefill,
    talker_local_step,
    talker_local_lm_head,
    codec_decoder,
    codec_decoder_dynamic,
    speaker_encoder,
    talker_codec_embed,
    text_embed_proj,
    preprocessor_config_dir,
    model_config,
    codec_config,
    audio_ref_path,
    language,
)
inferencer.reset_turn(reset_cache=True, force_reset_codec_cache=True)

# Stream text and collect audio
for delta in your_llm_stream():
    audio_frames = inferencer.push_text(delta)
    ...
    for audio_tokens in audio_frames:
        ...
        inferencer.push_tokens(audio_tokens)
        for wav in inferencer.audio_chunks():
            ...
            yield wav

# Signal end of text and collect audio
audio_frames = inferencer.end_text()
for audio_tokens in audio_frames:
    ...
    inferencer.push_tokens(audio_tokens)
    for wav in inferencer.audio_chunks():
        ...
        yield wav

# Drain remaining audio (the text input is filled with the pad token)
audio_frames = inferencer.drain()
for audio_tokens in audio_frames:
    ...
    inferencer.push_tokens(audio_tokens)
    for wav in inferencer.audio_chunks():
        ...
        yield wav

# Flush any remaining audio tokens
for wav in inferencer.flush():
    ...
    yield wav
```

### Command-Line Arguments

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--onnx_dir` | str | `qwen3-tts_onnx/` | Directory containing all ONNX models |
| `--preprocessor_config_dir` | str | `configs/` | Directory containing the configuration files for the Qwen3 text tokenizer |
| `--model_config_path` | str | `configs/config.json` | Path to the original model configuration file for Qwen3-TTS-12Hz-0.6B-Base |
| `--codec_config_path` | str | `configs/speech_tokenizer_config.json` | Path to the original configuration file for the codec of Qwen3-TTS-12Hz-0.6B-Base |
| `--temperature` | float | `0.75` | Sampling temperature |
| `--top_p` | float | `0.85` | Nucleus sampling threshold |
| `--top_k` | int | `50` | Top-k sampling cutoff |
| `--repetition_penalty` | float | `9.5` | Repetition penalty coefficient |
| `--repetition_window` | int | `75` | Window for repetition penalty |
| `--delta_chunk_chars` | int | `1` | Characters per simulated LLM delta |
| `--delta_delay_s` | float | `0.0` | Delay between simulated deltas (seconds) |
| `--num_threads` | int | `4` | Value of `intra_op_num_threads` in the ONNX Runtime session options |
| `--prompt_wav` | str | `audio_ref/male_stewie.mp3` | Reference speaker audio for voice cloning |
| `--out_wav` | str | `out_streaming.wav` | Output WAV file path |
| `--text` | str | *(Russian text)* | Text to synthesize |
| `--language` | str | `russian` | Language of the text to synthesize |

#### By: [Patrick Lumbantobing](https://www.linkedin.com/in/patrick-lumban-tobing)

#### Copyright@[VertoX-AI](https://www.linkedin.com/company/vertoxai/)

### Citation

If you use this system in your research, please cite:

```bibtex
@misc{vertoxai2026qwen3ttsstreamingonnxcudagraph,
  title={Qwen3-TTS-Streaming-ONNX --- VertoX-AI},
  author={Tobing, P. L. and {VertoX-AI}},
  year={2026},
  publisher={HuggingFace},
}
```

## License

This project is licensed under Apache-2.0, the same license as the original Qwen3-TTS.

```
Created by: Patrick Lumbantobing, Vertox-AI
Copyright (c) 2026 Vertox-AI. All rights reserved.

This work is licensed under the Apache License, Version 2.0.
To view a copy of this license, visit [LICENSE](https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md).
```

---

## Acknowledgements

- [Qwen3-TTS-12Hz-0.6B-Base](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base) for the original Qwen3-TTS model.
- [Qwen3-TTS Technical Report](https://arxiv.org/abs/2601.15621) (Hu et al., 2026).
- [MOSS-TTS-Realtime](https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime) for the reference on the streaming engine.
- [ONNX Runtime](https://onnxruntime.ai/) for high-performance cross-platform inference.