Qwen3-TTS-12Hz-1.7B-Base (ONNX)
Overview
Qwen3-TTS-12Hz-1.7B-Base-ONNX is an ONNX export of the Qwen3-TTS-12Hz-1.7B-Base model, optimized for inference. The model implements a discrete multi-codebook language model (LM) architecture capable of rapid voice cloning from 3 seconds of reference audio, with enhanced prosody and vocal fidelity.
The ONNX conversion enables low-latency, cross-platform deployment on both high-end CPUs and NVIDIA GPUs.
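As a minimal sketch of the CPU/GPU choice mentioned above, the helper below picks ONNX Runtime execution providers in a preferred order. The function name `pick_providers` and the preference order are assumptions made for illustration; only the provider strings and `onnxruntime.get_available_providers()` come from ONNX Runtime itself.

```python
from typing import Iterable, List

# Preference order (an assumption): try CUDA first, then fall back to CPU.
# Both strings are standard ONNX Runtime execution provider names.
PREFERRED = ("CUDAExecutionProvider", "CPUExecutionProvider")


def pick_providers(available: Iterable[str]) -> List[str]:
    """Return the preferred providers that are actually available, in order."""
    available_set = set(available)
    chosen = [p for p in PREFERRED if p in available_set]
    if not chosen:
        raise RuntimeError(
            f"No supported execution provider among: {sorted(available_set)}"
        )
    return chosen


if __name__ == "__main__":
    try:
        import onnxruntime as ort
    except ImportError:
        print("onnxruntime is not installed; nothing to select")
    else:
        providers = pick_providers(ort.get_available_providers())
        print("Using providers:", providers)
        # A session could then be created with, for example:
        # ort.InferenceSession("talker_prefill.onnx", providers=providers)
```

Passing the resulting list as `providers=` when creating each session lets the same script run on machines with or without an NVIDIA GPU.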
Key Features
- Zero-Shot Voice Cloning: High-similarity cloning (>97%) using only 3 seconds of reference audio.
- Ultra-Low Latency: End-to-end streaming generation latency as low as 97 ms.
- Decoupled Architecture: Separate components for text processing, token generation, and speech synthesis.
- Multilingual Excellence: Native-level pronunciation for 10 major global languages.
- Vocal Richness: 2048-dimensional speaker embeddings for superior similarity.
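To put the "12Hz" in the model name and the 3-second reference clip in perspective, the arithmetic below relates audio duration to codec frames. It assumes a nominal frame rate of exactly 12 frames per second (taken from the model name; the real tokenizer rate may differ) and the 24 kHz output rate listed in the component table, so both values are left as parameters.

```python
import math


def codec_frames(duration_s: float, frame_rate_hz: float = 12.0) -> int:
    """Number of codec frames needed to cover `duration_s` seconds of audio."""
    return math.ceil(duration_s * frame_rate_hz)


def samples_per_frame(sample_rate_hz: int = 24_000, frame_rate_hz: float = 12.0) -> float:
    """Waveform samples represented by one codec frame."""
    return sample_rate_hz / frame_rate_hz


def frame_duration_ms(frame_rate_hz: float = 12.0) -> float:
    """Duration of one codec frame in milliseconds."""
    return 1000.0 / frame_rate_hz


if __name__ == "__main__":
    # Assumed nominal values: 12 frames/s and 24 kHz output.
    print("frames in a 3 s clip:", codec_frames(3.0))
    print("samples per frame:", samples_per_frame())
    print("frame duration (ms):", round(frame_duration_ms(), 1))
```

Under these assumptions a 3-second clip spans 36 frames, and each frame stands for 2000 output samples, or roughly 83 ms of audio.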
Model Architecture
The model is a modular pipeline consisting of:
- Talker (Transformer): 28 layers (Hidden Size: 2048, 8 KV Heads).
- Code Predictor: 5-layer Transformer for multi-codebook resolution.
- Vocoder: BigVGAN-based high-fidelity speech decoder.
- Speaker Encoder: ECAPA-TDNN for embedding extraction.
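The Talker figures above (28 layers, 8 KV heads) allow a rough estimate of KV cache memory during decoding. The sketch below is a back-of-the-envelope calculation only: the head dimension of 128 and the 2-byte (float16) element size are assumptions not stated in this card, and the Code Predictor's own cache is ignored.

```python
def kv_cache_bytes_per_token(
    num_layers: int = 28,      # from the Talker spec above
    num_kv_heads: int = 8,     # from the Talker spec above
    head_dim: int = 128,       # ASSUMPTION: not stated in this card
    bytes_per_value: int = 2,  # ASSUMPTION: float16 cache
) -> int:
    """Bytes of KV cache added per generated or prompt token."""
    # Factor 2 = one key tensor plus one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_mib(seq_len: int, **kwargs: int) -> float:
    """Total KV cache size in MiB for a sequence of `seq_len` tokens."""
    return seq_len * kv_cache_bytes_per_token(**kwargs) / (1024 ** 2)


if __name__ == "__main__":
    print("bytes per token:", kv_cache_bytes_per_token())
    print("MiB for 1024 tokens:", kv_cache_mib(1024))
```

With these assumed values the cache grows by 114,688 bytes per token, or 112 MiB for a 1024-token sequence; a float32 cache would double that.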
Model Components (Modular Specs)
| Component | File | Description | Output |
|---|---|---|---|
| Talker Prefill | talker_prefill.onnx | Initial text processing and KV cache setup. | Logits and hidden states. |
| Talker Decode | talker_decode.onnx | Iterative token generation. | New KV cache. |
| Code Predictor | code_predictor.onnx | Multi-codebook prediction (12Hz). | Multi-codebook codes. |
| Vocoder | vocoder.onnx | Final waveform synthesis. | 24 kHz audio. |
| Speaker Encoder | speaker_encoder.onnx | Reference audio analysis. | 2048-dim embedding. |
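The way these components hand data to one another can be sketched as a prefill step followed by a decode loop. This is an illustrative outline only: the callables `prefill`, `decode`, `predict_codes`, and `vocode` are hypothetical wrappers around the corresponding ONNX sessions (including any sampling), and their signatures are assumptions, since the actual tensor names and shapes are not documented here.

```python
from typing import Any, Callable, List, Sequence, Tuple


def synthesize(
    text_ids: Sequence[int],
    speaker_embedding: Sequence[float],
    prefill: Callable[[Sequence[int], Sequence[float]], Tuple[int, Any, Any]],
    decode: Callable[[int, Any], Tuple[int, Any, Any]],
    predict_codes: Callable[[int, Any], List[int]],
    vocode: Callable[[List[List[int]]], List[float]],
    eos_id: int,
    max_steps: int = 512,
) -> List[float]:
    """Run prefill once, then decode step by step until EOS or max_steps."""
    # Prefill: process the whole prompt once and build the initial KV cache.
    token, hidden, kv_cache = prefill(text_ids, speaker_embedding)

    frames: List[List[int]] = []
    for _ in range(max_steps):
        if token == eos_id:
            break
        # Code predictor: expand the current step into one multi-codebook frame.
        frames.append(predict_codes(token, hidden))
        # Decode: one autoregressive step that reuses and extends the KV cache.
        token, hidden, kv_cache = decode(token, kv_cache)

    # Vocoder: turn the collected code frames into waveform samples.
    return vocode(frames)
```

Keeping each stage behind a plain callable mirrors the decoupled design: any stage can be swapped, for example to run the vocoder on CPU while the Talker runs on GPU, without touching the loop.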
Installation
pip install onnxruntime-gpu librosa soundfile numpy torch transformers
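After installing, a quick sanity check can confirm that the packages import and that the five ONNX files from the component table are present. The sketch below uses only the standard library; the helper names and the assumption that all files sit flat in one directory are illustrative, not part of this repository.

```python
from importlib.util import find_spec
from pathlib import Path
from typing import Iterable, List

# Import names for the packages in the pip command above
# (the onnxruntime-gpu package is imported as "onnxruntime").
REQUIRED_MODULES = ("onnxruntime", "librosa", "soundfile", "numpy", "torch", "transformers")

# File names taken from the component table.
MODEL_FILES = (
    "talker_prefill.onnx",
    "talker_decode.onnx",
    "code_predictor.onnx",
    "vocoder.onnx",
    "speaker_encoder.onnx",
)


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


def missing_model_files(model_dir: str, names: Iterable[str] = MODEL_FILES) -> List[str]:
    """Return expected model files that are absent from `model_dir`."""
    root = Path(model_dir)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    # ASSUMPTION: the model files were downloaded into the current directory.
    print("Missing Python packages:", missing_modules() or "none")
    print("Missing model files:", missing_model_files(".") or "none")
```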
Base model: Qwen/Qwen3-TTS-12Hz-1.7B-Base (ONNX export published as romara-labs/Qwen3-TTS-12Hz-1.7B-Base-ONNX)