🎙️ Qwen3-TTS-12Hz-1.7B-Base (ONNX)

🚀 Overview

Qwen3-TTS-12Hz-1.7B-Base-ONNX is an ONNX export of Qwen/Qwen3-TTS-12Hz-1.7B-Base, optimized for inference. The model uses a discrete multi-codebook Language Model (LM) architecture and can clone a voice from roughly 3 seconds of reference audio, with enhanced prosody and vocal fidelity.

The ONNX conversion enables low-latency, cross-platform deployment on both high-end CPUs and NVIDIA GPUs.
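As a minimal sketch of the CPU/GPU choice, the helper below prefers the CUDA execution provider when ONNX Runtime reports it and falls back to CPU otherwise. The function name `pick_providers` is hypothetical; `get_available_providers` and the two provider names are standard ONNX Runtime identifiers.

```python
def pick_providers(available):
    """Return execution providers in preference order: CUDA first, CPU as fallback."""
    preferred = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    chosen = [name for name in preferred if name in available]
    # Fall back to CPU if the reported list is empty or contains neither provider.
    return chosen or ["CPUExecutionProvider"]


try:
    import onnxruntime as ort

    available = ort.get_available_providers()
except ImportError:
    # onnxruntime is not installed; the helper then falls back to the CPU provider.
    available = []

print(pick_providers(available))
```

The returned list can be passed as the `providers` argument of `onnxruntime.InferenceSession`.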

💎 Key Features

Zero-Shot Voice Cloning: High-similarity cloning (>97%) using only 3 seconds of reference audio.
Ultra-Low Latency: End-to-end streaming generation as low as 97ms.
Decoupled Architecture: Separate components for text processing, token generation, and speech synthesis.
Multilingual Excellence: Native-level pronunciation for 10 major global languages.
Vocal Richness: 2048-dimensional speaker embeddings for superior similarity.
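To put the streaming figures in context, the arithmetic below estimates how much audio one codec frame covers. It is a rough sketch that assumes the "12Hz" in the model name is the exact frame rate (the true rate should be checked against the base model's configuration) and uses the 24 kHz output rate listed in the component table. `frame_budget` is a hypothetical helper.

```python
def frame_budget(frame_rate_hz, sample_rate_hz):
    """Return (milliseconds of audio per code frame, output samples per code frame)."""
    frame_ms = 1000.0 / frame_rate_hz
    samples_per_frame = sample_rate_hz / frame_rate_hz
    return frame_ms, samples_per_frame


# Assumption: "12Hz" in the model name is the exact codec frame rate.
# 24 kHz is the vocoder output rate listed in the component table.
frame_ms, samples = frame_budget(frame_rate_hz=12, sample_rate_hz=24_000)
print(f"{frame_ms:.1f} ms of audio per frame, {samples:.0f} samples per frame")
```

Under these assumptions, real-time streaming requires each frame to be generated in less time than the audio it represents.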

🏗️ Model Architecture

The model is a modular pipeline consisting of:

Talker (Transformer): 28 layers (Hidden Size: 2048, 8 KV Heads).
Code Predictor: 5-layer Transformer for multi-codec resolution.
Vocoder: BigVGAN-based high-fidelity speech decoder.
Speaker Encoder: ECAPA-TDNN for embedding extraction.
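The Talker dimensions above allow a rough estimate of KV cache memory. The sketch below takes the 28 layers and 8 KV heads from this card; the head dimension of 128 and the 2-byte (fp16) values are assumptions, not figures stated here, so treat the result as an order-of-magnitude guide.

```python
def kv_cache_bytes(num_tokens, num_layers=28, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in bytes for the Talker.

    num_layers and num_kv_heads come from the model card; head_dim=128 and
    bytes_per_value=2 (fp16) are assumptions made for illustration.
    """
    # Factor of 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


per_token = kv_cache_bytes(1)
print(f"{per_token / 1024:.0f} KiB per token")
print(f"{kv_cache_bytes(1000) / 2**20:.1f} MiB for 1000 tokens")
```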

📦 Model Components (Modular Specs)

Component	File	Description	Output
Talker Prefill	`talker_prefill.onnx`	Initial text processing & KV Cache setup.	Logits & Hidden states.
Talker Decode	`talker_decode.onnx`	Iterative token generation logic.	New KV Cache.
Code Predictor	`code_predictor.onnx`	Multi-codec prediction (12Hz).	Multi-codebook codes.
Vocoder	`vocoder.onnx`	Final waveform synthesis.	24kHz Audio.
Speaker Enc.	`speaker_encoder.onnx`	Reference audio analysis.	2048-dim Embedding.
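The table suggests the following control flow between components. This is a sketch only: the `Stages` wrappers, the `is_finished` check, and the way state is passed between steps are all hypothetical, since the card does not document the tensor names or the stopping criterion. Dummy stages stand in for the ONNX sessions so the loop can run without model files.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Stages:
    """Hypothetical wrappers, one per exported ONNX graph."""
    encode_speaker: Callable[[Any], Any]   # speaker_encoder.onnx
    prefill: Callable[[Any, Any], Any]     # talker_prefill.onnx
    decode: Callable[[Any], Any]           # talker_decode.onnx
    predict_codes: Callable[[Any], Any]    # code_predictor.onnx
    vocode: Callable[[List[Any]], Any]     # vocoder.onnx
    is_finished: Callable[[Any], bool]     # assumed end-of-speech check


def synthesize(text_ids, reference_audio, stages, max_frames=256):
    """Run speaker encoding, prefill, the decode loop, then vocoding."""
    speaker = stages.encode_speaker(reference_audio)
    state = stages.prefill(text_ids, speaker)
    frames = []
    while len(frames) < max_frames and not stages.is_finished(state):
        frames.append(stages.predict_codes(state))  # codes for one frame
        state = stages.decode(state)                # advance one step
    return stages.vocode(frames)


# Dummy stages: integers stand in for model state, lists for codes and audio.
dummy = Stages(
    encode_speaker=lambda audio: "speaker-embedding",
    prefill=lambda ids, speaker: 0,
    decode=lambda step: step + 1,
    predict_codes=lambda step: [step] * 4,
    vocode=lambda frames: [code for frame in frames for code in frame],
    is_finished=lambda step: step >= 3,
)
print(synthesize([1, 2, 3], None, dummy))
```

Keeping each stage behind its own callable mirrors the decoupled export: any single component can be swapped or profiled without touching the loop.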

🛠️ Installation

pip install onnxruntime-gpu librosa soundfile numpy torch transformers
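After installing, one way to check the setup is to load each exported graph and print its input and output signature. The sketch below assumes all five files sit in one flat local directory; the `model_dir` path and both helper names are hypothetical. Only standard ONNX Runtime calls (`InferenceSession`, `get_inputs`, `get_outputs`) are used, and the CPU provider is enough for inspection.

```python
from pathlib import Path

# File names taken from the component table in this card.
COMPONENT_FILES = [
    "talker_prefill.onnx",
    "talker_decode.onnx",
    "code_predictor.onnx",
    "vocoder.onnx",
    "speaker_encoder.onnx",
]


def missing_components(model_dir):
    """Return the component files that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in COMPONENT_FILES if not (root / name).is_file()]


def describe_components(model_dir):
    """Print the input and output signature of every exported graph."""
    import onnxruntime as ort  # lazy import: only needed for inspection

    for name in COMPONENT_FILES:
        session = ort.InferenceSession(
            str(Path(model_dir) / name), providers=["CPUExecutionProvider"]
        )
        print(name)
        for arg in session.get_inputs():
            print("  input ", arg.name, arg.shape, arg.type)
        for arg in session.get_outputs():
            print("  output", arg.name, arg.shape, arg.type)


model_dir = "./Qwen3-TTS-12Hz-1.7B-Base-ONNX"  # hypothetical local path
missing = missing_components(model_dir)
if missing:
    print("missing files:", missing)
else:
    describe_components(model_dir)
```

The printed names and shapes are the authoritative reference for building inputs, since this card does not list them.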


Model: romara-labs/Qwen3-TTS-12Hz-1.7B-Base-ONNX
Base model: Qwen/Qwen3-TTS-12Hz-1.7B-Base
