🎙️ Qwen3-TTS-12Hz-1.7B-Base (ONNX)

🚀 Overview

Qwen3-TTS-12Hz-1.7B-Base-ONNX is an ONNX export of the Qwen3-TTS-12Hz-1.7B-Base model. It implements a discrete multi-codec language model (LM) architecture that can clone a voice from about 3 seconds of reference audio, with enhanced prosody and vocal fidelity.

The ONNX conversion enables low-latency, cross-platform deployment on both high-end CPUs and NVIDIA GPUs.
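As a minimal sketch of how the CPU/GPU choice might be handled at load time, the helper below picks an execution provider from whatever ONNX Runtime reports as available. The preference order and the names `choose_providers` and `open_session` are assumptions made for illustration and are not part of this repository; only `onnxruntime.get_available_providers()` and `onnxruntime.InferenceSession(..., providers=...)` are real ONNX Runtime calls.

```python
from typing import List, Sequence

# ASSUMPTION: preference order chosen for this sketch (GPU first, then CPU).
PREFERRED_PROVIDERS = ("CUDAExecutionProvider", "CPUExecutionProvider")


def choose_providers(available: Sequence[str]) -> List[str]:
    """Return the preferred providers that are actually available, in order."""
    chosen = [p for p in PREFERRED_PROVIDERS if p in available]
    # Fall back to the CPU provider if nothing from the preferred list matched.
    return chosen or ["CPUExecutionProvider"]


def open_session(model_path: str):
    """Create an InferenceSession on the best available provider (hypothetical helper)."""
    # Imported lazily so choose_providers() stays usable without onnxruntime installed.
    import onnxruntime as ort

    providers = choose_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)
```

On a machine without CUDA, `choose_providers` returns only the CPU provider, so the same loading code can be used on both kinds of hardware.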

💎 Key Features

  • Zero-Shot Voice Cloning: High-similarity cloning (>97%) using only 3 seconds of reference audio.
  • Ultra-Low Latency: End-to-end streaming generation with latency as low as 97 ms.
  • Decoupled Architecture: Separate components for text processing, token generation, and speech synthesis.
  • Multilingual Excellence: Native-level pronunciation for 10 major global languages.
  • Vocal Richness: 2048-dimensional speaker embeddings for superior similarity.

🏗️ Model Architecture

The model is a modular pipeline with four components:

  • Talker (Transformer): 28 layers (Hidden Size: 2048, 8 KV Heads).
  • Code Predictor: 5-layer Transformer for multi-codec resolution.
  • Vocoder: BigVGAN-based high-fidelity speech decoder.
  • Speaker Encoder: ECAPA-TDNN for embedding extraction.
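To give a feel for what the Talker dimensions above imply for memory, here is a rough KV cache estimate. The layer count, hidden size, and KV head count come from the list above; the per-head dimension (`head_dim = 128`) and the float16 cache precision are assumptions that this card does not state, so treat the result as an order-of-magnitude sketch rather than a measured figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TalkerConfig:
    num_layers: int = 28      # from the architecture list
    hidden_size: int = 2048   # from the architecture list (not used in the estimate)
    num_kv_heads: int = 8     # from the architecture list
    head_dim: int = 128       # ASSUMPTION: not stated in this card
    bytes_per_value: int = 2  # ASSUMPTION: float16 cache


def kv_cache_bytes(cfg: TalkerConfig, num_tokens: int) -> int:
    """Bytes needed to cache keys and values for `num_tokens` positions."""
    # Factor 2 covers one key tensor and one value tensor per layer.
    per_token = 2 * cfg.num_layers * cfg.num_kv_heads * cfg.head_dim * cfg.bytes_per_value
    return per_token * num_tokens


cfg = TalkerConfig()
print(kv_cache_bytes(cfg, 1))     # bytes per cached position
print(kv_cache_bytes(cfg, 1000))  # bytes for a 1000-position context
```

Under these assumptions the cache costs about 112 KiB per position, so the decode step stays cheap in memory even for long utterances.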

📦 Model Components (Modular Specs)

| Component | File | Description | Output |
|---|---|---|---|
| Talker Prefill | talker_prefill.onnx | Initial text processing and KV cache setup. | Logits and hidden states. |
| Talker Decode | talker_decode.onnx | Iterative token generation logic. | New KV cache. |
| Code Predictor | code_predictor.onnx | Multi-codec prediction (12 Hz). | Multi-codebook codes. |
| Vocoder | vocoder.onnx | Final waveform synthesis. | 24 kHz audio. |
| Speaker Encoder | speaker_encoder.onnx | Reference audio analysis. | 2048-dim embedding. |
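The 12 Hz code rate and 24 kHz output rate in the table relate through simple arithmetic, sketched below. The frame rate is taken at its nominal value from the model name, which is an assumption: the exact rate used by the tokenizer may differ, so pass the real value if you know it. The function names are hypothetical.

```python
import math

FRAME_RATE_HZ = 12.0     # ASSUMPTION: nominal rate read from "12Hz" in the model name
SAMPLE_RATE_HZ = 24_000  # from the Vocoder row of the table


def frames_for_duration(seconds: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of code frames needed to cover `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)


def samples_per_frame(sample_rate_hz: float = SAMPLE_RATE_HZ,
                      frame_rate_hz: float = FRAME_RATE_HZ) -> float:
    """Waveform samples the vocoder produces per code frame."""
    return sample_rate_hz / frame_rate_hz


print(frames_for_duration(3.0))  # frames covering a 3-second clip
print(samples_per_frame())       # audio samples per frame
```

At the nominal rates, a 3-second clip corresponds to 36 code frames and each frame expands to 2000 audio samples.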

🛠️ Installation

pip install onnxruntime-gpu librosa soundfile numpy torch transformers
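This card does not list the input and output tensor names of the exported graphs, so a reasonable first step after installing is to inspect them. The sketch below uses ONNX Runtime's `get_inputs()` and `get_outputs()` session methods; `describe_io` and `describe_model` are hypothetical helper names, and the local file path is an assumption about where you saved the model.

```python
def describe_io(session) -> dict:
    """Collect (name, shape, type) for every input and output of a session."""
    return {
        "inputs": [(i.name, i.shape, i.type) for i in session.get_inputs()],
        "outputs": [(o.name, o.shape, o.type) for o in session.get_outputs()],
    }


def describe_model(model_path: str) -> dict:
    """Open an ONNX file on the CPU provider and describe its tensors."""
    # Imported lazily so describe_io() can be used with any session-like object.
    import onnxruntime as ort

    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    return describe_io(session)
```

For example, `describe_model("talker_prefill.onnx")` would list the tensors the prefill graph expects, assuming the file has been downloaded to the working directory.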