---
license: apache-2.0
tags:
  - onnx
  - tts
  - qwen3-tts
  - text-to-speech
base_model: Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice
---

# Qwen3-TTS 12Hz 0.6B CustomVoice — ONNX

ONNX export of [Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice) for local inference with C# / ONNX Runtime.

## Files

| File | Description | Size |
|------|-------------|------|
| `talker_prefill.onnx` + `.data` | Talker LM prefill (28 layers) | ~1.7 GB |
| `talker_decode.onnx` + `.data` | Talker LM single-step decode | ~1.7 GB |
| `code_predictor.onnx` | Code Predictor (5 layers, 15 groups) | ~420 MB |
| `vocoder.onnx` + `.data` | Vocoder decoder (24 kHz output) | ~437 MB |
| `embeddings/` | Text/codec embeddings as .npy + config | ~1.4 GB |
| `tokenizer/` | BPE tokenizer (vocab.json, merges.txt) | ~4 MB |
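Before running inference it can help to confirm that the download is complete. The following is a minimal sketch, not part of the repository: the function name `missing_model_files` is hypothetical, it assumes all files from the table sit directly in one model directory, and because the table only says `+ .data`, it accepts any file whose name starts with the model stem and ends in `.data` rather than assuming an exact external-data filename.

```python
from pathlib import Path

# File and directory names taken from the table above.
ONNX_FILES = [
    "talker_prefill.onnx",
    "talker_decode.onnx",
    "code_predictor.onnx",
    "vocoder.onnx",
]
# Models the table lists with an accompanying `.data` file.
NEEDS_EXTERNAL_DATA = {"talker_prefill.onnx", "talker_decode.onnx", "vocoder.onnx"}
DIRECTORIES = ["embeddings", "tokenizer"]


def missing_model_files(model_dir):
    """Return a list of expected entries that are absent from model_dir.

    ASSUMPTION: a flat layout with every file in the directory root.
    ASSUMPTION: the external data file matches '<stem>*.data'; the exact
    name is not stated in the table, so a glob pattern is used instead.
    """
    root = Path(model_dir)
    missing = []
    for name in ONNX_FILES:
        if not (root / name).is_file():
            missing.append(name)
        if name in NEEDS_EXTERNAL_DATA:
            stem = name[: -len(".onnx")]
            if not any(root.glob(f"{stem}*.data")):
                missing.append(f"{stem}*.data")
    for sub in DIRECTORIES:
        if not (root / sub).is_dir():
            missing.append(sub + "/")
    return missing
```

Calling `missing_model_files("python/onnx_runtime")` returns an empty list when everything is present, or the names (or glob patterns) of whatever is still missing.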

## Usage with C#

```bash
# Clone the app repo
git clone https://github.com/elbruno/qwen-labs-cs.git
cd qwen-labs-cs

# Download models
python python/download_onnx_models.py --repo-id elbruno/Qwen3-TTS-12Hz-0.6B-CustomVoice-ONNX

# Run
dotnet run --project src/QwenTTS -- --model-dir python/onnx_runtime --text "Hello world" --speaker ryan --language english
```

## Architecture

- **Talker**: 28 transformer layers, 16 attention heads, 8 KV heads (grouped-query attention), hidden size 1024
- **Code Predictor**: 5 layers, generates codebook groups 1-15
- **Vocoder**: RVQ dequantize → transformer → BigVGAN decoder; upsamples codec frames 1920× to 24 kHz audio (24000 / 1920 = 12.5 frames per second, the nominal "12Hz" rate)
- **KV Cache**: Decode uses a stacked format with shape `(num_layers, B, num_kv_heads, T, head_dim)`, where `B` is the batch size and `T` is the number of cached positions
- **Speakers**: serena, vivian, uncle_fu, ryan, aiden, ono_anna, sohee, eric, dylan
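The shape and rate figures in this list can be checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, `head_dim` is deliberately left as a required argument because the list does not state its value, and the assumption that each decode step appends exactly one position to `T` is inferred from "single-step decode" rather than confirmed by the exported graph.

```python
# Values stated in the Architecture list.
NUM_LAYERS = 28       # Talker transformer layers
NUM_KV_HEADS = 8      # Talker KV heads
UPSAMPLE_FACTOR = 1920  # vocoder samples produced per codec frame
SAMPLE_RATE_HZ = 24000  # vocoder output sample rate


def kv_cache_shape(batch, cached_len, head_dim):
    """Stacked KV cache shape: (num_layers, B, num_kv_heads, T, head_dim).

    head_dim is NOT given in the README, so the caller must supply it
    (for example, read it from the exported model's input metadata).
    """
    return (NUM_LAYERS, batch, NUM_KV_HEADS, cached_len, head_dim)


def after_decode_step(shape):
    """ASSUMPTION: one decode step grows the T axis (index 3) by one."""
    layers, batch, kv_heads, cached_len, head_dim = shape
    return (layers, batch, kv_heads, cached_len + 1, head_dim)


def frames_to_samples(num_frames):
    """Number of audio samples the vocoder emits for num_frames codec frames."""
    return num_frames * UPSAMPLE_FACTOR


def frames_to_seconds(num_frames):
    """Audio duration in seconds for num_frames codec frames."""
    return frames_to_samples(num_frames) / SAMPLE_RATE_HZ


def frame_rate_hz():
    """Codec frame rate implied by the sample rate and upsample factor."""
    return SAMPLE_RATE_HZ / UPSAMPLE_FACTOR
```

For instance, 25 codec frames correspond to 48,000 samples, or 2 seconds of audio, and the implied frame rate is 12.5 frames per second. When sizing cache buffers in C#, the same tuple order applies to the tensor dimensions.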

## License

Apache-2.0 (same as base model)