Qwen3-TTS 12Hz 1.7B VoiceDesign - GGUF (CrispASR)
GGUF / ggml conversions of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign for use with the qwen3-tts backend in CrispStrobe/CrispASR.
VoiceDesign is the instruct-tuned variant of Qwen3-TTS-12Hz-1.7B. Instead of cloning a reference WAV (Base) or picking from a fixed speaker table (CustomVoice), VoiceDesign generates speech in a voice described by a natural-language instruction, with no reference audio and no preset speaker.
crispasr ... \
--instruct "A young female voice with a slight British accent, energetic, slightly fast paced" \
--tts "Hello, I am an excited engineer."
This repo contains the talker / code-predictor model only. It must be used together with the separate tokenizer / codec GGUF from cstr/qwen3-tts-tokenizer-12hz-GGUF. The model has no speaker-encoder branch; the instruct text is embedded directly into the talker prefill.
VoiceDesign is 1.7B-only: there is no 0.6B-VoiceDesign weight release upstream.
Files
| File | Size | Notes |
|---|---|---|
| qwen3-tts-12hz-1.7b-voicedesign-f16.gguf | 3.6 GB | F16 reference baseline |
| qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf | 1.9 GB | Q8_0, recommended quantised talker |
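The two sizes in the table can be sanity-checked against ggml's Q8_0 block layout, which stores 32 int8 weights plus one fp16 scale per block (34 bytes, i.e. 8.5 bits per weight). The sketch below is only a rough estimate: it assumes every tensor is quantised to Q8_0, which real files generally do not do (small tensors such as norms usually stay at higher precision), and the function names are illustrative.

```python
# Rough size estimate for a Q8_0 file from the F16 file size.
# Assumption: all weights are quantised to Q8_0 (real files keep some tensors unquantised).

Q8_0_BLOCK_WEIGHTS = 32          # weights per Q8_0 block
Q8_0_BLOCK_BYTES = 2 + 32        # one fp16 scale + 32 int8 values
F16_BITS_PER_WEIGHT = 16


def q8_0_bits_per_weight() -> float:
    """Effective storage cost of one weight in Q8_0."""
    return Q8_0_BLOCK_BYTES * 8 / Q8_0_BLOCK_WEIGHTS


def estimate_q8_0_size_gb(f16_size_gb: float) -> float:
    """Scale an F16 file size by the Q8_0 / F16 bits-per-weight ratio."""
    return f16_size_gb * q8_0_bits_per_weight() / F16_BITS_PER_WEIGHT


print(q8_0_bits_per_weight())                  # 8.5
print(round(estimate_q8_0_size_gb(3.6), 2))    # 1.91
```

The estimate of about 1.91 GB lines up with the 1.9 GB listed for the Q8_0 file.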
Quick Start
Build CrispASR:
git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target crispasr
Download the talker + tokenizer:
huggingface-cli download cstr/qwen3-tts-1.7b-voicedesign-GGUF \
qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf --local-dir .
huggingface-cli download cstr/qwen3-tts-tokenizer-12hz-GGUF \
qwen3-tts-tokenizer-12hz.gguf --local-dir .
Synthesise with a natural-language voice description:
./build/bin/crispasr \
--backend qwen3-tts-1.7b-voicedesign \
-m qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf \
--codec-model qwen3-tts-tokenizer-12hz.gguf \
--instruct "A young female voice with a slight British accent, energetic, slightly fast paced" \
--tts "Hello, I am an excited engineer." \
--tts-output hello.wav
Or let CrispASR pull both files for you on first run:
./build/bin/crispasr \
--backend qwen3-tts-1.7b-voicedesign -m auto \
--instruct "warm, calm middle-aged male narrator" \
--tts "The story begins on a quiet Tuesday morning." \
--tts-output story.wav
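For scripted use, the same invocation can be assembled programmatically. The following is a minimal sketch that builds the argument list for several voice descriptions; it uses only the flags shown above, and the helper name `build_voicedesign_cmd` and the output file names are illustrative assumptions, not part of CrispASR. It prints the commands rather than running them.

```python
import shlex

CRISPASR_BIN = "./build/bin/crispasr"  # path produced by the build step above


def build_voicedesign_cmd(instruct, text, out_wav,
                          model="qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf",
                          codec="qwen3-tts-tokenizer-12hz.gguf"):
    """Return the argv list for one VoiceDesign synthesis call (hypothetical helper)."""
    if not instruct.strip():
        raise ValueError("VoiceDesign needs a non-empty --instruct description")
    return [
        CRISPASR_BIN,
        "--backend", "qwen3-tts-1.7b-voicedesign",
        "-m", model,
        "--codec-model", codec,
        "--instruct", instruct,
        "--tts", text,
        "--tts-output", out_wav,
    ]


# Illustrative batch: one sentence, two voice descriptions taken from this README.
voices = {
    "british.wav": "A young female voice with a slight British accent, "
                   "energetic, slightly fast paced",
    "narrator.wav": "warm, calm middle-aged male narrator",
}
for wav, description in voices.items():
    cmd = build_voicedesign_cmd(description, "Hello, I am an excited engineer.", wav)
    # Print a shell-quoted command; replace with subprocess.run(cmd, check=True) to synthesise.
    print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell-quoting problems with commas and spaces in the voice description.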
Notes:
- --instruct is required for VoiceDesign models. Passing --voice instead is a CLI error with a hint.
- The instruct text is wrapped as <|im_start|>user\n{instruct}<|im_end|>\n and prepended to the talker prefill; the codec bridge omits the speaker frame entirely (the model has no fixed speaker embedding).
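To make the wrapper concrete, here is a minimal sketch of the string layout only. Tokenisation and embedding happen inside the C++ runtime and are not shown; `wrap_instruct` is a hypothetical name used for illustration.

```python
def wrap_instruct(instruct: str) -> str:
    """Apply the chat-style wrapper described above to a voice description."""
    return f"<|im_start|>user\n{instruct}<|im_end|>\n"


wrapped = wrap_instruct("warm, calm middle-aged male narrator")
print(repr(wrapped))
# '<|im_start|>user\nwarm, calm middle-aged male narrator<|im_end|>\n'
```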
Quantisation Notes
- qwen3-tts-12hz-1.7b-voicedesign-f16.gguf - reference baseline (3.6 GB)
- qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf - recommended quantised deployment (1.9 GB)
  - ASR-roundtrips word-exact on the English smoke prompt in current CrispASR testing ("Hello, I am an excited engineer." → parakeet-v3 → "Hello! I am an excited engineer!")
Lower-bit talker quants (q6_k, q5_k, q4_k) can still load but are not numerically faithful to the F16 reference and should be treated as experimental.
The companion tokenizer / codec should stay at F16.
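The round-trip result quoted above is "word-exact" even though the punctuation differs. One way to express such a comparison is sketched below; the normalisation rule (lowercase, keep only letters, digits, and apostrophes) is an assumption for illustration and is not necessarily what the CrispASR tests use.

```python
import re


def normalise_words(text):
    """Lowercase and split into words, dropping punctuation (assumed rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_exact(prompt, transcript):
    """True when both strings contain the same words in the same order."""
    return normalise_words(prompt) == normalise_words(transcript)


prompt = "Hello, I am an excited engineer."
transcript = "Hello! I am an excited engineer!"   # parakeet-v3 output quoted above
print(word_exact(prompt, transcript))             # True
```

A check like this is a coarse smoke test: it confirms intelligibility but says nothing about voice quality or how closely the output matches the F16 reference numerically.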
How this was made
- The upstream HF safetensors checkpoint was converted to GGUF F16 by models/convert-qwen3-tts-to-gguf.py. The converter sets qwen3tts.tts_model_type = "voice_design" from the upstream config.json, which the runtime keys off to switch into the VoiceDesign prefill path.
- Quantised variants are produced with CrispASR's GGUF quantiser.
- Inference is implemented in src/qwen3_tts.cpp; the VoiceDesign-specific prefill builder (build_voicedesign_prefill_embeds) mirrors Qwen3TTSForConditionalGeneration.generate for speaker_embed=None + instruct_ids (modeling_qwen3_tts.py L2076-L2233).
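To confirm which variant a downloaded file is, the metadata key mentioned above can be read straight from the GGUF header. The sketch below is a minimal standalone parser for the key/value section, assuming the GGUF v2+ little-endian layout (64-bit counts and string lengths) and assuming the key is stored as a string; the function names are illustrative and this is not the reader CrispASR itself uses.

```python
import io
import struct

# GGUF scalar value type ids mapped to little-endian struct formats.
_SCALARS = {0: "<B", 1: "<b", 2: "<H", 3: "<h", 4: "<I", 5: "<i",
            6: "<f", 7: "<?", 10: "<Q", 11: "<q", 12: "<d"}
_STRING, _ARRAY = 8, 9


def _unpack(f, fmt):
    """Read exactly one value of the given struct format from the stream."""
    size = struct.calcsize(fmt)
    data = f.read(size)
    if len(data) != size:
        raise ValueError("unexpected end of file")
    return struct.unpack(fmt, data)[0]


def _read_string(f):
    """GGUF string: uint64 byte length followed by UTF-8 bytes."""
    length = _unpack(f, "<Q")
    return f.read(length).decode("utf-8")


def _read_value(f, vtype):
    if vtype in _SCALARS:
        return _unpack(f, _SCALARS[vtype])
    if vtype == _STRING:
        return _read_string(f)
    if vtype == _ARRAY:
        elem_type = _unpack(f, "<I")
        count = _unpack(f, "<Q")
        return [_read_value(f, elem_type) for _ in range(count)]
    raise ValueError(f"unknown GGUF value type {vtype}")


def read_gguf_metadata(f):
    """Return the metadata key/value pairs from an open binary GGUF stream."""
    if f.read(4) != b"GGUF":
        raise ValueError("not a GGUF file")
    version = _unpack(f, "<I")
    if version < 2:
        raise ValueError("GGUF v1 is not handled by this sketch")
    _unpack(f, "<Q")              # tensor count (unused here)
    kv_count = _unpack(f, "<Q")
    meta = {}
    for _ in range(kv_count):
        key = _read_string(f)
        vtype = _unpack(f, "<I")
        meta[key] = _read_value(f, vtype)
    return meta


# Usage (requires the downloaded file):
# with open("qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf", "rb") as fh:
#     print(read_gguf_metadata(fh).get("qwen3tts.tts_model_type"))
```

Only the header and metadata are read, so the check does not load the tensor data.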
Reference implementation
Architecture and behaviour were checked against the official Qwen release:
- upstream model card: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
- upstream repository: QwenLM/Qwen3-TTS
The CrispASR runtime is a clean C++ / ggml re-implementation for this repo's backend stack.
Related
- Base sibling (voice cloning from reference WAV): cstr/qwen3-tts-1.7b-base-GGUF
- Smaller sibling (Base, 0.6B): cstr/qwen3-tts-0.6b-base-GGUF
- Fixed-speaker sibling (CustomVoice, 0.6B): cstr/qwen3-tts-0.6b-customvoice-GGUF
- Companion tokenizer / codec GGUF: cstr/qwen3-tts-tokenizer-12hz-GGUF
- Upstream VoiceDesign model: Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign
- C++ runtime: CrispStrobe/CrispASR
License
Apache-2.0, inherited from the base model.