Qwen3-TTS 12Hz 1.7B VoiceDesign: GGUF (CrispASR)

GGUF / ggml conversions of Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign for use with the qwen3-tts backend in CrispStrobe/CrispASR.

VoiceDesign is the instruct-tuned variant of Qwen3-TTS-12Hz-1.7B. Instead of cloning a reference WAV (Base) or picking from a fixed speaker table (CustomVoice), VoiceDesign generates speech in a voice described by a natural-language instruction: no reference audio, no preset speaker.

crispasr ... \
    --instruct "A young female voice with a slight British accent, energetic, slightly fast paced" \
    --tts "Hello, I am an excited engineer."

This repo contains the talker / code-predictor model only. It must be used together with the separate tokenizer / codec GGUF from cstr/qwen3-tts-tokenizer-12hz-GGUF. The model has no speaker-encoder branch; the instruct text is embedded directly into the talker prefill.

VoiceDesign is 1.7B-only: there is no 0.6B-VoiceDesign weight release upstream.

Files

File                                        Size     Notes
qwen3-tts-12hz-1.7b-voicedesign-f16.gguf    3.6 GB   F16 reference baseline
qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf   1.9 GB   Q8_0, recommended quantised talker

Quick Start

Build CrispASR:

git clone https://github.com/CrispStrobe/CrispASR
cd CrispASR
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target crispasr
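
nproc ships with GNU coreutils and may be missing on macOS or BSD. A minimal sketch of a portable job count, assuming getconf or sysctl is available as a fallback (the njobs variable name is only illustrative):

```shell
# Pick a parallel job count: nproc on Linux, then getconf,
# then sysctl (macOS/BSD), and finally fall back to a single job.
njobs=$(nproc 2>/dev/null \
    || getconf _NPROCESSORS_ONLN 2>/dev/null \
    || sysctl -n hw.ncpu 2>/dev/null \
    || echo 1)
echo "building with ${njobs} parallel jobs"
# Then: cmake --build build -j"${njobs}" --target crispasr
```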

Download the talker + tokenizer:

huggingface-cli download cstr/qwen3-tts-1.7b-voicedesign-GGUF \
    qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf --local-dir .

huggingface-cli download cstr/qwen3-tts-tokenizer-12hz-GGUF \
    qwen3-tts-tokenizer-12hz.gguf --local-dir .
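
Interrupted downloads are a common source of confusing load errors. As a quick sanity check, every GGUF file starts with the four ASCII bytes "GGUF". The sketch below tests for that magic; is_gguf is a hypothetical helper, not part of CrispASR, and it does not validate the rest of the file:

```shell
# Return 0 if the file exists and begins with the 4-byte GGUF magic.
is_gguf() {
    [ -f "$1" ] && [ "$(head -c 4 "$1")" = "GGUF" ]
}

for f in qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf qwen3-tts-tokenizer-12hz.gguf; do
    if is_gguf "$f"; then
        echo "ok: $f"
    else
        echo "missing or not a GGUF file: $f" >&2
    fi
done
```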

Synthesise with a natural-language voice description:

./build/bin/crispasr \
    --backend qwen3-tts-1.7b-voicedesign \
    -m qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf \
    --codec-model qwen3-tts-tokenizer-12hz.gguf \
    --instruct "A young female voice with a slight British accent, energetic, slightly fast paced" \
    --tts "Hello, I am an excited engineer." \
    --tts-output hello.wav
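
When trying several voice descriptions, it can help to wrap the invocation above in a small function. This is only a convenience sketch: synth_voice and the CRISPASR variable are hypothetical names, and the flags are exactly those shown in the command above.

```shell
# Path to the binary; override CRISPASR to point elsewhere
# (or set it to `echo` for a dry run that only prints the arguments).
CRISPASR=${CRISPASR:-./build/bin/crispasr}

# synth_voice <output.wav> <voice description> <text>
synth_voice() {
    "$CRISPASR" \
        --backend qwen3-tts-1.7b-voicedesign \
        -m qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf \
        --codec-model qwen3-tts-tokenizer-12hz.gguf \
        --instruct "$2" \
        --tts "$3" \
        --tts-output "$1"
}
```

For example, synth_voice narrator.wav "warm, calm middle-aged male narrator" "The story begins on a quiet Tuesday morning." would write narrator.wav using the Q8_0 talker.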

Or let CrispASR pull both files for you on first run:

./build/bin/crispasr \
    --backend qwen3-tts-1.7b-voicedesign -m auto \
    --instruct "warm, calm middle-aged male narrator" \
    --tts "The story begins on a quiet Tuesday morning." \
    --tts-output story.wav

Notes:

  • --instruct is required for VoiceDesign models. Passing --voice instead fails with a CLI error that includes a hint.
  • The instruct text is wrapped as <|im_start|>user\n{instruct}<|im_end|>\n and prepended to the talker prefill; the codec bridge omits the speaker frame entirely (the model has no fixed speaker embedding).
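
To make the wrapping concrete, the sketch below renders the template from the note above for a given description. It is purely illustrative: the real wrapping happens inside the runtime before tokenisation, and wrap_instruct is a hypothetical helper, not a CrispASR command.

```shell
# Print the chat-style wrapper around a voice description.
# The \n escapes in the format string become real newlines.
wrap_instruct() {
    printf '<|im_start|>user\n%s<|im_end|>\n' "$1"
}

wrap_instruct "warm, calm middle-aged male narrator"
```

This prints the opening tag and role on one line, followed by the description and the closing tag on the next.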

Quantisation Notes

  • qwen3-tts-12hz-1.7b-voicedesign-f16.gguf
    • reference baseline (3.6 GB)
  • qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf
    • recommended quantised deployment (1.9 GB)
    • round-trips through ASR word-exact (only punctuation differs) on the English smoke prompt in current CrispASR testing ("Hello, I am an excited engineer." → parakeet-v3 → "Hello! I am an excited engineer!")
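
One plausible reading of "word-exact" is that the prompt and the transcript match once case and punctuation are ignored. The sketch below implements that comparison; normalise is a hypothetical helper and is not necessarily the check used in CrispASR's own tests.

```shell
# Lowercase, drop punctuation, and squeeze repeated spaces.
normalise() {
    printf '%s\n' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | tr -d '[:punct:]' \
        | tr -s ' '
}

ref="Hello, I am an excited engineer."
hyp="Hello! I am an excited engineer!"

if [ "$(normalise "$ref")" = "$(normalise "$hyp")" ]; then
    echo "word-exact match"
else
    echo "mismatch"
fi
```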

Lower-bit talker quants (q6_k, q5_k, q4_k) can still load but are not numerically faithful to the F16 reference and should be treated as experimental.

The companion tokenizer / codec should stay at F16.
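
The two file sizes are consistent with the ggml Q8_0 layout, which stores each block of 32 weights as 32 int8 values plus one 16-bit scale, i.e. 8.5 bits per weight. A rough estimate, assuming every tensor is quantised (an approximation, since some tensors may stay at higher precision); estimate_q8_0_gb is a hypothetical helper:

```shell
# Estimate Q8_0 file size from the F16 size:
# (32 * 8 + 16) / 32 = 8.5 bits per weight, versus 16 for F16.
estimate_q8_0_gb() {
    awk -v f16_gb="$1" 'BEGIN {
        bits_per_weight = (32 * 8 + 16) / 32
        printf "%.2f\n", f16_gb * bits_per_weight / 16
    }'
}

estimate_q8_0_gb 3.6   # prints 1.91
```

The estimate of about 1.91 GB lines up with the 1.9 GB listed for the Q8_0 file.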

How this was made

  1. The upstream HF safetensors checkpoint was converted to GGUF F16 by models/convert-qwen3-tts-to-gguf.py. The converter sets qwen3tts.tts_model_type = "voice_design" from the upstream config.json, which the runtime keys off to switch into the VoiceDesign prefill path.
  2. Quantised variants are produced with CrispASR's GGUF quantiser.
  3. Inference is implemented in src/qwen3_tts.cpp; the VoiceDesign-specific prefill builder (build_voicedesign_prefill_embeds) mirrors Qwen3TTSForConditionalGeneration.generate for speaker_embed=None + instruct_ids (modeling_qwen3_tts.py L2076–L2233).
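
Because GGUF stores metadata keys as plain UTF-8 strings in the file header, a crude way to confirm that a converted file carries the qwen3tts.tts_model_type key from step 1 is a binary-safe grep. This is a sketch only: has_gguf_key is a hypothetical helper, and it checks that the key string is present, not what value it holds.

```shell
# has_gguf_key <file> <key>: succeed if the key string occurs anywhere in the file.
# -a treats the binary file as text, -F matches the key literally, -q stops at the first hit.
has_gguf_key() {
    LC_ALL=C grep -a -q -F -- "$2" "$1"
}

f=qwen3-tts-12hz-1.7b-voicedesign-q8_0.gguf
if [ -f "$f" ] && has_gguf_key "$f" "qwen3tts.tts_model_type"; then
    echo "tts_model_type key found in $f"
else
    echo "key not found (or file missing): $f" >&2
fi
```

For anything beyond a presence check, a proper GGUF reader that decodes the key-value table is the safer choice.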

Reference implementation

Architecture and behaviour were checked against the official Qwen release (Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign).

The CrispASR runtime is a clean C++ / ggml re-implementation for this repo's backend stack.

License

Apache-2.0, inherited from the base model.
