How do I optimize Qwen3 TTS on an L4?
#33
by Vincent1122112 - opened
I'm trying to get Qwen3 TTS running at production speeds on an NVIDIA L4 (24GB). The quality is perfect, but the latency is too high.
What I’ve already done:
Used torch.compile(mode="reduce-overhead") and Flash Attention 2.
Implemented Concurrent CUDA Streams with threading. I load separate model instances into each stream to try and saturate the GPU.
Used Whisper-Tiny for fast reference audio transcription.
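For anyone trying to picture the threading setup, here is a minimal sketch of the pattern, not the exact code in use. It assumes each model instance is wrapped in a plain callable that takes text and returns audio. The names make_synthesizer_pool and fake_synthesize are hypothetical, and the stand-in function does no real inference. In the real setup each callable would run its own model instance inside its own CUDA stream (for example under torch.cuda.stream(...)), which is left out here so the sketch runs without a GPU.

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def make_synthesizer_pool(synth_fns):
    """Build a function that fans requests out over several model instances.

    synth_fns: list of callables, one per model instance (hypothetical wrappers).
    Each instance handles at most one request at a time.
    """
    free_instances = queue.Queue()
    for fn in synth_fns:
        free_instances.put(fn)

    # One worker thread per instance, so no thread waits long for a free model.
    executor = ThreadPoolExecutor(max_workers=len(synth_fns))

    def run_one(text):
        fn = free_instances.get()  # block until some instance is idle
        try:
            return fn(text)
        finally:
            free_instances.put(fn)  # hand the instance back to the pool

    def synthesize_all(texts):
        # executor.map returns results in input order, whichever thread ran them.
        return list(executor.map(run_one, texts))

    return synthesize_all


def fake_synthesize(text):
    """Stand-in for a real TTS call (assumption: text in, audio out)."""
    return text.upper()


# Two "instances" sharing the work on three prompts.
synthesize_all = make_synthesizer_pool([fake_synthesize, fake_synthesize])
results = synthesize_all(["first prompt", "second prompt", "third prompt"])
```

This only helps throughput across independent requests. It does not shorten the time a single long prompt takes, because one generation still runs on one instance from start to finish.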
Basically, I give Qwen a reference audio clip so that it can generate new audio based on that reference. A long prompt takes around 43 seconds, and I want to get it down to around 18. Is there anything else I can do? Can I run concurrent generation on Qwen3?