How do I optimize Qwen3 TTS on an L4?

#33
by Vincent1122112 - opened

I'm trying to get Qwen3 TTS running at production speeds on an NVIDIA L4 (24GB). The quality is perfect, but the latency is too high.

What I’ve already done:

Used torch.compile(mode="reduce-overhead") and Flash Attention 2.

Implemented concurrent CUDA streams with threading. I load a separate model instance for each stream to try to saturate the GPU.
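For reference, the thread-per-stream setup looks roughly like the sketch below. It is a simplified illustration, not my actual code: a small `nn.Sequential` stands in for the model (it is not the Qwen3 TTS API), and the names `make_model`, `worker`, and `run_concurrent` are made up for this example. It falls back to CPU without streams when CUDA is not available.

```python
import threading
from contextlib import nullcontext

import torch
import torch.nn as nn

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


def make_model() -> nn.Module:
    # Hypothetical stand-in for one TTS model instance (NOT the Qwen3 TTS API).
    net = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
    return net.to(DEVICE).eval()


def worker(idx: int, model: nn.Module, batch: torch.Tensor, results: list) -> None:
    # Each thread gets its own CUDA stream so kernels from different
    # threads are allowed to overlap on the GPU.
    stream = torch.cuda.Stream() if DEVICE == "cuda" else None
    if stream is not None:
        # The input was created on the default stream; wait for it to be ready.
        stream.wait_stream(torch.cuda.current_stream())
    ctx = torch.cuda.stream(stream) if stream is not None else nullcontext()
    with torch.inference_mode(), ctx:
        out = model(batch)
    if stream is not None:
        # Block this thread until the work queued on its stream has finished.
        stream.synchronize()
    results[idx] = out.cpu()


def run_concurrent(num_workers: int = 2) -> list:
    # One model instance and one input batch per worker thread.
    models = [make_model() for _ in range(num_workers)]
    batches = [torch.randn(8, 256, device=DEVICE) for _ in range(num_workers)]
    results = [None] * num_workers
    threads = [
        threading.Thread(target=worker, args=(i, m, b, results))
        for i, (m, b) in enumerate(zip(models, batches))
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    outputs = run_concurrent(2)
    print([tuple(o.shape) for o in outputs])
```

Even with separate streams, the Python threads still share the GIL while launching kernels, so I'm not sure how much real overlap I'm getting from this.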

Used Whisper-Tiny for fast reference audio transcription.

Basically, I give Qwen3 a reference audio clip so that it can generate new audio based on that reference. For a long prompt, generation takes around 43 seconds, and I want to get it down to around 18. Is there anything else I can do? Can I run concurrent generation on Qwen3 TTS?
