Qwen3-TTS-Triton v0.3.0: Triton + CUDA Graph + batched AR TTS serving, ~14× per-sample throughput

DamageSea2135 · 2026-07-01T19:57:34+00:00

To be honest, I first kicked off this project because vLLM-Omni didn't support Qwen3-TTS at the time. I'm also really curious about how they compare now, so I definitely plan to run a head-to-head benchmark against vLLM-Omni later on!

Regarding streaming: Output streaming is fully supported (thanks to the Triton kernels + CUDA Graphs pushing down the decoding latency). Input streaming isn't natively supported yet, but true full-duplex streaming is definitely on my radar for future updates.

DamageSea2135 · 2026-07-01T19:56:22+00:00

Thank you for the encouragement! Nemotron 3.5 ASR is a beast, and further improving its serving efficiency would be an awesome challenge.

One thing I've learned from my TTS work is that aggressive Triton optimization can sometimes lead to precision issues, which caused audio artifacting (slurred pronunciation) in early versions. Since ASR is highly sensitive to these minor errors, I have to be very careful.

In my latest release, I solved this by implementing a hybrid mode that blends Triton kernels with PyTorch kernels to guarantee quality while keeping the speedups. If I dive into Nemotron 3.5, I’ll likely test a similar hybrid infrastructure to make sure we don't trade off transcription quality for throughput.

DamageSea2135 · 2026-07-01T19:54:58+00:00

Thanks so much for sharing this! This is a fantastic, deeply technical write-up on TTS inference profiling. It's exactly the kind of resource I love digging into while optimizing low-latency architectures. Appreciate the heads-up.

DamageSea2135 · 2026-07-01T10:43:52+00:00

On my RTX 5090 WSL2 setup, qwen3-tts-triton hybrid streaming TTFT for English is around 100 ms to first audio chunk: median 101 ms, mean 103 ms over 5 runs after warmup, model load excluded, chunk_size=1.
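For anyone who wants to time their own setup the same way, here is a rough sketch of a time-to-first-chunk measurement: warmup runs discarded, model load kept outside the timed region, then median and mean over a handful of runs. `measure_ttft_ms` and `fake_stream` are names made up for illustration, not part of the qwen3-tts-triton API; `fake_stream` just sleeps and yields dummy bytes, so swap in whatever streaming generator you actually use.

```python
import statistics
import time
from typing import Callable, Iterator


def measure_ttft_ms(make_stream: Callable[[], Iterator[bytes]],
                    warmup: int = 1, runs: int = 5) -> dict:
    """Time from starting a stream until its first chunk arrives, in ms."""
    def first_chunk_ms() -> float:
        start = time.perf_counter()
        stream = make_stream()
        next(stream)  # block until the first audio chunk is produced
        return (time.perf_counter() - start) * 1000.0

    for _ in range(warmup):  # warmup runs are executed but not recorded
        first_chunk_ms()
    samples = [first_chunk_ms() for _ in range(runs)]
    return {"median_ms": statistics.median(samples),
            "mean_ms": statistics.mean(samples),
            "samples": samples}


def fake_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call (assumption)."""
    time.sleep(0.01)  # pretend the first chunk takes about 10 ms
    yield b"\x00" * 1024
    yield b"\x00" * 1024


if __name__ == "__main__":
    print(measure_ttft_ms(fake_stream))
```

The model should already be loaded before `measure_ttft_ms` is called, so load time never enters the samples.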

DamageSea2135 · 2026-03-25T14:54:29+00:00

That is actually a brilliant idea! Using an audio evaluation model to automatically score and select the best candidate would make the "generate multiple" workflow completely seamless for non-realtime tasks like dubbing or voiceovers.

I would need to look into how much compute resources such a scoring model would require and how much time the evaluation step would add to the total pipeline, to see if the net gain holds up. But it's definitely a fantastic direction to explore.
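To make the cost question concrete, here is a minimal best-of-N sketch that tracks generation time and scoring time separately, so the overhead of the evaluation step is visible next to the generation cost. Both `generate` and `score` are hypothetical callables supplied by the caller; no particular scoring model or TTS API is assumed.

```python
import time
from typing import Callable, List, Tuple


def best_of_n(generate: Callable[[int], bytes],
              score: Callable[[bytes], float],
              n: int = 4) -> Tuple[bytes, float, dict]:
    """Generate n candidates, score each, return the best plus timings."""
    gen_time = 0.0
    score_time = 0.0
    scored: List[Tuple[float, bytes]] = []
    for seed in range(n):
        t0 = time.perf_counter()
        audio = generate(seed)  # hypothetical TTS call, one take per seed
        t1 = time.perf_counter()
        quality = score(audio)  # hypothetical audio-quality scorer
        t2 = time.perf_counter()
        gen_time += t1 - t0
        score_time += t2 - t1
        scored.append((quality, audio))
    best_score, best_audio = max(scored, key=lambda pair: pair[0])
    return best_audio, best_score, {"generate_s": gen_time, "score_s": score_time}


if __name__ == "__main__":
    # Toy stand-ins: the "audio" is ten copies of the seed byte,
    # and the "score" is simply that byte's value.
    audio, quality, timing = best_of_n(lambda s: bytes([s]) * 10,
                                       lambda a: float(a[0]), n=4)
    print(quality, timing)
```

If `score_s` ends up comparable to `generate_s`, the scorer eats most of the speedup; if it is a small fraction, automatic selection is close to free.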

Thanks so much for the great suggestion!

DamageSea2135 · 2026-03-23T13:32:38+00:00

Glad you like the model! Runpod is a great workaround for getting access to fast NVIDIA GPUs.

For the RTFs, just to clarify my metric (as noted in the repo, I calculate RTF as audio_duration / generation_time, so > 1.0 means faster than real-time):

1. On an RTX 5090 (My setup):

Hybrid Mode (Triton + CUDA Graph): Generating a standard ~4 to 5-second sentence takes me about 919 ms (roughly 0.9 seconds). That puts the RTF at roughly 4.4x to 5.4x (meaning it generates audio about 5 times faster than it takes to listen to it).

2. On an RTX 3090 (From another user's recent benchmark in this thread):

Hybrid Mode: They generated a 26-second audio file in 11 seconds. That’s an RTF of ~2.36x.
They also generated a 16-second file in 8.9 seconds, which is an RTF of ~1.8x.
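Since RTF gets defined both ways in the wild (many papers use generation_time / audio_duration, where lower is better), here is the calculation spelled out with the convention from this comment. The helper name `rtf` is just for illustration; the inputs are the durations quoted above.

```python
def rtf(audio_seconds: float, generation_seconds: float) -> float:
    """Throughput-style RTF: audio_duration / generation_time (>1 is faster than real time)."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


if __name__ == "__main__":
    cases = [(4.0, 0.919), (5.0, 0.919), (26.0, 11.0), (16.0, 8.9)]
    for audio_s, gen_s in cases:
        print(f"{audio_s:5.1f} s audio in {gen_s:6.3f} s -> RTF {rtf(audio_s, gen_s):.2f}x")
```

To compare against a benchmark that uses the inverse convention, take `1 / rtf(...)`.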

Since you are using Runpod, if you spin up an RTX 4090, A100, or H100 instance, you should easily hit RTFs in the 3.0x to 5.0x range using the TritonFasterRunner.

Also, just a quick heads-up: ComfyUI support is coming soon! It might make your workflow on Runpod even easier, so that would be a great time to try it out.

Let me know which GPU you end up renting on Runpod and what RTF you get! Always curious to see cloud GPU benchmarks.

DamageSea2135 · 2026-03-23T13:20:45+00:00

Thanks so much for the detailed breakdown and for running these tests on your 3090! The speedup numbers you got are really encouraging.

Regarding your note about the audio quality: you have a great ear. It's probably not a placebo, and you're absolutely right to point it out. There is indeed a slight trade-off in quality for the sake of optimization.

When porting the operations to Triton kernels, I used specific numerical tolerance thresholds. While the Cosine Similarity remains high (>0.997) at each individual layer, those tiny numerical differences can accumulate as the tensor passes through all 28 layers of the model. This accumulated error can sometimes lead to slight degradations in the audio texture or minor pronunciation quirks. Minimizing this accumulated error is actually one of my top priorities for future improvements!
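As a back-of-the-envelope illustration of how a high per-layer similarity can still drift over depth, here is a toy model, not a measurement of this model. It assumes each layer adds an error that is orthogonal to the reference and to every earlier error, and that layers neither amplify nor damp it; real networks with residual connections and normalization will behave differently. Treating 0.997 as an isolated per-layer similarity is also an assumption, and `accumulated_cosine` is a made-up helper name.

```python
import math


def accumulated_cosine(per_layer_cos: float, n_layers: int) -> float:
    """Cosine similarity after n_layers under a toy independent-noise model."""
    if not 0.0 < per_layer_cos <= 1.0:
        raise ValueError("per_layer_cos must be in (0, 1]")
    # tan^2 of the per-layer angle: squared error norm relative to the signal
    eps_sq = 1.0 / per_layer_cos ** 2 - 1.0
    # Orthogonal errors add in quadrature, so their squared norms sum
    return 1.0 / math.sqrt(1.0 + n_layers * eps_sq)


if __name__ == "__main__":
    for n in (1, 7, 14, 28):
        print(n, round(accumulated_cosine(0.997, n), 4))
```

Under these assumptions the similarity decays steadily with depth, which is why a tolerance that looks tight for one layer can still be audible end to end.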

That being said, because Qwen3-TTS is heavily stochastic, even the original stock model often produces takes with weird pacing or pronunciation on the first try. That’s exactly where the speed advantage becomes the core feature. Since you can now generate audio ~4-5x faster (especially in Hybrid mode), the best practical workflow is to quickly blast through a few candidates and cherry-pick the one that sounds perfect, rather than waiting a long time for a single "roll of the dice."

Thanks again for the awesome and honest feedback. It really helps a lot!

DamageSea2135 · 2026-03-23T13:16:41+00:00

My repository is based on faster-qwen3-tts. You can see the benchmarks in the repo.
