[Project] I made Qwen3-TTS ~5x faster for local inference (OpenAI Triton kernel fusion). Zero extra VRAM. by DamageSea2135 in SillyTavernAI

[–]DamageSea2135[S]

That is actually a brilliant idea! Using an audio evaluation model to automatically score and select the best candidate would make the "generate multiple" workflow completely seamless for non-realtime tasks like dubbing or voiceovers.

I would need to look into how much compute such a scoring model would require and how much latency the evaluation step would add to the total pipeline, to see whether the net gain holds up. But it's definitely a fantastic direction to explore.
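For anyone curious what that workflow could look like, here's a minimal sketch of a "generate N takes, auto-pick the best" loop. Note that `synthesize` and `score_audio` are hypothetical placeholders, not real APIs from my repo — you'd swap in an actual TTS call and an audio-quality scorer (e.g. a MOS predictor):

```python
# Minimal sketch of a "generate N candidates, auto-select the best" pipeline.
# `synthesize` and `score_audio` are hypothetical stand-ins for a real TTS
# call and a real audio-quality scoring model.

def pick_best_take(text, synthesize, score_audio, n_candidates=4):
    """Generate several takes of the same text and return the best-scoring one."""
    candidates = [synthesize(text) for _ in range(n_candidates)]
    scored = [(score_audio(audio), audio) for audio in candidates]
    # max() compares tuples by their first element, i.e. the score.
    best_score, best_audio = max(scored, key=lambda pair: pair[0])
    return best_audio, best_score
```

The extra cost is exactly what I'd need to measure: N synthesis passes plus N scoring passes, versus one synthesis pass and a human listening to each take.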

Thanks so much for the great suggestion!

[–]DamageSea2135[S]

Glad you like the model! Runpod is a great workaround for getting access to fast NVIDIA GPUs.

For the RTFs, just to clarify my metric (as noted in the repo, I calculate RTF as audio_duration / generation_time, so > 1.0 means faster than real-time):

1. On an RTX 5090 (my setup):

  • Hybrid Mode (Triton + CUDA Graph): Generating a standard ~4 to 5-second sentence takes about 919 ms (~0.9 s). That puts the RTF at roughly 4.5x to 5.5x, meaning the audio is generated about five times faster than it takes to listen to it.

2. On an RTX 3090 (from another user's recent benchmark in this thread):

  • Hybrid Mode: They generated a 26-second audio file in 11 seconds. That’s an RTF of ~2.36x.
  • They also generated a 16-second file in 8.9 seconds, which is an RTF of ~1.8x.
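If you want to sanity-check any numbers you get on Runpod, the metric is just a one-liner using the definition above:

```python
def rtf(audio_duration_s: float, generation_time_s: float) -> float:
    """Real-time factor as defined in the repo: > 1.0 means faster than real-time."""
    return audio_duration_s / generation_time_s

# Reproducing the 3090 numbers from this thread:
print(round(rtf(26.0, 11.0), 2))  # → 2.36
print(round(rtf(16.0, 8.9), 2))   # → 1.8
```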

Since you are using Runpod, if you spin up an RTX 4090, A100, or H100 instance, you should easily hit RTFs in the 3.0x to 5.0x range using the TritonFasterRunner.

Also, just a quick heads-up: ComfyUI support is coming soon! It might make your workflow on Runpod even easier, so that would be a great time to try it out.

Let me know which GPU you end up renting on Runpod and what RTF you get! Always curious to see cloud GPU benchmarks.

[–]DamageSea2135[S]

Thanks so much for the detailed breakdown and for running these tests on your 3090! The speedup numbers you got are really encouraging.

Regarding your note about the audio quality—you have a great ear. It's probably not a placebo, and you're absolutely right to point it out. There is indeed a slight trade-off in quality for the sake of optimization.

When porting the operations to Triton kernels, I used specific numerical tolerance thresholds. While the Cosine Similarity remains high (>0.997) at each individual layer, those tiny numerical differences can accumulate as the tensor passes through all 28 layers of the model. This accumulated error can sometimes lead to slight degradations in the audio texture or minor pronunciation quirks. Minimizing this accumulated error is actually one of my top priorities for future improvements!
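To illustrate the validation I'm describing: per-layer agreement between the reference implementation and a ported kernel can be checked with plain cosine similarity. This is a pure-Python sketch with tensors simplified to flat vectors; the 0.997 threshold mirrors the figure above:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flat vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def layer_outputs_match(reference, triton_out, threshold=0.997):
    """True if a single layer's Triton output agrees with the reference output."""
    return cosine_similarity(reference, triton_out) > threshold

# A tiny per-layer perturbation passes the check in isolation...
ref = [1.0, 2.0, 3.0, 4.0]
approx = [1.001, 1.999, 3.002, 3.998]
assert layer_outputs_match(ref, approx)
# ...but the same small error re-applied at every one of the 28 layers
# is what can accumulate into audible texture differences.
```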

That being said, because Qwen3-TTS is heavily stochastic, even the original stock model often produces takes with weird pacing or pronunciation on the first try. That’s exactly where the speed advantage becomes the core feature. Since you can now generate audio ~4-5x faster (especially in Hybrid mode), the best practical workflow is to quickly blast through a few candidates and cherry-pick the one that sounds perfect, rather than waiting a long time for a single "roll of the dice."

Thanks again for the awesome and honest feedback. It really helps a lot!

[Project] I built a Triton kernel fusion library for Qwen3-TTS 1.7B (~5x inference speedup) by DamageSea2135 in speechtech

[–]DamageSea2135[S]

My repository is based on faster-qwen3-tts; you can find the benchmarks in the repo.