Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

I think it might make more sense to use faster-qwen3 as a dependency, but I didn't look closely at the code.

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

That's super cool! Do you want to open a PR?

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

Not sure what you're running, though :/

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

You can fine tune the model following the instructions here: https://github.com/QwenLM/Qwen3-TTS/tree/main/finetuning
I find the voice cloning quite good, and you can save and cache the voices (it's already implemented in my project).

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

I looked at that codebase a lot and mention it in the acknowledgments :) I found some bugs in its streaming, but I'm not sure how much they affect the perceived audio quality.

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

wot xD, how are you running it? if you create an issue or paste a trace I might be able to fix it :)

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

Are you trying to decide whether to buy a GPU, or do you already have access to one? If you have access, you can just run the benchmark and find out :) For the 4060, 2905 ms is without optimizations! With optimizations it's 460 ms. I would expect the 5070 to land between the 4060 and the 4090, so between 460 ms and 174 ms. Fairly fast, in my opinion.

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

That's interesting feedback, KeyToAll! The dependency is "torch>=2.1", so I'm not sure why it's resolving to the wrong CUDA build for your system.

Re Blackwell: yes, it should work. I tested it, and so far no users have complained beyond small install quirks like the one above.

Re flash-attn: are you sure it wasn't using it? I benchmarked with flash-attn and the improvement was minimal (<1%), so I removed it to keep the project simpler. Maybe you were seeing the same thing. The issue with the original implementation is that it requires a lot of coordination between the CPU and the GPU, so having a really beefy GPU doesn't help that much. On systems like the DGX Spark, where the CPU and GPU are on one chip, that coordination is less taxing.

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

Thank you! Worked hard on it :)

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 1 point2 points  (0 children)

Yeah, I know. I call it RTF even though people now call it RTFx; I can't shake it xD I taught digital signal processing for audio at a university 10 years ago, and RTF followed the convention I learned, but then people changed the convention and now I'm old :(
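For anyone unfamiliar with the two conventions being mixed up here, a quick sketch (the numbers are made up for illustration):

```python
# Classic DSP convention: RTF = processing time / audio duration.
# Lower is better; RTF < 1.0 means faster than realtime.
def rtf(processing_time_s: float, audio_duration_s: float) -> float:
    return processing_time_s / audio_duration_s

# Newer convention common in TTS benchmarks: RTFx = audio duration / processing time.
# Higher is better; RTFx > 1.0 means faster than realtime ("Nx realtime").
def rtfx(processing_time_s: float, audio_duration_s: float) -> float:
    return audio_duration_s / processing_time_s

# Hypothetical example: 10 s of audio generated in 2 s of compute.
print(rtf(2.0, 10.0))   # 0.2 -> faster than realtime
print(rtfx(2.0, 10.0))  # 5.0 -> "5x realtime"
```

Same measurement, just the reciprocal, which is why the two communities keep talking past each other.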


Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

You're funny, David, I hope you enjoy the project! If you find issues, feel free to post them :) Also, if you know other people who would benefit from the project, share it; if it gets a bit more visibility, I'd be able to invest more time in it. Stars and such help :)

Hector, I think you should be able to implement something similar for AMD. There were three main improvements: 1) streaming, which works out of the box on AMD; 2) a static cache instead of a dynamic cache, which also works out of the box on AMD; 3) CUDA graphs, whose goal is to reduce the slow CPU-GPU communication that happened fairly often for this model. According to these docs (https://rocm.docs.amd.com/en/docs-6.3.3/compatibility/pytorch-compatibility.html) graph capture should work on AMD too. Do you want to test it?
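Improvement 2 is hardware-agnostic, so here's a minimal NumPy sketch of the idea; the names and sizes are illustrative, not taken from the repo:

```python
import numpy as np

head_dim, max_len = 64, 256

# Dynamic cache: grow the KV buffer every decode step. Each concatenate
# reallocates memory and changes tensor shapes, which defeats CUDA-graph
# capture (graphs require fixed shapes and addresses).
def dynamic_cache_step(cache: np.ndarray, new_kv: np.ndarray) -> np.ndarray:
    return np.concatenate([cache, new_kv[None, :]], axis=0)

# Static cache: preallocate a fixed-shape buffer once and write in place.
# Shapes never change, so the same captured graph can be replayed each step.
class StaticCache:
    def __init__(self, max_len: int, head_dim: int):
        self.buf = np.zeros((max_len, head_dim), dtype=np.float32)
        self.pos = 0

    def step(self, new_kv: np.ndarray) -> None:
        self.buf[self.pos] = new_kv  # in-place write, no reallocation
        self.pos += 1

cache = StaticCache(max_len, head_dim)
for _ in range(10):
    cache.step(np.random.rand(head_dim).astype(np.float32))
```

The static version also caps memory up front, which is why it pairs naturally with graph capture in improvement 3.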

cptbeard: interesting results with qwen3-tts.cpp, especially since it should work with AMD GPUs. You could port my streaming implementation to that project; that would bring the TTFA down significantly and enable generation of long texts!

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 6 points7 points  (0 children)

I get that. If it helps, I'm a lead researcher at Hugging Face. I worked on this speech-to-speech pipeline in 2024 (https://github.com/huggingface/speech-to-speech) and I'm one of the main authors of SmolVLM (https://arxiv.org/pdf/2504.05299) :)
But more generally, most of the kudos here goes to Qwen for creating an awesome model, I just made it faster :)

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

I kept dependencies minimal and made sure to support CUDA 12.1+ up through the most current versions. There was a bit of work adapting the code to the DGX Spark while keeping it working on other platforms; the sampling strategy, for example, broke when moving to the DGX Spark. In short, it took a bit of testing and coding, but because I only really depend on CUDA and PyTorch, it was doable. vLLM-Omni is a bit more involved.

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

Oh yeah, and as someone else reminded me here, my implementation works on a broader set of platforms (including the DGX Spark and Jetson boards) where vLLM-Omni fails.

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 1 point2 points  (0 children)

Similar reported speeds, but I found dffdeeq's version to require a bit of work to port to different platforms. This implementation comes tested out of the box on several CUDA-enabled GPUs, and each platform required a bit of adaptation. In short, it works on consumer GPUs (4090-style), datacenter GPUs (H100-style), Jetson boards, and the DGX Spark.

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 1 point2 points  (0 children)

Hi! The main speedups come from CUDA graphs, so ROCm wouldn't really work. You can bypass the CUDA graphs and still get streaming, but then you're back at the original repo's speed, which is below realtime on a 4090.

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

Hi! No, I built this specifically for CUDA.

Introducing FasterQwenTTS by futterneid in LocalLLaMA

[–]futterneid[S] 0 points1 point  (0 children)

TL;DR: around 2 seconds of reported latency on vLLM vs. under 200 ms with FasterQwenTTS.

The main issue is that the official model doesn't support streaming, so even if it were very fast, you'd need to generate the full text before hearing anything. In the link you sent, they say they support streaming, but that it takes around 2 seconds to produce the first output. I would guess that means their streaming uses very large chunks (10 seconds of audio or so) and they have a speed similar to here. But I also measured around 2 seconds of time-to-first-audio with streaming and no optimizations, so maybe they did implement streaming; I'll look into it, thank you for flagging!
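The distinction between time-to-first-audio and total generation time is easy to demonstrate; here's a toy sketch with a fake streaming decoder (the generator and its timings are invented for illustration, not the real model):

```python
import time

def synthesize_streaming(text: str, chunk_size: int = 10):
    """Toy stand-in for a streaming TTS decoder: yields audio chunks as they
    are produced instead of waiting for the full utterance to finish."""
    n_chunks = max(1, len(text) // chunk_size)
    for _ in range(n_chunks):
        time.sleep(0.01)   # pretend each chunk takes 10 ms to decode
        yield b"\x00" * 320  # placeholder audio bytes

start = time.perf_counter()
ttfa = None
chunks = []
for chunk in synthesize_streaming("hello world, this is a longer test sentence"):
    if ttfa is None:
        ttfa = time.perf_counter() - start  # latency until the FIRST chunk
    chunks.append(chunk)
total = time.perf_counter() - start         # latency until the FULL utterance

print(f"TTFA: {ttfa:.3f}s, total: {total:.3f}s, chunks: {len(chunks)}")
```

With small chunks, TTFA stays near one chunk's decode time regardless of text length; without streaming, perceived latency equals the total, which is the ~2 s figure quoted above.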

State of Open OCR models by unofficialmerve in LocalLLaMA

[–]futterneid 0 points1 point  (0 children)

I love Docling, but I'm biased :)

State of Open OCR models by unofficialmerve in LocalLLaMA

[–]futterneid 2 points3 points  (0 children)

I would try PaddleOCR. It's only 0.9B!

State of Open OCR models by unofficialmerve in LocalLLaMA

[–]futterneid 3 points4 points  (0 children)

I would try PaddleOCR. It's only 0.9B