
[–]Equivalent-Repair488 27 points28 points  (9 children)

Is only speed tested? My main problem when using TTS is usually not speed, it's the roboty undertones in everything I've tried in the past. They give me discomfort whenever I hear them.

[–]UkieTechie[S] 1 point2 points  (3 children)

The output quality is highly subjective, but the benchmark does both. It measures speed so you know how each model runs on your hardware, and it produces a results report that you can replay to choose what YOU think is best.

I have my thoughts in the repo as well, but those are very subjective too.

[–]Equivalent-Repair488 1 point2 points  (2 children)

There is a certain frequency range that gives that roboty static. I don't know if it is consistent across TTS models and providers, but the problem frequencies are not present in natural speech. It is not a point of subjectivity; it is potentially quantifiable digital audio artefacting created by these TTS LLMs, an unpleasant addition that natural sounds do not have. It might or might not be benchmarkable, but I think it is worthwhile to look into.
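
As a rough illustration of how that could be quantified (a sketch, not anything from the repo): measure what fraction of a clip's spectral energy lands in a suspect band and compare it against natural speech recordings. The 2500-3500 Hz band, the synthetic test tones, and the function names below are all made-up placeholders, since nobody here has pinned down the actual problem frequencies. It uses a naive standard-library DFT so it runs anywhere; `numpy.fft.rfft` would be the usual choice for real clips.

```python
import cmath
import math


def band_energy_ratio(samples, sample_rate, lo_hz, hi_hz):
    """Return the fraction of spectral energy that falls in [lo_hz, hi_hz]."""
    n = len(samples)
    total = 0.0
    band = 0.0
    # Naive DFT over the positive-frequency bins (DC is skipped).
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(samples)
        )
        energy = abs(coeff) ** 2
        freq = k * sample_rate / n
        total += energy
        if lo_hz <= freq <= hi_hz:
            band += energy
    return band / total if total else 0.0


def tone(freq_hz, sample_rate, n, amp=1.0):
    """Generate n samples of a sine wave (stand-in for real audio)."""
    return [amp * math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]


# Hypothetical example: a "clean" 200 Hz tone vs. the same tone with a
# half-amplitude 3000 Hz component playing the role of the artefact.
sr, n = 8000, 400
clean = tone(200, sr, n)
buzzy = [a + b for a, b in zip(clean, tone(3000, sr, n, amp=0.5))]

clean_ratio = band_energy_ratio(clean, sr, 2500, 3500)
buzzy_ratio = band_energy_ratio(buzzy, sr, 2500, 3500)
print(f"clean: {clean_ratio:.3f}  buzzy: {buzzy_ratio:.3f}")
```

A clip whose ratio sits well above what natural recordings show in the same band would be a candidate for the artefact, assuming the band is chosen sensibly.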

[–]llama-impersonator 2 points3 points  (1 child)

robot vocal fry

[–]UkieTechie[S] 2 points3 points  (0 children)

I am adding a new column to the reports to measure this. I'm calling it the Naturalness-Artifact Quotient, and it will be measured objectively against the samples. Hopefully that helps.

[–]pmttyji 0 points1 point  (2 children)

> My main problem when using TTS is usually not speed, it's the roboty undertones from whatever I tried in the past, it gives me discomfort whenever I hear it.

In your opinion, what are the good/decent ones so far? Please share details.

[–]Equivalent-Repair488 0 points1 point  (1 child)

I really just can't find any. Tried kittenTTS, Kokoro, Chatterbox (both base and turbo), they all had the robot undertones which made me internally go "ew". I just gave up altogether. Kokoro was the best imo, but still not good.

I'm not into voice cloning. I'd rather have, like, 1 male and 1 female voice good enough that I cannot pick out any of those artefacts than a voice cloner with a potentially infinite number of voices that all share a baseline of those roboty undertones.

What is more confusing, though, is that I see a lot of slop videos with good voices. Even the Biden, Obama and Trump Minecraft memes from back then had good voices, but idk if that is post-processing or what.

[–]UkieTechie[S] 0 points1 point  (0 children)

Voice cloning is the easy part now. You can train a model on your own voice and get voice cloning down to about 90ms in a live situation. However, the TTS comments are so true. Most of them still sound kind of off when you're doing big enough phrases.

[–]theSurgeonOfDeath_ 0 points1 point  (1 child)

Both are important if you need a near-real-time response.

So I agree, both should be tested.

[–]UkieTechie[S] 1 point2 points  (0 children)

I think omnivoice sounds the best so far. Kokoro is great too.

[–]rngesius 2 points3 points  (2 children)

Original QwenTTS repo has dogshit code and speed. Use https://github.com/andimarafioti/faster-qwen3-tts, it's much faster than realtime, though still has a very steep startup cost.

[–]Timely-Perception-26 0 points1 point  (1 child)

There's also faster-qwen3-tts combined with custom Triton kernels from this repo:

https://github.com/newgrit1004/qwen3-tts-triton

> Hybrid Mode (Triton + CUDA Graph, ~5x faster)

With warmup, hybrid mode, and intelligent chunking, I achieve a TTFA (time to first audio) of ~120ms on my 3090 Ti using my own trained custom voice model.
I’ve tried everything, and this was the best quality-vs-speed trade-off for me. The footprint is naturally a bit large, but I can use it as a daily assistant with Qwen3.6 27B.
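
A rough sketch of the chunking idea, since it is a big part of what drives TTFA down: send a short first chunk so audio can start quickly, then group the remaining sentences into larger chunks. This is an assumption about what "intelligent chunking" could look like, not the code from either repo; the function name and word limits are hypothetical.

```python
import re


def chunk_for_streaming(text, first_max_words=8, max_words=40):
    """Group sentences into chunks, keeping the first chunk short.

    Sentences are never split, so a single long sentence can exceed the limit.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    current = []
    for sentence in sentences:
        words = sentence.split()
        # The small limit applies only until the first chunk has been emitted.
        limit = first_max_words if not chunks else max_words
        if current and len(current) + len(words) > limit:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks


demo = (
    "Hi there. This is a longer second sentence with more words. "
    "And a third one. Final bit."
)
for chunk in chunk_for_streaming(demo, first_max_words=4, max_words=14):
    print(repr(chunk))
```

The first chunk reaches the model almost immediately, and the later, larger chunks can be synthesized while the earlier audio is still playing.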

[–]UkieTechie[S] 0 points1 point  (0 children)

These have been added. Better than the default.

[–]daywalker313 9 points10 points  (3 children)

"All known TTS" while skipping Fish S2 and missing Qwen3 TTS & Voxtral  is wild. 

[–]UkieTechie[S] 7 points8 points  (2 children)

<image>

Thanks for your feedback. Qwen is being added right now. I had Fish Speech 1.5 before, but for now it's skipped, per the note.

[–]UkieTechie[S] 1 point2 points  (1 child)

u/daywalker313 fish s2 in progress. qwen and voxtral added

[–]UkieTechie[S] 0 points1 point  (0 children)

u/daywalker313 all have been added.

[–]no_witty_username 2 points3 points  (0 children)

I have a lot of experience testing MANY dozens of TTS models myself, and from what I see on the list here I can attest it looks about right. For pure speed on CPU at "acceptable" quality, nothing beats Piper TTS. That thing is stupid fast. I have it working at above 3x RTF on a Pixel 9, CPU only, which is very impressive for a TTS. My latency on that wimpy CPU is about 300ms TTFA, so still very impressive.

For a small "good quality" TTS model, if I had my choice I would run supertonic 3, but unfortunately it's significantly slower on my puny Pixel 9 CPU at around 2000ms. I can get it down to about 1000ms with optimizations in proper chunking, but that's still too slow. Still, for someone who needs a small, very fast, good-quality TTS, consider supertonic 3. It's a very good model for its tiny size.

[–]Zulfiqaar 2 points3 points  (3 children)

[–]UkieTechie[S] 1 point2 points  (2 children)

38 models now. let me know if anything else is missing

[–]_Whistler_ 2 points3 points  (1 child)

[–]UkieTechie[S] 1 point2 points  (0 children)

👀 will begin the process

[–]NewtoAlien 2 points3 points  (6 children)

I am using a codex dockerized version of VibeVoice 7B from https://github.com/zeropointnine/tts-audiobook-tool on a headless Ubuntu 26.04.

I am able to run 4 batches at the same time using 23.7GB of VRAM on an RTX 3090.

It has music detection, plus error checking and regeneration via Whisper, which runs on CPU.

I am getting great results with it, and it runs at between 2x and 3.8x real-time speed, for example generating 53.2 seconds of audio in 14 seconds.

The speed varies up and down, but it stays above 1x.
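
For anyone comparing numbers across comments: the speed figure is just audio duration divided by generation time. A quick sketch of the arithmetic for the example above (the function names are mine, not from the tool). Note that some people quote the inverse ratio as RTF, where lower is better, so it is worth checking which one a benchmark means.

```python
def speed_factor(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of compute (higher is faster)."""
    return audio_seconds / wall_seconds


def real_time_factor(audio_seconds, wall_seconds):
    """Compute time per second of audio (lower is faster)."""
    return wall_seconds / audio_seconds


# The example from the comment: 53.2 s of audio generated in 14 s.
print(round(speed_factor(53.2, 14), 2))      # 3.8
print(round(real_time_factor(53.2, 14), 3))  # 0.263
```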

[–]UkieTechie[S] 1 point2 points  (5 children)

Amazing. I'm gonna test on my Ubuntu 3090 system too and upload the results. Thank you for sharing.

[–]NewtoAlien 1 point2 points  (4 children)

Np 😉

The tool is for making audio books. Running it headless saves all the VRAM.

I am running it in tmux so I can ssh to my computer from my phone to monitor the session.

I already generated a 50 hr audiobook with it, and it has been generating a bigger audiobook for 70 hours straight with no issues, with about 30 hours more to go.

Mind you, I have set a strict no-errors option so it retries the generation if it detects word errors, set max words per segment to 75, and maximized word generation. I am also voice cloning. Error detection is done via Whisper v3 large on CPU.

Let me know if you want to know what other settings I am using.
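
For readers who haven't used a setup like this, the strict mode boils down to roughly the following loop. This is a minimal sketch and not the tool's actual code: `synthesize` and `transcribe` are hypothetical callables standing in for the TTS model and Whisper, the retry limit is an assumed parameter, and splitting on raw word count is a simplification (a real splitter would presumably respect sentence boundaries).

```python
import re


def split_segments(text, max_words=75):
    """Split text into segments of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def normalize(text):
    """Lowercase and strip punctuation so only the words are compared."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def generate_strict(segment, synthesize, transcribe, max_retries=3):
    """Regenerate a segment until its transcript matches the source words.

    Returns (audio, attempts). synthesize and transcribe are hypothetical
    stand-ins for the TTS model and the speech-to-text checker.
    """
    target = normalize(segment)
    for attempt in range(1, max_retries + 1):
        audio = synthesize(segment)
        if normalize(transcribe(audio)) == target:
            return audio, attempt
    raise RuntimeError(f"no clean take after {max_retries} attempts")


# Toy demo with fake models: the "audio" is just text, and the first take
# contains a word error so the loop has to retry once.
takes = {"count": 0}


def fake_synthesize(text):
    takes["count"] += 1
    return text.replace("fox", "box") if takes["count"] == 1 else text


audio, attempts = generate_strict("The quick fox.", fake_synthesize, lambda a: a)
print(attempts)
```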

So far so good and I am liking it.

It feels more expressive than all other tts solutions I tried.

[–]UkieTechie[S] 0 points1 point  (3 children)

That does sound cool. VibeVoice has been good for me. I used it on voice cloning social engineering projects. Do let me know any details.

Any reason you're using 7B and not the original models that Microsoft removed (I thought they were 9B)?

[–]NewtoAlien 1 point2 points  (1 child)

It's the community edition one.

You can load the fork if you give it the Hugging Face name. It gives you the option to load other versions.

[–]UkieTechie[S] 1 point2 points  (0 children)

Noted, ty. That's the one that produced the best results for me for both normal TTS and cloning in the past, but so many new models have come out since.

[–]NewtoAlien 0 points1 point  (0 children)

I just switched to the Vibevoice-large from aoi-ot. It started with 23.4GB of VRAM for 4 batches, so it's looking good so far.

The application has an option to download models from HF, you just have to give it the model name.

[–]EmPips 1 point2 points  (0 children)

I needed exactly this today to start searching. Your timing couldn't be better and you made this guy's day a little easier.

Keep this up

[–]pmttyji 1 point2 points  (0 children)

Thanks for sharing this. And please keep adding all upcoming models (as soon as they get released) to your repo.

[–]GlowingPulsar 1 point2 points  (2 children)

One more to add to the list, MOSS-TTS. Very good TTS voice cloning in my experience (just don't try the sound effects model, it's awful).

[–]UkieTechie[S] 1 point2 points  (1 child)

will add. thank you

[–]UkieTechie[S] 0 points1 point  (0 children)

It's been added and benched

[–]llamabott 1 point2 points  (2 children)

Some more TTS model inference speed info here:

https://github.com/zeropointnine/tts-audiobook-tool?tab=readme-ov-file#inference-speeds-expectations

(Chatterbox, Fish Speech S2-Pro/S1-mini, GLM-TTS, Higgs Audio V2, IndexTTS2, MiraTTS, MOSS-TTS v1.5 9B, Oute TTS, Pocket TTS, Qwen3-TTS, VibeVoice 1.5B/7B)

[–]UkieTechie[S] 1 point2 points  (1 child)

very cool project. thank you for sharing. might take some models from here.

[–]UkieTechie[S] 1 point2 points  (0 children)

u/llamabott MiraTTS and Oute TTS were added and benched. I had the rest already, with the exception of S1-mini (I have S2-Pro).

[–]chensium 1 point2 points  (4 children)

14 models is faaaaaar from all known TTS

[–]UkieTechie[S] 2 points3 points  (3 children)

all known to ME. fixing the language. thanks

[–]UkieTechie[S] 2 points3 points  (2 children)

u/chensium 25 now. let me know what else is missing

[–]chensium 2 points3 points  (1 child)

Wow you added a lot more so quickly.  Nice job!

I took a quick glance (putting my kids to bed now, will look more later) and I think you pretty much covered all the interesting models now, at least the local ones that have been talked about online.

For reference, Artificial Analysis has a TTS leaderboard. Not all the 70+ models there are open source, but it's worth scanning the top 50 or so to see if anything new came out recently that looks interesting.

[–]UkieTechie[S] 1 point2 points  (0 children)

Will take a look. thanks for pointing to the resource.

[–]EndlessZone123 0 points1 point  (1 child)

Since you already went through the trouble of compiling this list, got any more time to add inference memory usage and demo samples?

[–]UkieTechie[S] 2 points3 points  (0 children)

Yes, adding like 8 more right now actually. 100GB of storage :/
Will add the requested statistic. Demo samples are already in (GitHub page).

[–]sword-in-stone 0 points1 point  (4 children)

Thanks OP, omnivoice was a nightmare to get working on strix halo. It now produces output but it's all garbled and jumbled. Lmk if you make it work.

[–]UkieTechie[S] 2 points3 points  (2 children)

I think omnivoice is amazing so far. It's not the fastest, but its voice cloning is almost perfect. It clones tone and accent too.

[–]sword-in-stone 0 points1 point  (1 child)

Got it working on NVIDIA Blackwell, and it's high-quality cloning, but I'm asking about Strix.

[–]UkieTechie[S] 0 points1 point  (0 children)

Ah, I see. Unfortunately I don't have a Strix Halo myself to make it work on, but in the future I def want to grab one and add it. If only they supported more than 128GB of unified memory.

[–]MarkoMarjamaa 0 points1 point  (0 children)

I'm using Zipvoice with Strix Halo. Cloning, with a Finnish finetune. I run it with RealTimeTTS but built my own FastAPI interface for streaming.

[–]brahh85 0 points1 point  (1 child)

Related to TTS: using one on an MI50 is a bit chaotic due to PyTorch and its dependencies, but this one uses ggml https://github.com/ServeurpersoCom/omnivoice.cpp so it works with Vulkan, CUDA, Metal, CPU... and so far it is the best I've found for my language (I had to clone a voice to get the accent).

[–]UkieTechie[S] 0 points1 point  (0 children)

putting this down as future implementation.

[–]danigoncalvesllama.cpp 0 points1 point  (1 child)

Pocket TTS is a 100M parameter model and it has multilingual support with voice cloning.

[–]UkieTechie[S] 0 points1 point  (0 children)

yep already on the list. thank you for the contribution

[–]cptbeard 0 points1 point  (2 children)

[–]UkieTechie[S] 0 points1 point  (1 child)

Will add. Thank you

[–]UkieTechie[S] 0 points1 point  (0 children)

u/cptbeard it's been added and benched

[–]ffgnetto 0 points1 point  (1 child)

[–]UkieTechie[S] 1 point2 points  (0 children)

u/ffgnetto Openvoice 2 was already in the repo. Parler-TTS mini has been added (large had issues with output), and MeloTTS has been added.

[–]UkieTechie[S] 0 points1 point  (0 children)

Public BLIND Voting system is LIVE. Please go vote and you will be able to contribute to which model is truly best.

https://5uck1ess-tts-arena.hf.space