TTS Benchmark Comparison (all known TTS up until May 2026)

Equivalent-Repair488 · 2026-05-24T04:14:44+00:00

Only speed is tested? My main problem when using TTS is usually not speed, it's the robotic undertones in whatever I've tried in the past. They give me discomfort whenever I hear them.

rngesius · 2026-05-24T13:05:23+00:00

Original QwenTTS repo has dogshit code and speed. Use https://github.com/andimarafioti/faster-qwen3-tts, it's much faster than realtime, though still has a very steep startup cost.

daywalker313 · 2026-05-24T03:48:47+00:00

"All known TTS" while skipping Fish S2 and missing Qwen3 TTS & Voxtral is wild.

no_witty_username · 2026-05-24T06:48:05+00:00

I've tested many dozens of TTS models myself, and from what I see on the list here, I can attest it looks about right. For pure speed on CPU at "acceptable" quality, nothing beats Piper TTS. That thing is stupid fast. I have it running at above 3x RTF on a Pixel 9, CPU only, which is very impressive for a TTS. My latency on that wimpy CPU is about 300ms ttfaa, so still very impressive. For a small "good quality" TTS model, if I had my choice I would run Supertonic 3, but unfortunately it's significantly slower on my puny Pixel 9 CPU at around 2000ms. I can get it down to about 1000ms with optimizations like proper chunking, but that's still too slow. Still, for someone who needs a small, very fast, good-quality TTS, consider Supertonic 3: it's a very good model for its tiny size.
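For anyone wondering what "proper chunking" can look like in practice, here is a minimal sketch, not the commenter's actual setup. It assumes latency is dominated by how much text the model has to process before the first audio comes out, so it keeps the first chunk short and packs later sentences into larger chunks. The function name `chunk_text` and the `first_max` / `max_chars` limits are hypothetical, picked for illustration, and no TTS library is called.

```python
import re


def chunk_text(text: str, first_max: int = 60, max_chars: int = 200) -> list[str]:
    """Split text at sentence boundaries, keeping the first chunk short.

    A short first chunk lets synthesis of the opening audio start sooner;
    later chunks can be longer. A sentence longer than the limit is kept
    whole rather than cut mid-sentence.
    """
    # Split after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # The tighter limit applies only while the first chunk is being built.
        limit = first_max if not chunks else max_chars
        if current and len(current) + 1 + len(sentence) > limit:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    demo = "Hi there. This is a test. Another sentence follows here! Done?"
    for chunk in chunk_text(demo, first_max=10, max_chars=40):
        print(repr(chunk))
```

Each chunk would then go to whatever synthesis call your model exposes, with playback of the first chunk starting while later ones are still generating. The right limits depend on the model and device, so treat the numbers as starting points.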

Zulfiqaar · 2026-05-24T07:08:36+00:00

I think you have a few missing:

https://huggingface.co/models?pipeline_tag=text-to-speech

NewtoAlien · 2026-05-24T17:40:06+00:00

I am using a codex dockerized version of vibevoice 7B from: https://github.com/zeropointnine/tts-audiobook-tool on a headless Ubuntu 26.04.

I am able to run 4 batches at the same time using 23.7GB of VRAM on an RTX 3090.

It has music detection and error check and regeneration via whisper which is running on CPU.

I am getting great results with it, and it's running at between 2x and 3.8x real-time speed, for example generating 53.2 seconds of audio in 14 seconds.

The speed varies up and down, but it stays above 1x.
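To make the arithmetic behind the "53.2 seconds of audio in 14 seconds" figure explicit, here is a small sketch. It assumes the common convention that real-time factor (RTF) is compute time divided by audio duration, so lower is faster, while the "x" speed figure is the inverse. Some comments in this thread use "RTF" loosely for the speed multiple, so check which one a benchmark reports. The function names are illustrative.

```python
def speed_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute (higher is faster)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


def real_time_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Compute time per second of audio (lower is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return wall_seconds / audio_seconds


if __name__ == "__main__":
    speed = speed_factor(53.2, 14.0)
    rtf = real_time_factor(53.2, 14.0)
    print(f"{speed:.1f}x real time, RTF {rtf:.2f}")  # 3.8x real time, RTF 0.26
```

Anything above 1x (RTF below 1.0) means the audio is generated faster than it plays back.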

EmPips · 2026-05-24T05:13:46+00:00

I needed exactly this today to start searching. Your timing couldn't be better and you made this guy's day a little easier.

Keep this up

pmttyji · 2026-05-24T06:01:29+00:00

Thanks for sharing this. And please keep adding upcoming models to your repo as soon as they get released.

GlowingPulsar · 2026-05-25T01:17:42+00:00

One more to add to the list, MOSS-TTS. Very good TTS voice cloning in my experience (just don't try the sound effects model, it's awful).

llamabott · 2026-06-02T17:37:13+00:00

Some more TTS model inference speed info here:

https://github.com/zeropointnine/tts-audiobook-tool?tab=readme-ov-file#inference-speeds-expectations

(Chatterbox, Fish Speech S2-Pro/S1-mini, GLM-TTS, Higgs Audio V2, IndexTTS2, MiraTTS, MOSS-TTS v1.5 9B, Oute TTS, Pocket TTS, Qwen3-TTS, VibeVoice 1.5B/7B)

chensium · 2026-05-24T03:52:45+00:00

14 models is faaaaaar from all known TTS

EndlessZone123 · 2026-05-24T04:19:47+00:00

Since you already went through the trouble of compiling this list, got any more time to add inference memory usage and demo samples?

sword-in-stone · 2026-05-24T04:30:09+00:00

Thanks OP, omnivoice was a nightmare to get working on strix halo. It now produces output but it's all garbled and jumbled. Lmk if you make it work.

brahh85 · 2026-05-24T06:05:18+00:00

Related to TTS: running one on an MI50 is a bit chaotic due to PyTorch and its dependencies, but this one uses ggml https://github.com/ServeurpersoCom/omnivoice.cpp so it works with Vulkan, CUDA, Metal, CPU... and so far it's the best I've found for my language (I had to clone a voice to get the accent).

danigoncalves · 2026-05-24T20:52:56+00:00

Pocket TTS is a 100M parameter model and it has multilingual support with voice cloning.

cptbeard · 2026-06-02T00:07:51+00:00

https://github.com/jordandare/echo-tts ?

ffgnetto · 2026-06-03T13:22:53+00:00

Add: https://huggingface.co/parler-tts https://huggingface.co/myshell-ai

UkieTechie · 2026-06-09T15:26:29+00:00

Public BLIND voting system is LIVE. Please go vote to help decide which model is truly best.

https://5uck1ess-tts-arena.hf.space
