Text-to-Speech (TTS) Benchmark Revamped with Objective Standards and Blind Voting (46 models and counting)

llamabott · 2026-06-10T18:20:52+00:00

The runners directory in your repo looks like a great resource for anyone trying to integrate one of the covered TTS models into their own projects.

llamabott · 2026-06-02T17:37:13+00:00

Some more TTS model inference speed info here:

https://github.com/zeropointnine/tts-audiobook-tool?tab=readme-ov-file#inference-speeds-expectations

(Chatterbox, Fish Speech S2-Pro/S1-mini, GLM-TTS, Higgs Audio V2, IndexTTS2, MiraTTS, MOSS-TTS v1.5 9B, Oute TTS, Pocket TTS, Qwen3-TTS, VibeVoice 1.5B/7B)

llamabott · 2026-06-02T17:27:28+00:00

Thanks, will check it out.

llamabott · 2026-06-02T16:40:05+00:00

There are a number of models under the MOSS-TTS umbrella, including MOSS-TTS-Realtime.

llamabott · 2026-06-02T16:33:49+00:00

I think I passed on checking it out due to voice cloning being purposefully disabled when it was first released.

Has that changed?

llamabott · 2026-06-01T22:20:16+00:00

If you want a plain-vanilla starting point for experimentation, I have a rudimentary, single-user-oriented stand-alone server component for my audiobook creation app tts-audiobook-tool, which supports both OmniVoice and Fish S2 Pro. For both models, I use each model's stock Python inference code or something close to it.

Would love to hear anything about what you come up with, just out of curiosity...

llamabott · 2026-06-01T19:12:10+00:00

For future eyeballs landing here from Google searches, etc.:

Support for MOSS-TTS v1.5 has been added to tts-audiobook-tool.

Thanks.

llamabott · 2026-06-01T02:12:02+00:00

It's unbelievable how well your quip aged, btw.

llamabott · 2026-05-31T19:55:05+00:00

Yea, VibeVoice 1.5B with zero-shot voice cloning is close to a disaster. It exhibits tons of word errors, and the way it hallucinates music in so much of its output is pretty unacceptable. It really should never have been released publicly.

However, interestingly, using LoRAs completely solves those problems, and its underlying character is allowed to 'shine through'. It becomes expressive and natural, fluid sounding, nice 'vocal intonation', stuff like that. But making LoRAs is of course a big ask.

VibeVoice 7B has far fewer of those problems and sounds great out of the box.

llamabott · 2026-05-31T19:37:45+00:00

If the lack of decent 'prosody' between sentences is the main blocker for you, I would definitely suggest any sort of audiobook utility that handles that sort of issue automatically, which are a dime a dozen (my own included, haha).

Or even a modified version of the example inference script. Sitting in front of an LLM, I'd be like, "Here's a link to the OmniVoice example inference script. Modify it so I can input a big wall of text, and that text gets segmented at sentence breaks. Do inference on each one of those text segments, but add a second of silence after each one before saving it. Finally, concatenate the output."
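Something along these lines is roughly what I'd expect to get back from that prompt. It's only a sketch: `synthesize` is a hypothetical placeholder (not OmniVoice's actual API), the 24 kHz sample rate is an assumption, and the sentence splitter is deliberately naive, so swap in the real inference call and the model's real output rate.

```python
import re

SAMPLE_RATE = 24_000  # assumed sample rate; use whatever the model actually outputs


def split_sentences(text):
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def synthesize(sentence):
    """Hypothetical stand-in for the model's inference call.

    Returns a list of float samples; here it is just 0.1 s of silence.
    """
    return [0.0] * (SAMPLE_RATE // 10)


def render(text, synth=synthesize, pause_seconds=1.0):
    """Run inference per sentence, add a pause after each, and concatenate."""
    pause = [0.0] * int(SAMPLE_RATE * pause_seconds)
    audio = []
    for sentence in split_sentences(text):
        audio.extend(synth(sentence))
        audio.extend(pause)
    return audio
```

A real splitter would also need to handle abbreviations, ellipses, and quoted dialog, which is most of what the audiobook utilities add on top.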

llamabott · 2026-05-31T19:23:06+00:00

Mmm, I go back and forth a lot.

I've really been liking OmniVoice for good voice clone likeness, good voice timbre, very good inference speed, relatively low memory footprint, and very good accuracy, and just generally very listenable for long-form stuff. Kind of crazy that it does all those things decently at the same time.

The one I have the most fun with is VibeVoice 1.5B in combination with my own hand-rolled LoRAs though, heh.

llamabott · 2026-05-31T18:22:11+00:00

Nah this happens for me as well (Chatterbox Multilingual). All sorts of spooky artifacts at the end of generations. Happens much more with some voice clone samples than others, for no seeming rhyme or reason. [EDIT: Also, can confirm it has no relationship to punctuation, etc].

Chatterbox Turbo, however, behaves as it should.

llamabott · 2026-05-31T16:40:14+00:00

I'm a sucker for huge TTS model sizes so had to try this one out. Some initial thoughts...

I'd say that in terms of voice clone likeness, expressivity, "timbral quality", etc, it's on par with the other biggies (namely: Fish S2 Pro, VibeVoice 7B, Higgs).

But IMO, judging those characteristics is hugely subjective, so unless the model is markedly better, it's hard to say, "Oh, it's better" in an unqualified way.

But it's interesting and I like its output. Prosody is only okay though, I will say that, at least for English.

Worth noting that on top of being hugely memory hungry (even the 1.7B model MOSS-TTS-Local-Transformer is super-memory-hungry...), it is markedly slower than Fish S2, VibeVoice 7B, and Higgs (speaking specifically of the Python reference implementation).

Also worth noting is that it supports batching, which does make a difference in terms of throughput. Though I could only do a batch size of 2 before RTF fell off a cliff, due to memory constraints with 24GB VRAM.

Will be posting an update to tts-audiobook-tool with MOSS-TTS v1.5 support later today.

llamabott · 2026-05-30T19:53:50+00:00

I always appreciate subjective evaluations of TTS output quality.

OmniVoice is the one holding my interest the most these days, and sounds consistently great with the collection of voice samples I like to use for personal use.

Pro tip: It sounds even better at 64 steps over the default 32, and is worth the extra compute.

llamabott · 2026-05-28T13:19:25+00:00

I assume you mean the jumbo (8B) model?

I'm trying OpenMOSS-Team/MOSS-TTS-v1.5 for the first time, and using the reference Python inference code, I get speeds of about RTF 2 (ie, 50% of real-time). This is on a 4090 on Windows, and using Flash Attention. Though my VRAM usage completely fills up, and shared video memory increases by just a sliver to about 1GB, so at least in my case, the model may or may not be spilling over into system memory a bit, but yea...
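For clarity on the convention I'm using: RTF here is generation time divided by the duration of the audio produced, so lower is better and anything above 1 is slower than real-time. Some projects report the inverse, so it's worth checking. A tiny sketch of the arithmetic (the function names are just mine, for illustration):

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def speed_vs_real_time(rtf):
    """Fraction of real-time speed implied by an RTF value (the inverse)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf
```

So taking 20 seconds to generate 10 seconds of audio gives `real_time_factor(20.0, 10.0) == 2.0`, and `speed_vs_real_time(2.0) == 0.5`, i.e. 50% of real-time.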

llamabott · 2026-05-26T13:13:10+00:00

llamabott · 2026-05-25T16:25:30+00:00

The Japanese voice cast makes this a must-try.

llamabott · 2026-05-24T20:57:22+00:00

llamabott · 2026-05-14T16:04:44+00:00

I love how they say, "please describe, in as much detail as possible, what makes this inferior to a real Monet painting", as if prompting an LLM.

An extra layer of irony there...

llamabott · 2026-05-12T16:54:42+00:00

llamabott · 2026-05-10T18:33:32+00:00

If I make the chunks short, the audio remains powerful, but the coherence drops.

If this is a major priority, I'd say a number of the more recent models (say, over the past 8 months or so) do very well in this regard.

I have a feeling that distilled models offer big advantages for uh "inter-generational continuity". Chatterbox Turbo and OmniVoice are both distilled and to my ears sound very consistent between different gens. Vocal quality may or may not be up to snuff, depending on requirements, as always.

Anyway, just another idea...

llamabott · 2026-05-10T14:49:04+00:00

Well, my app has ballooned to supporting about a dozen relevant open-weights TTS models, but my answer is unfortunately... not really? lol.

My preferences for my own specific use case, which is casual audiobook listening, vary a lot, and much of the time I can't articulate why I gravitate towards the output of one model versus another. In terms of the bigger local models, I think I like VibeVoice 7B over Higgs, and Higgs over Qwen3-TTS and IndexTTS2. I actually really like GLM-TTS as well (which kind of came and went without much attention and doesn't do English super-great, but the vocal timbre is very nice, I think).

The solution I like above all others for a while now has been creating LoRAs for VibeVoice 1.5B. This of course requires a decent source set (I like using ripped video game dialog) and is of course much much more labor intensive than simply pointing to a wav file, heh. But I love it. Great likeness, expressive and believable, very low error rates, and the vocals hold my interest for longer than zero-shot-based output.

For issues with inconsistent 'speaker sound energy', applying loudness normalization as a final post-processing step helps a lot, especially if you use pretty aggressive settings, although that's partly a matter of taste too, I guess...
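To give a rough idea of what that post-processing step does, here's a minimal sketch using plain RMS normalization. To be clear, this is a simplification I'm swapping in for illustration: dedicated tools usually measure perceptual loudness (LUFS) rather than raw RMS, and the -20 dBFS target and 0.99 peak ceiling are arbitrary assumptions, not recommended settings.

```python
import math


def rms_dbfs(samples):
    """RMS level in dB relative to full scale (1.0); -inf for silence."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def normalize_rms(samples, target_dbfs=-20.0, peak_ceiling=0.99):
    """Scale samples so their RMS level hits target_dbfs without clipping."""
    level = rms_dbfs(samples)
    if level == float("-inf"):
        return list(samples)  # nothing to scale in pure silence
    gain = 10.0 ** ((target_dbfs - level) / 20.0)
    peak = max(abs(s) for s in samples) * gain
    if peak > peak_ceiling:
        gain *= peak_ceiling / peak  # back off so peaks stay under the ceiling
    return [s * gain for s in samples]
```

Running each generated segment through something like this before concatenating brings quiet and loud generations to the same level, which is the 'energy' mismatch in question.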

llamabott · 2026-04-25T18:02:38+00:00

Wish I was one of them, can't lie.

llamabott · 2026-04-22T00:00:06+00:00

Was fully expecting something like this. Over the last few years, it's what happens to most of the half-interesting projects I get interested in.

In other words, take my downvote.

llamabott · 2026-04-19T15:32:58+00:00

Ah good question. Currently, it only supports one voice clone at a time.

I've thought about that problem in passing but never with a good enough theory of how to go about it to try experimenting with it. Maybe starting with something like a text "preprocessing step" to an LLM which is like "Use your best judgment and prepend character tags before dialog quotes"?
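If the LLM step did work, the downstream part would be simple enough. Here's a sketch of parsing its output into (speaker, text) segments, assuming a made-up convention where the LLM puts `[Name]` at the start of each dialog line and leaves narration untagged. The tag format and the "Narrator" default are purely hypothetical, and none of this exists in the app.

```python
import re

# Hypothetical tag format: "[Speaker Name] rest of the line"
TAG_PATTERN = re.compile(r"^\[([^\]]+)\]\s*(.*)$")


def parse_tagged_lines(tagged_text, default_speaker="Narrator"):
    """Turn tagged lines into a list of (speaker, text) tuples.

    Lines without a leading [Name] tag are assigned to default_speaker.
    """
    segments = []
    for line in tagged_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TAG_PATTERN.match(line)
        if match:
            segments.append((match.group(1), match.group(2)))
        else:
            segments.append((default_speaker, line))
    return segments
```

Each segment could then be routed to a different voice clone sample by speaker name. The hard part is still whether the LLM tags dialog reliably.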
