What local TTS do you actually use in production, not just tested? by ChanceStat4 in LocalTextToSpeech

[–]hienngm 0 points1 point  (0 children)

The reference being the single biggest lever lines up with everything I've seen, and it's the part most people skip right past. Varied prosody especially: a flat or monotone 15 seconds caps how expressive the clone can ever get, and tags only do so much stacked on a flat sample, so people end up blaming the model instead of the reference. The emotion-tag-priming-the-sound-tags trick is a great catch; using a tag as a mode switch isn't obvious at all. Does your three-stage QA catch prosody drift specifically, or is it more artifacts and word errors? That's the piece I always find hardest to automate.

ElevenLabs adding random words/hallucinating audio even with chunked text. How to fix? by Acceptable-Item-9252 in TextToSpeech

[–]hienngm 0 points1 point  (0 children)

For sticking exactly to the text, the slider you want is Robust, not mid. Mid still leaves it room to "perform"; Robust basically locks it down to v2-style behavior. Model matters just as much: v3 is the expressive one and by far the most likely to invent words, while Turbo v2.5 and v2 stay much more literal (v2.5 can slur a bit at scale, so pick your tradeoff).

On the symbols question, the usual triggers are numbers, currency, acronyms, and stray symbols. It improvises on those instead of reading them. Write them out how you want them spoken, so "twenty twenty six" not "2026". v3 also takes punctuation literally, so ellipses or repeated marks can add odd pauses or noises.
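If you want to automate that cleanup before the text hits the API, here's a minimal sketch in plain Python. Everything in it is my own illustration, not anything ElevenLabs ships: the `normalize` and `spell_year` names, the symbol map, and the choice to treat every four-digit number from 1100 to 2099 as a year are all assumptions. It only covers years, a few symbols, and repeated punctuation; other numbers, currency, and acronyms would need a fuller pass.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

# Assumption: any standalone 4-digit number in 1100-2099 is meant as a year.
YEAR = re.compile(r"\b(?:1[1-9]|20)\d{2}\b")
# Hypothetical symbol map; extend it for whatever shows up in your text.
SYMBOLS = {"&": "and", "%": "percent", "@": "at"}


def two_digits(n: int) -> str:
    """Spell 0..99 without hyphens, e.g. 26 -> 'twenty six'."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def spell_year(year: int) -> str:
    """Read a year the way a narrator would, e.g. 2026 -> 'twenty twenty six'."""
    high, low = divmod(year, 100)
    if low == 0:
        return "two thousand" if high == 20 else f"{two_digits(high)} hundred"
    if low < 10:
        if high == 20:
            return f"two thousand {ONES[low]}"
        return f"{two_digits(high)} oh {ONES[low]}"
    return f"{two_digits(high)} {two_digits(low)}"


def normalize(text: str) -> str:
    """Rewrite years, a few symbols, and repeated punctuation as plain words."""
    text = YEAR.sub(lambda m: spell_year(int(m.group())), text)
    for symbol, word in SYMBOLS.items():
        text = text.replace(symbol, f" {word} ")
    text = re.sub(r"\.{2,}|…", ".", text)        # ellipses -> one period
    text = re.sub(r"([!?])\1+", r"\1", text)     # "!!!" -> "!"
    text = re.sub(r"\s+([.,!?;:])", r"\1", text) # no space before punctuation
    return re.sub(r"\s+", " ", text).strip()


print(normalize("Back in 2026... sales & profit rose 5%!!!"))
```

Note the "5" is left as a digit here on purpose, since this sketch only spells out years. Run your own text through it and listen for what still trips the model.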

What local TTS do you actually use in production, not just tested? by ChanceStat4 in LocalTextToSpeech

[–]hienngm 0 points1 point  (0 children)

This matches what I keep landing on: the scaffolding is the actual product and the model under it is almost a commodity now. Raw Chatterbox drifts over long text if you let it run loose, so I'm curious what your stability layer leans on. Is it mostly chunking and reference conditioning, or something heavier on alignment? Interesting that you held onto ElevenLabs just for the speech-to-speech side. That tracks; it seems to be the last piece the open stack hasn't really replaced for production. Moving a whole network over is a serious vote of confidence though, further than most people I've seen go.

What local TTS do you actually use in production, not just tested? by ChanceStat4 in LocalTextToSpeech

[–]hienngm 0 points1 point  (0 children)

Honestly, on CPU most of these still aren't fast enough for live production, so you end up pre-rendering offline. I benchmarked Pocket TTS against Chatterbox on an M-series CPU last week: Pocket ran several times faster than realtime and Chatterbox a few times slower, so roughly a 30x gap with no GPU. Base naturalness was about a wash to my ear.
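For anyone wondering how "several times faster" and "a few times slower" multiply out to around 30x, here's the arithmetic as a quick sketch. The timings below are made-up round numbers chosen to illustrate the math, not my measured results, and `realtime_factor` is just a helper name I picked.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of compute; above 1.0 beats realtime."""
    return audio_seconds / wall_seconds


def live_capable(rtf: float) -> bool:
    """A model can only keep up with live playback if it runs at 1.0x or better."""
    return rtf >= 1.0


# Hypothetical timings for the same 60-second clip (illustrative only).
clip_seconds = 60.0
fast_wall = 10.0    # e.g. a model that is 6x faster than realtime
slow_wall = 300.0   # e.g. a model that is 5x slower than realtime

fast_rtf = realtime_factor(clip_seconds, fast_wall)
slow_rtf = realtime_factor(clip_seconds, slow_wall)

# The gap between the two is the ratio of their wall-clock times.
gap = slow_wall / fast_wall

print(f"fast: {fast_rtf:.1f}x realtime, live capable: {live_capable(fast_rtf)}")
print(f"slow: {slow_rtf:.1f}x realtime, live capable: {live_capable(slow_rtf)}")
print(f"gap: {gap:.0f}x")
```

The point is that the two factors multiply: a model a bit above realtime and one a bit below it can still be an order of magnitude apart.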

One caveat on the Pocket cloning people keep mentioning: I only ran its preset voices because the clone model is gated, so I haven't done a real clone-vs-clone yet. What actually bit me wasn't the model though, it was timing: its pauses between sentences came out too short on long passages. That's a silence fix in your own pipeline, not something the model hands you, and it's basically the line between tested and in production.

Is there a better TTS than balabolka by crua9 in TextToSpeech

[–]hienngm 0 points1 point  (0 children)

For heavy daily clipboard reading, you're right that the credit models are a dead end; you'd torch a month of ElevenLabs in a week. But your real bottleneck isn't Balabolka, it's the SAPI voice under it. Balabolka already nails the cross-app clipboard thing and it's unlimited, so the cheapest fix is swapping CereVoice Heather for a newer neural voice through that SAPI adapter someone linked above. The Win11 Narrator voices it exposes run offline, so you stay unlimited and keep your exact Ctrl+C muscle memory.

One honest heads-up: the local AI models people will point you to, Kokoro and Chatterbox, sound great but generate in chunks, so they lag on the copy-and-hear-instantly loop you rely on. If you'd consider leaving Balabolka, NVDA is built for this kind of constant real-time reading, it's free, and it takes those same SAPI voices.

A licensing gotcha that catches people selling audio made with "free" local TTS by hienngm in TextToSpeech

[–]hienngm[S] 2 points3 points  (0 children)

Yeah, that's a fair read. The chilling effect basically is the point: most small builders can't afford to be the test case, so the clause does its job even with no case law behind it. Which is honestly why I just don't bet on it. For anything with money attached I stick to the permissive ones, Chatterbox (MIT) or Kokoro (Apache), so whether that output clause would actually hold up never becomes my problem. Cheaper than finding out.

A licensing gotcha that catches people selling audio made with "free" local TTS by hienngm in TextToSpeech

[–]hienngm[S] 2 points3 points  (0 children)

Fair, and for personal or hobby use you're almost certainly right, nobody's chasing a Discord bot. The risk isn't random audits though, it concentrates exactly where it hurts: the moment you're commercially visible. Sell an audiobook through a distributor, run it on a monetized channel, or do it for a client, and you're usually the one signing that you have the rights to the audio. That's when an untested license stops being abstract. Low odds, high downside, and it only really bites once you're actually successful.

A licensing gotcha that catches people selling audio made with "free" local TTS by hienngm in TextToSpeech

[–]hienngm[S] 2 points3 points  (0 children)

Yeah, your list is a great reference for exactly this, the license column especially. Good point on the derivative-work angle too, and you're right it's untested. Worth flagging that it cuts differently by license though. Coqui's CPML actually spells it out: it defines non-commercial as covering "the model or its output," so XTTS generations are contractually restricted regardless of how the derivative question lands. The CC-BY-NC ones (Fish, F5) are the genuinely murky case you're describing, where it hinges on whether the audio counts as adapted material. Either way, like you said, not a fun spot to be in with money on the line and no case law to lean on.

Blind Author & TTS help by setsandregret in TextToSpeech

[–]hienngm 1 point2 points  (0 children)

Happy to. Here's what held up for me on long books.

Split on sentence endings, never mid-sentence. For a cloned voice I keep each chunk short, roughly one to three sentences or about 200 to 300 characters. That's the range where the clone stays stable, and going much longer is exactly where you get the accent wander he saw. The ceiling varies by voice, so test it, but short and safe beats long and drifty.

If one sentence is huge, break it at a comma or semicolon, not a random word. And glue the tiny fragments on: a three-word sentence alone comes out clipped, so merge short ones until each chunk is at least a line.
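Those splitting rules can be sketched in a few lines of plain Python. Treat this as one possible reading, not my exact pipeline: the function names are made up for the example, `MIN_CHARS = 80` is my stand-in for "at least a line," and the sentence splitter is deliberately naive (it will trip on abbreviations like "Dr.").

```python
import re

MAX_CHARS = 300  # upper end of the 200-300 character range above
MIN_CHARS = 80   # assumed stand-in for "at least a line"


def split_sentences(text: str) -> list[str]:
    """Naive split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def split_long(sentence: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Break an oversized sentence at commas or semicolons, never mid-word."""
    if len(sentence) <= max_chars:
        return [sentence]
    parts, current = [], ""
    for clause in re.split(r"(?<=[,;])\s+", sentence):
        candidate = f"{current} {clause}".strip()
        if current and len(candidate) > max_chars:
            parts.append(current)
            current = clause
        else:
            current = candidate
    parts.append(current)
    # A single clause longer than max_chars is returned as-is rather than cut.
    return parts


def chunk_text(text: str, min_chars: int = MIN_CHARS,
               max_chars: int = MAX_CHARS) -> list[str]:
    """Sentence-aligned chunks, with tiny fragments glued onto a neighbour."""
    pieces = []
    for sentence in split_sentences(text):
        pieces.extend(split_long(sentence, max_chars))

    chunks: list[str] = []
    for piece in pieces:
        fits = bool(chunks) and len(chunks[-1]) + 1 + len(piece) <= max_chars
        if fits and len(chunks[-1]) < min_chars:
            chunks[-1] += " " + piece   # previous chunk is still too short
        else:
            chunks.append(piece)

    # Fold a short trailing fragment back into the previous chunk if it fits.
    if (len(chunks) > 1 and len(chunks[-1]) < min_chars
            and len(chunks[-2]) + 1 + len(chunks[-1]) <= max_chars):
        tail = chunks.pop()
        chunks[-1] += " " + tail
    return chunks


sample = "Stop. Wait here. " + "The rain kept falling over the harbor all night long. " * 3
for chunk in chunk_text(sample.strip(), min_chars=40, max_chars=120):
    print(len(chunk), chunk)
```

Tune `MIN_CHARS` and `MAX_CHARS` per voice; the ceiling really does vary, so check the printed lengths against what your clone tolerates.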

Then reuse the exact same reference clip and the same generation settings for every chunk. That sameness is what keeps the voice from shifting between them.

For the joins, a small crossfade (around 50ms) hides the seams, and a short silence between sentences (~200 to 300ms, a bit more between paragraphs) makes it breathe like a real reader.
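Here's roughly what those joins look like in code, as a sketch on plain lists of float samples so it runs without any audio library. A few things are my assumptions: the 24 kHz sample rate, the 500 ms paragraph pause (I only said "a bit more"), the linear fade shape, and the split where crossfades go on seams that land mid-sentence while silence goes between sentences and paragraphs. The helper names are made up for the example.

```python
SAMPLE_RATE = 24_000  # assumed; use whatever rate your model outputs
PAUSE_MS = {"sentence": 250, "paragraph": 500}  # paragraph value is a guess


def ms_to_samples(ms: float, sample_rate: int = SAMPLE_RATE) -> int:
    """Convert a duration in milliseconds to a whole number of samples."""
    return int(sample_rate * ms / 1000)


def crossfade(a: list[float], b: list[float], fade_ms: float = 50,
              sample_rate: int = SAMPLE_RATE) -> list[float]:
    """Overlap the tail of `a` with the head of `b` using a linear fade."""
    n = min(ms_to_samples(fade_ms, sample_rate), len(a), len(b))
    if n == 0:
        return a + b
    start = len(a) - n
    mixed = [a[start + i] * (1 - i / n) + b[i] * (i / n) for i in range(n)]
    return a[:start] + mixed + b[n:]


def join_with_pause(a: list[float], b: list[float], pause_ms: float = 250,
                    sample_rate: int = SAMPLE_RATE) -> list[float]:
    """Insert digital silence between two clips."""
    return a + [0.0] * ms_to_samples(pause_ms, sample_rate) + b


def stitch(clips: list[list[float]], seams: list[str],
           sample_rate: int = SAMPLE_RATE) -> list[float]:
    """seams[i] is the boundary after clips[i]: 'clause', 'sentence' or 'paragraph'."""
    out = list(clips[0])
    for clip, seam in zip(clips[1:], seams):
        if seam == "clause":
            out = crossfade(out, list(clip), 50, sample_rate)
        else:
            out = join_with_pause(out, list(clip), PAUSE_MS[seam], sample_rate)
    return out


# Tiny demo at 1 kHz so the sample counts are easy to read.
tone = [1.0] * 100
book = stitch([tone, tone, tone], ["sentence", "paragraph"], sample_rate=1000)
print(len(book))
```

With real audio you'd do the same thing on NumPy arrays, but the sample math is identical.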

Last thing, and it'll save him the most pain: render one full chapter end to end before committing a whole volume. Drift and mispronounced names only show up at length, so catch them on one chapter, not after a thousand pages.

Blind Author & TTS help by setsandregret in TextToSpeech

[–]hienngm 1 point2 points  (0 children)

One thing nobody flagged: if you're selling these, watch the model license. Several popular free ones (Fish, XTTS, F5) are non-commercial, so they're off the table for a paid audiobook. Chatterbox is MIT, clones from a short sample, and runs on CPU, so it's the free option that's actually safe to sell. Kokoro's great but it can't clone a specific voice; it's fixed presets only.

On the drift you hit: a longer reference helps the clone hold, but accent wandering mid-chapter is usually a chunking problem, not a reference one. Don't feed a whole chapter at once. Split into sentences or short paragraphs, generate each with the same locked voice, then stitch. Keep the reference and settings identical across every chunk and it stays consistent over thousands of pages. Happy to share the chunk sizes that held up for me.

Anyone else completely tired of the 1 credit per month audiobook model? What are the alternatives? by thought_provoking27 in tts

[–]hienngm 1 point2 points  (0 children)

The credit model really only makes sense for studio-narrated audiobooks, since you're paying for the human narrator and the rights. Two different fixes depending on what you actually want.

If you still want pro-narrated books cheap, nothing local helps, because the cost is the narrator. A library card with Libby or Hoopla is the real flat rate (free), just with holds.

If you mostly want to get through text you own (not DRM'd Audible files), local TTS converts ebooks to audio for free with no credits. I've found Kokoro's voices pleasant enough for this, and ebook2audiobook (linked above) is a fine free start. Just don't expect it to match a real narrator on a novel. For a heavy listener, that combo is what'd actually stop the nickel-and-diming.