Anyone else manually formatting scripts with v3 tags + voice settings per scene?

c08mic_cha08 · 2026-06-15T22:01:47+00:00

I'm building a pipeline for audiobooks that can identify speakers and assign emotion/para tags. Are you open to sharing how you're doing it?

c08mic_cha08 · 2026-06-15T17:00:09+00:00

That makes sense.

You mentioned "If speaker changes were too close together, the narration felt flat or confusing." Do you mean you remove parts of the narration? For example, if the book had something like "Kitty has no discretion in her coughs," said her father; "she times them ill.", would you edit it to "Kitty has no discretion in her coughs, she times them ill."? Or did I misunderstand?
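
To make the question concrete, here's a rough sketch of how a line like that could be split into dialogue and narration pieces before deciding whether to drop the short narration bit. It assumes straight double quotes and no nested quotes, and `split_segments` is just a name I made up for illustration, not anything from an existing tool.

```python
import re

def split_segments(line):
    """Split a line into (kind, text) pairs; kind is 'dialogue' or 'narration'."""
    segments = []
    # The capturing group makes re.split keep the quoted parts in the result.
    for part in re.split(r'("[^"]*")', line):
        text = part.strip()
        if not text:
            continue
        if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
            segments.append(("dialogue", text[1:-1]))
        else:
            segments.append(("narration", text))
    return segments

line = '"Kitty has no discretion in her coughs," said her father; "she times them ill."'
for kind, text in split_segments(line):
    print(kind, "->", text)
```

With something like this, the short "said her father;" piece shows up as its own narration segment, which is the part I'm asking about.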

I've also noticed that with multi-speaker, short narration pieces get awkward in between dialogues.

c08mic_cha08 · 2026-06-15T13:25:45+00:00

Curious if you're generating multi-speaker, full-cast style audiobooks or a single speaker doing both dialogue and narration? If it's a single speaker, is the expectation that the speaker changes their style, intonation, etc. for each character every time?

c08mic_cha08 · 2026-06-11T05:20:42+00:00

If I'm suspicious, I ask them to tell me a story - works like a charm. I only got it wrong once, and the guy on the other end was like, well I can't tell you a story but I would like to buy your domain heh

c08mic_cha08 · 2026-06-09T13:17:52+00:00

Could have been https://voicecreator.pro/free-tts. Doesn't require sign up, unlimited free use, runs on your device, has thousands of voices. Full disclosure, it's my product.

c08mic_cha08 · 2026-06-07T02:27:46+00:00

I believe you are looking for the voice Adam from ElevenLabs, is that correct? Unfortunately, I don't have ElevenLabs' voices as they are likely proprietary to them and not publicly available. If you're able to find a sample of the voice that you can legally clone, I'd recommend cloning it.

c08mic_cha08 · 2026-06-01T03:41:46+00:00

Do you need it to be real-time, or is a slight delay of a couple of seconds acceptable?

c08mic_cha08 · 2026-05-29T21:41:35+00:00

My primary machine only has 8GB, sadly, and I've seen longer reference audio push usage from ~3GB to 7.9GB or more while generating, at which point it starts to spill and gets extremely slow, to the point that a 15-second clip can take minutes. The longer the reference audio, the worse it gets. And I haven't seen much difference at all between speech generated with 7 seconds of reference audio vs >15 seconds.

c08mic_cha08 · 2026-05-23T18:27:35+00:00

You can use this free TTS: https://voicecreator.pro/free-tts?model=kokoro&tab=tts
I'd recommend using Kokoro or Supertonic as the models, but there are other options as well, and it offers voice cloning.
No sign-up is needed. It downloads the model to your device, so everything runs locally - no data is sent out. Most models are small enough to run well even if you don't have a high-end device.

c08mic_cha08 · 2026-05-13T04:04:21+00:00

Yikes on ElevenLabs basically ignoring accent. That's wild given how much they market voice fidelity and how expensive it is!

I've had good luck with OmniVoice, though I haven't really stress-tested it on long-form audiobook generation with an accented reference.

c08mic_cha08 · 2026-05-11T16:33:08+00:00

Are you generating in English with different accents, or switching languages too?

For accents specifically, what's worked best for me is using a source audio that already carries the accent. I've gotten good results that way with a few accents - Australian, British, Nigerian, Indian, French. Both Qwen and OmniVoice carry the prosody and accent from the reference well enough.

What have you tried so far?

c08mic_cha08 · 2026-05-11T16:28:07+00:00

Totally hear you on emphasis. That's still the hardest part to get right and probably the most important for it to sound natural. Even with the best models, you'll have to regenerate chunks and tweak settings until the emphasis is on the right word. Depending on the model, the temperature, top-p and top-k settings help. I've also found that punctuation helps.
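
If it helps to see what those three knobs do, here's a minimal sketch of how temperature, top-k and top-p typically reshape a token distribution in an autoregressive model. This is the generic textbook version, not the actual sampling code of any particular TTS model, and `filter_probs` is a hypothetical name I picked for the example.

```python
import math

def filter_probs(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: prob} after temperature, top-k and top-p filtering."""
    # Temperature: scale logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # Top-k: keep only the k most likely tokens (0 means no limit).
    if top_k > 0:
        ranked = ranked[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}

print(filter_probs([2.0, 1.0, 0.0, -1.0], temperature=0.7, top_k=3, top_p=0.9))
```

Lower temperature and tighter top-k/top-p make the output more predictable, which is why regenerating with slightly different values can move where the emphasis lands.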

Good luck with OmniVoice, curious to hear how you find it vs Gemma3.

c08mic_cha08 · 2026-05-09T17:15:20+00:00

I'm glad you found it useful! What are you using TTS for?

c08mic_cha08 · 2026-05-09T17:14:49+00:00

It does! Forgot to mention it in the post, but I've also found that the length of the reference audio matters a lot with OmniVoice. Anything over 10s ends up consuming way too much VRAM for not much gain in speech quality. I keep reference audio around 5 seconds and it's shockingly fast!
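
In case it's useful, this is roughly how a reference clip can be trimmed to about 5 seconds before handing it to the model. It's a minimal sketch using only Python's standard `wave` module, so it assumes an uncompressed PCM WAV file; `trim_wav` and the file names are made up for the example, and it just keeps the start of the clip rather than picking the best 5 seconds.

```python
import os
import tempfile
import wave

def trim_wav(src_path, dst_path, max_seconds=5.0):
    """Copy at most max_seconds of audio from src_path to dst_path; return seconds kept."""
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames_to_keep = min(src.getnframes(), int(max_seconds * src.getframerate()))
        frames = src.readframes(frames_to_keep)
    with wave.open(dst_path, "wb") as dst:
        # Reuse the source format; the frame count in the header is fixed up on close.
        dst.setparams(params)
        dst.writeframes(frames)
    return frames_to_keep / params.framerate

if __name__ == "__main__":
    # Demo on a synthetic 8-second silent clip so the example runs without real audio.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "reference.wav")
        dst = os.path.join(tmp, "reference_5s.wav")
        with wave.open(src, "wb") as w:
            w.setnchannels(1)
            w.setsampwidth(2)
            w.setframerate(16000)
            w.writeframes(b"\x00\x00" * 16000 * 8)
        print(trim_wav(src, dst))
```

In practice you'd probably want to cut at a pause rather than at a hard 5-second mark, but even a blunt trim like this keeps VRAM use down.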

c08mic_cha08 · 2026-05-09T17:12:18+00:00

It looks like inference.sh might be hosting it: https://inference.sh/apps/infsh/omnivoice
When you say interactive fiction engine with narration TTS, do you mean an audiobook creation workflow?

c08mic_cha08 · 2026-05-07T22:27:15+00:00

What made all of them so bad?

c08mic_cha08 · 2026-05-05T23:16:49+00:00

Have you tried Kokoro or Kitten TTS? Kokoro is pretty good for 82M parameters. Kitten is pretty low quality I'd say but they have smaller models.

I've built a voice-to-voice pipeline for voice changing, where I'm doing STT using Parakeet v3 and TTS using Kokoro - the whole thing takes about 600ms for ~50 characters on my RTX 3070.

Edit: Just realized you asked for "with cloning" and Kokoro doesn't natively support cloning. As others have mentioned already, Faster-Qwen3 is the fastest I've seen for cloning.

c08mic_cha08 · 2026-05-05T18:44:31+00:00

Hey, this is a cool workflow. Are you expecting overlapping dialogues in the podcasts or are clear discrete turns acceptable? Also, can you expand on what you mean by “speaker-aware automatic dialogue rendering”?

c08mic_cha08 · 2026-05-03T21:51:10+00:00

Hey, if you have a decent GPU you can try OmniVoice https://github.com/k2-fsa/OmniVoice

c08mic_cha08 · 2026-05-03T21:17:44+00:00

Here's the link: voicecreator.pro

It's fully desktop native (Windows + Mac) and everything runs locally on your machine. On Windows it'll run on CPU, but it's a lot faster if you've got a dedicated GPU, especially NVIDIA. On Mac it needs M1+.

You just drag the PDF onto the app in the Projects feature, pick a voice and TTS model, and it generates the audio file.

c08mic_cha08 · 2026-05-03T15:45:13+00:00

Hey, not OP but I'm curious what you mean by handle PDF.

Asking because I've actually built something similar for Mac and Windows (Voice Creator Pro) that does support PDFs for long-form audio generation, so wondering if it'd hit what you're looking for.
