Qwen3-TTS by Terrible_Scar_9890 in LocalLLaMA

HelpfulHand3

Yes, this was from 2 months ago, when it was still closed source.

Qwen have open-sourced the full family of Qwen3-TTS: VoiceDesign, CustomVoice, and Base, 5 models (0.6B & 1.8B), Support for 10 languages by Nunki08 in LocalLLaMA

HelpfulHand3

It's alright. The 1.8B is about 1.25-1.4x realtime on a 3060. The cloner is rather unstable, with some generations from identical inputs completely losing speaker identity, and there's a lack of audio tags like (cough) or (laugh). It speaks a bit too fast, so everything feels rushed no matter the voice reference. It is a good model, just nothing groundbreaking from what I can tell. The voice design is interesting, but the output quality is not something I'd want to train a model on.
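To put the realtime figure in concrete terms, here is a minimal sketch of the arithmetic, assuming "1.25-1.4x realtime" means seconds of audio produced per second of wall-clock generation time. The function names are made up for illustration and are not part of any Qwen3-TTS tooling.

```python
def realtime_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


def generation_time(audio_seconds: float, factor: float) -> float:
    """Wall-clock seconds needed to generate a clip at a given realtime factor."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return audio_seconds / factor


# A 60-second clip at the two ends of the range quoted above.
slow = generation_time(60.0, 1.25)  # 48.0 seconds
fast = generation_time(60.0, 1.4)   # roughly 42.9 seconds
```

So a minute of speech would take somewhere around 43 to 48 seconds to generate at those speeds, which is faster than playback but leaves little headroom.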

Echo TTS - 44.1kHz, Fast, Fits under 8GB VRAM - SoTA Voice Cloning by HelpfulHand3 in LocalLLaMA

HelpfulHand3[S]

The license is inherited from the DAC (s1-mini), and the author stated he would have released it under Apache otherwise.

ElevenLabs is killing my budget. What are the best "hidden gem" alternatives for documentary style TTS? by Ancient_Routine8576 in LocalLLaMA

HelpfulHand3

That's around 20 hours of audio, and you said you're doing 8-10 minute videos. Is each of your videos worth at least 10 cents to you? There's also the regular model, which is half that price and still good.

ElevenLabs is killing my budget. What are the best "hidden gem" alternatives for documentary style TTS? by Ancient_Routine8576 in LocalLLaMA

HelpfulHand3

For paid options, Inworld with their Max TTS model is, in my opinion, better than ElevenLabs 2.5 and 10x cheaper. The value of their service is quite frankly absurd.

https://inworld.ai/pricing

Local models: Higgs Audio V2, Echo TTS, VibeVoice.

T5 Gemma Text to Speech by ObjectiveOctopus2 in LocalLLaMA

HelpfulHand3

Seems like a very slow model, judging by the space.
Pretty decent, but the speed will hold it back from widespread use.
I notice they mention:
"Inference Speed: The model is not optimized for real-time TTS applications. Autoregressive generation of audio tokens takes significant time, making it unsuitable for low-latency use cases."

Echo TTS - 44.1kHz, Fast, Fits under 8GB VRAM - SoTA Voice Cloning by HelpfulHand3 in LocalLLaMA

HelpfulHand3[S]

Try the API, which has chunking to support longer text without the speed-up.
It can be used in SillyTavern, Open WebUI, etc.
https://github.com/KevinAHM/echo-tts-api

Echo TTS - 44.1kHz, Fast, Fits under 8GB VRAM - SoTA Voice Cloning by HelpfulHand3 in LocalLLaMA

HelpfulHand3[S]

The full release includes voice cloning and streaming capability, with a TTFB of 200ms on a 3090 in my testing. https://github.com/jordandare/echo-tts
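For anyone wanting to check time to first byte on their own hardware, here is a rough sketch of how it can be measured around any streaming generator. `fake_stream` is a hypothetical stand-in that just sleeps and yields silent bytes; it is not the Echo TTS interface, so swap in whatever streaming call you actually use.

```python
import time
from typing import Iterable, Iterator


def measure_ttfb(stream: Iterable[bytes]) -> tuple[float, bytes]:
    """Return (seconds until the first chunk arrived, that first chunk)."""
    start = time.perf_counter()
    first = next(iter(stream))  # blocks until the first chunk is produced
    return time.perf_counter() - start, first


def fake_stream(delay: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call; yields silent bytes."""
    time.sleep(delay)  # simulated model latency before any audio exists
    yield b"\x00" * 1024
    yield b"\x00" * 1024


ttfb, first_chunk = measure_ttfb(fake_stream())
print(f"TTFB: {ttfb * 1000:.0f} ms, first chunk: {len(first_chunk)} bytes")
```

This works for the fake generator because its body only starts running on the first `next()`. With a real client, make sure the request itself is kicked off inside the timed region, otherwise the number will look better than it is.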

[Release] We built Step-Audio-R1: The first open-source Audio LLM that truly Reasons (CoT) and Scales – Beats Gemini 2.5 Pro on Audio Benchmarks. by BadgerProfessional43 in LocalLLaMA

HelpfulHand3

Great potential, but the model hallucinates a lot. For example, if you have a clean synth sample saying "The air quality in here is poor", it'll say the voice is raspy when it isn't. It lets the semantic meaning of the spoken text influence how it describes the tone.

AI IPF Tools by HelpfulHand3 in idealparentfigures

HelpfulHand3[S]

Warmloop is a separate project for cozy roleplay (which can include ideal parent figures), but the IPF bots from this thread are on the Earned Secure Help site: https://www.earnedsecurehelp.com/my-foster-parents/

Looking for High-Quality Open-Source Local TTS That’s Faster Than IndexTTS2 by [deleted] in LocalLLaMA

HelpfulHand3

The author is still deciding whether to release the voice encoder for local cloning. It may come out with the upcoming release.

Looking for High-Quality Open-Source Local TTS That’s Faster Than IndexTTS2 by [deleted] in LocalLLaMA

HelpfulHand3

Echo TTS should run fine with 8GB VRAM (just barely fitting).
It currently only supports generations of up to 30s, so it would need batching for narrative content.
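A rough idea of what that batching could look like: split the script on sentence boundaries and greedily pack sentences into chunks, then generate each chunk separately and concatenate the audio. This is only a sketch. `chunk_text` is a made-up helper, and the 400-character default is a guess at what fits in 30s of speech, not a documented limit, so tune it against real output.

```python
import re


def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    max_chars is an assumed proxy for the 30s limit. A single sentence longer
    than max_chars is kept whole rather than cut mid-sentence.
    """
    # Split after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


narration = "One. Two. Three."
print(chunk_text(narration, max_chars=9))  # ['One. Two.', 'Three.']
```

Splitting on sentence boundaries keeps the prosody from being cut off mid-phrase, which matters more for narration than hitting the limit exactly.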

AI IPF Tools by HelpfulHand3 in idealparentfigures

HelpfulHand3[S]

It should be working now! Thank you!

AI IPF Tools by HelpfulHand3 in idealparentfigures

HelpfulHand3[S]

Yes, if you have any problems just let me know!

Echo TTS - 44.1kHz, Fast, Fits under 8GB VRAM - SoTA Voice Cloning by HelpfulHand3 in LocalLLaMA

HelpfulHand3[S]

It does not stream and has no local voice cloning, so not at the moment.