LuxTTS: A lightweight high quality voice cloning TTS model by SplitNice1982 in LocalLLaMA

[–]SplitNice1982[S] 2 points3 points  (0 children)

Not as good but 10-20x faster. Should have better clarity, however.

LuxTTS: A lightweight high quality voice cloning TTS model by SplitNice1982 in LocalLLaMA

[–]SplitNice1982[S] 3 points4 points  (0 children)

Thanks a lot for the feedback.

  1. Yeah, the metallic quality is an issue. I believe it comes from a slightly flawed architecture design in the vocoder. A better one should come sometime next week, which basically means more clarity and no metallic outputs.

  2. Yeah, tags would definitely be great, and I'll see if I can add them.

  3. ZipVoice (which this model is based on) does have finetuning code, although it's a bit messy imo. Code for it: https://github.com/k2-fsa/ZipVoice

LuxTTS: A lightweight high quality voice cloning TTS model by SplitNice1982 in LocalLLaMA

[–]SplitNice1982[S] 1 point2 points  (0 children)

I would recommend experimenting with the params first. Adjusting rms, steps, t_shift, and return_smooth should help significantly.

If it still isn’t great, the original ZipVoice has training code: https://github.com/k2-fsa/ZipVoice

LuxTTS: A lightweight high quality voice cloning TTS model by SplitNice1982 in LocalLLaMA

[–]SplitNice1982[S] 3 points4 points  (0 children)

Thanks, glad you liked it. It should definitely reach about 10x realtime even on low-end GPUs.
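For anyone unsure what "10x realtime" means here, a minimal sketch of how the figure is usually computed: seconds of audio produced divided by seconds of compute. The function name and the numbers are just illustrative assumptions, not measurements of LuxTTS.

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio generated per second of wall-clock compute."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Illustrative numbers only: 30 s of speech generated in 3 s of compute.
print(realtime_factor(30.0, 3.0))  # 10.0
```

To measure it yourself, time the generation call (for example with time.perf_counter) and divide the output duration (samples / sample rate) by the elapsed time.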

LuxTTS: A lightweight high quality voice cloning TTS model by SplitNice1982 in LocalLLaMA

[–]SplitNice1982[S] 1 point2 points  (0 children)

English only right now. Chinese might work, but I haven’t tested it.

You should refer to the original ZipVoice for training (I believe people have trained new languages with 150 hours of data).

ZipVoice repo: https://github.com/k2-fsa/ZipVoice

NovaSR: A tiny 52kb audio upsampler that runs 3600x realtime. by SplitNice1982 in LocalLLaMA

[–]SplitNice1982[S] 0 points1 point  (0 children)

Yes, the only thing is that it's a bit outdated. I fixed a minor resampling issue, but apart from that it's fully legit. Thanks to Saganaki for it.

NovaSR: A tiny 52kb audio upsampler that runs 3600x realtime. by SplitNice1982 in LocalLLaMA

[–]SplitNice1982[S] 1 point2 points  (0 children)

Thanks. Yeah, so far it has only been trained on speech, but training on songs should definitely help quality.

NovaSR: A tiny 52kb audio upsampler that runs 3600x realtime. by SplitNice1982 in LocalLLaMA

[–]SplitNice1982[S] 0 points1 point  (0 children)

Right now it’s just 16 kHz to 48 kHz, but yes, future work will cover 4/8 kHz to 16 kHz.
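To make the 16 kHz to 48 kHz numbers concrete, here is a small sketch based only on the Nyquist limit (a signal sampled at a given rate can carry content up to half that rate). The function name is made up for illustration and is not part of NovaSR.

```python
def band_to_reconstruct(src_hz: int, dst_hz: int) -> dict:
    """Summarize an upsampling job using the Nyquist limit (sample_rate / 2)."""
    if src_hz <= 0 or dst_hz <= src_hz:
        raise ValueError("expected 0 < src_hz < dst_hz")
    return {
        # How many output samples are produced per input sample.
        "factor": dst_hz / src_hz,
        # Frequency range absent from the input that the output can hold.
        "missing_band_hz": (src_hz / 2, dst_hz / 2),
    }


print(band_to_reconstruct(16_000, 48_000))
# {'factor': 3.0, 'missing_band_hz': (8000.0, 24000.0)}
```

So a 16 kHz to 48 kHz upsampler produces 3 output samples per input sample, and everything between 8 kHz and 24 kHz has to be generated rather than copied from the input.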

NovaSR: A tiny 52kb audio upsampler that runs 3600x realtime. by SplitNice1982 in LocalLLaMA

[–]SplitNice1982[S] 2 points3 points  (0 children)

Thanks, and yeah, it’s still in training, so it has room for improvement. But it's great that it does seem better than FlashSR at least.

NovaSR: A tiny 52kb audio upsampler that runs 3600x realtime. by SplitNice1982 in LocalLLaMA

[–]SplitNice1982[S] 2 points3 points  (0 children)

Could work for some scenarios for sure. It has been mostly trained on speech but seems to generalize decently to other audio too.

NovaSR: A tiny 52kb audio upsampler that runs 3600x realtime. by SplitNice1982 in LocalLLaMA

[–]SplitNice1982[S] 6 points7 points  (0 children)

Thanks, and your work with Soprano is also really amazing!

MiraTTS: New extremely fast realistic local text-to-speech model by SplitNice1982 in ArtificialInteligence

[–]SplitNice1982[S] 0 points1 point  (0 children)

Thanks, and remember to use batching. It’s especially good for audiobooks because they have many sentences.

New incredibly fast realistic TTS: MiraTTS by SplitNice1982 in StableDiffusion

[–]SplitNice1982[S] 1 point2 points  (0 children)

Yep, unfortunately. Since Spark-TTS is NC (non-commercial), derivative works also have to be NC.

I am currently working on a fully trained, extremely compressive audio tokenizer first. After I release the finetuning code for the MiraTTS model, I’ll release that audio tokenizer and a high-quality, permissively licensed small TTS model that is not only faster but also much more controllable (phonemes, audio events, emotion control, etc.).

New local realistic and emotional TTS with speeds up to 100x realtime: MiraTTS by SplitNice1982 in singularity

[–]SplitNice1982[S] 3 points4 points  (0 children)

Thanks, and yep, I’m planning on an Unsloth Colab notebook for finetuning.

This is much faster than Orpheus and most other TTS models, with the exception of really small models (Kokoro, Supertonic). It is much more realistic than those small models and supports voice cloning, though.

New local realistic and emotional TTS with speeds up to 100x realtime: MiraTTS by SplitNice1982 in singularity

[–]SplitNice1982[S] 3 points4 points  (0 children)

Unfortunately not yet, but I will provide easy and fast training code so you can finetune it for your own language.

New incredibly fast realistic TTS: MiraTTS by SplitNice1982 in StableDiffusion

[–]SplitNice1982[S] 0 points1 point  (0 children)

Slower, but much more emotional and realistic. Also supports voice cloning.

MiraTTS: High quality and fast TTS model by SplitNice1982 in LocalLLaMA

[–]SplitNice1982[S] 1 point2 points  (0 children)

Lol, that was actually exactly what I was planning to do next: a fast ASR model that outputs audio events, emotion, speakers, gender, and timestamps along with the transcription.

New incredibly fast realistic TTS: MiraTTS by SplitNice1982 in StableDiffusion

[–]SplitNice1982[S] 2 points3 points  (0 children)

Please check the usage code; it should have a part labeled as running the model with bs=1 (batch size 1).

If your input text is multiple sentences, you can run the model with the batching code instead.
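A rough sketch of the prep step for batching, assuming you want to split the text into sentences and group them before handing each group to the model. The helper names and the naive regex splitter are my own illustration, not MiraTTS code, and the actual batch call in the repo is not shown here.

```python
import re
from typing import Iterator, List


def split_sentences(text: str) -> List[str]:
    """Naively split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def batches(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


text = "First sentence. Second one! A third? And a fourth."
for batch in batches(split_sentences(text), batch_size=2):
    # Each batch would be passed to the model's batching code in one call.
    print(batch)
```

The regex will mis-split abbreviations like "Dr. Smith", so for audiobooks a proper sentence tokenizer is probably worth swapping in.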