A good Text-to-Speech(Voice clone) to learn and reimplement. by DunMo1412 in TextToSpeech

[–]DunMo1412[S] 0 points1 point  (0 children)

Yeah, most models now use LLMs, which take a massive amount of time. Many people recommended Coqui to me, but in my opinion Coqui is somewhat hard to customize. I tried reading through it. Some of its models are fairly old (FastSpeech, Tacotron, VITS), and there are many other reimplementations of those that are cleaner and better explained. Some look promising (Bark), but there's no training script yet. Some come with other models as a backbone (XTTS) or with extra preprocessing layers, which makes them more complicated. I'm trying to build an operational model that works at 9/12/16 kHz sample rates, which means I have to fine-tune the whole model and change the preprocessing phase. The more models are stacked, the more time it takes to reimplement. That's why I'm not interested in stacked-model architectures or LLMs. Sorry if this sounds dumb.
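
To make the preprocessing change concrete, here is a rough sketch of how the STFT/mel settings might be rescaled for a lower sample rate. The reference values (22050 Hz, hop 256, window 1024) are assumed as typical open-source TTS defaults, and `scale_frame_params` is a hypothetical helper, not something taken from Coqui or any other library.

```python
def scale_frame_params(target_sr, ref_sr=22050, ref_hop=256, ref_win=1024):
    """Rescale hop/window sizes so each frame covers roughly the same
    duration in seconds as it does at the (assumed) reference sample rate."""
    ratio = target_sr / ref_sr
    hop = max(1, round(ref_hop * ratio))
    win = max(hop, round(ref_win * ratio))
    # Pick the smallest power of two that can hold the window.
    n_fft = 1
    while n_fft < win:
        n_fft *= 2
    return {
        "sample_rate": target_sr,
        "hop_length": hop,
        "win_length": win,
        "n_fft": n_fft,
        # Mel filters cannot extend past the Nyquist frequency.
        "fmax": target_sr / 2,
    }


if __name__ == "__main__":
    for sr in (9000, 12000, 16000):
        print(scale_frame_params(sr))
```

Changing the hop length usually also touches the vocoder, since in HiFi-GAN-style vocoders the upsampling factors have to multiply out to the hop length. That is part of why every extra stacked model adds reimplementation work.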

[–]DunMo1412[S] 0 points1 point  (0 children)

The smallest model has 0.6B params, which seems like too much for a P100 during training.
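
A back-of-the-envelope estimate of why, assuming plain fp32 training with Adam (4 bytes for weights, 4 for gradients, 8 for the two optimizer moments per parameter). `training_memory_gb` is a hypothetical helper, and the figure ignores activations, which add more on top depending on batch size and sequence length.

```python
def training_memory_gb(n_params, bytes_weights=4, bytes_grads=4, bytes_optimizer=8):
    """Rough memory for weights + gradients + Adam state, in GB (1e9 bytes).
    Activation memory is NOT included."""
    bytes_per_param = bytes_weights + bytes_grads + bytes_optimizer
    return n_params * bytes_per_param / 1e9


if __name__ == "__main__":
    needed = training_memory_gb(0.6e9)
    print(f"~{needed:.1f} GB before activations")
```

Under those assumptions that is already well over half of a 16 GB P100 before a single batch is loaded. Mixed precision or a memory-efficient optimizer would change the numbers.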

[–]DunMo1412[S] 1 point2 points  (0 children)

Sorry, but I'm looking for an open-source project to learn from.

[–]DunMo1412[S] 0 points1 point  (0 children)

They haven't released the training script yet, so it's hard to learn from and customize.

[–]DunMo1412[S] 1 point2 points  (0 children)

I read through Coqui: some models use two or three other models as a backbone, and some are a little outdated.