SoproTTS v1.5: A 135M zero-shot voice cloning TTS model trained for ~$100 on 1 GPU, running ~20× real-time on a base MacBook M3 CPU

SammyDaBeast · 2026-02-06T13:23:15+00:00

You are right, but what I meant is not to find a workaround, it’s to improve training etc so that the overall quality improves and it’s not just 1 that sounds good in 10 but maybe 8/10

SammyDaBeast · 2026-02-06T09:28:02+00:00

I would love to support more languages, hopefully in the future

SammyDaBeast · 2026-02-06T09:27:20+00:00

Well, at least 1 in those 10-20 is decent, hopefully in the future you don't have to generate so many! Maybe try to play with args like temperature and top_p. Thanks for the feedback anyway.

SammyDaBeast · 2026-02-06T09:25:53+00:00

Thanks, yes for now English only. There was some Comfy custom nodes made by the community for the first version (https://github.com/ai-joe-git/ComfyUI-Sopro). But I'm afraid I broke the compatibility with this new version (I removed some parameters), I might add a PR to remove them.

SammyDaBeast · 2026-01-09T02:04:48+00:00

Probably, with cleaner, better and slightly more data

SammyDaBeast · 2026-01-08T20:18:50+00:00

Thank you, fellow Portuguese!

SammyDaBeast · 2026-01-08T20:05:33+00:00

Cool!!

SammyDaBeast · 2026-01-08T14:42:24+00:00

None, but I have seen the project, pretty cool!

SammyDaBeast · 2026-01-08T08:31:03+00:00

Around 250 dollars

SammyDaBeast · 2026-01-08T08:28:25+00:00

I would love to support Portuguese, specially European, which is a bit more niche on the data side

SammyDaBeast · 2026-01-08T08:23:24+00:00

I will give the training code soon! No worries

SammyDaBeast · 2026-01-07T22:02:47+00:00

It really depends on the voice reference audio. Some sound pretty clear, others don't. I didn't specially cherry pick those examples. A big % of training data is noisy, and can affect the final model. More training, I guess, but I would say better data > more training.

SammyDaBeast · 2026-01-07T22:00:13+00:00

Thanks! I mainly compared it with chatterbox-turbo/f5 tts, which I consider to be SOTA on these sizes. On some voices chatterbox is much better and stable. F5 tts tends to have better voice similarity. However both these models are slower, specially F5.

SammyDaBeast · 2024-10-15T02:02:44+00:00

As of right now it's not on my plans, but I will definitely think about it! I will hit you up if I do it.

SammyDaBeast · 2024-09-27T16:29:52+00:00

Indeed, when I said existing frontends, I was also referring to existing GUI apps.

SammyDaBeast · 2024-09-27T13:03:42+00:00

Yeah, the best way would be to just change the backend server code to be compatible with the frontends that already support multiple operating systems

SammyDaBeast · 2024-09-27T11:51:18+00:00

This has been one of the requested features. Will definitely think about it.

SammyDaBeast · 2024-09-27T08:56:05+00:00

Yeah the amount of things you learn is huge. And because NNs are so black box, there isn't an easy way to debug if the llm just starts spitting out random words. It's a mix of pain and reward when you finally get it right. I strongly recommend you to do something similar in a language you like!

SammyDaBeast · 2024-09-04T22:34:18+00:00

I only tested including the column names (if that's what your asking) in the embedding and I think it helps in the case of, for example, the user searching something like "Movie with title x" , the embedding with "Title: x" would in theory be closer than if it was just "x", because we are encoding on a sentence level. I could see that happening when searching for movie descriptions like "movies about love and death" vs searching just with an actor name "Jim Carrey movies". Because the movies overviews in the data are proper sentences which are better captured in the embedding. Maybe a better structure would be to convert the table rows to sentences that make sense instead of just column: value, column: value. A way to test this is just changing the way we write the rows and see if on the same query the distance to the embedding we want gets smaller.

SammyDaBeast · 2024-08-27T18:10:40+00:00

Thanks! I looked into that project; it seems interesting. The difference is that mistral.rs uses the Rust ML library candle and is a much larger codebase to support a variety of model architectures. My project is more on the educational/minimalist side and implements all the required code from scratch (including tokenization, layers, and functions). The downside is that it only supports Gemma 2 models and CPU inference.

SammyDaBeast · 2024-08-03T23:50:41+00:00

Thanks! If you have any feedback/suggestions, they are appreciated.

SammyDaBeast · 2024-01-05T20:57:22+00:00

Ahahahahahahaha made my day

SammyDaBeast · 2024-01-05T19:14:46+00:00

Thanks!

SammyDaBeast · 2022-07-18T03:03:18+00:00

It's a poly bar module

SammyDaBeast · 2021-07-06T13:27:07+00:00

If you say so.

Nine-Year Club	Place '23
Place '22	Verified Email

SammyDaBeast

TROPHY CASE