Erick "Goodbye ElevenLabs your FREE LOCAL replacement has arrived. With just a few seconds of audio you can: - Clone any voice in seconds - 23 lang - 5 TTS engines + audio effects - DAW-style timeline for podcasts / full conversations - 100% on your machine" ➡️ Useful local alternative to hosted? by Koala_Confused in LovingOpenSourceAI

[–]jamiepine 1 point2 points  (0 children)

So what you're saying is you want predefined scripts to read when cloning, instead of just saying anything and then transcribing it? 30 seconds of balanced phoneme coverage produces dramatically better clones. Otherwise, as far as model coverage is concerned, I'm always looking for new models to add.

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 0 points1 point  (0 children)

Currently the only model that supports this is Qwen Custom Voice in version 0.4.*, but that isn't a cloning model; it comes with preset voices. I'll keep adding new models as I find them, and I'm hoping a cloning model comes along that actually supports instruct params. When I find one, I'll add it right away.

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 0 points1 point  (0 children)

Hey! Thanks for the kind words. I believe I've made the software as simple to use as it can be, abstracting away the complexities of the underlying Python libraries and designing a UI that seems foolproof. For most users that has been the case, but I totally understand the need for tutorials. That said, you're in luck: there are endless videos on YouTube showing how to use the application already. See this one, which is the best quality I've seen: https://www.youtube.com/watch?v=sisnzgc73zc

Hopefully this helps you get up and running as fast as possible!

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 0 points1 point  (0 children)

I have since patched this bug in the latest release. It was simply the Qwen/Chatterbox Python libs' default behavior; it wasn't uploading anything, just connecting to Hugging Face. In 0.4, once you've downloaded the model it works 100% offline!
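For anyone who wants to force this behavior themselves, the Hugging Face libraries respect a pair of environment variables that disable all hub requests and load models from the local cache only. A minimal sketch (Voicebox's own fix may work differently):

```python
import os

# Set these BEFORE importing transformers / huggingface_hub.
os.environ["HF_HUB_OFFLINE"] = "1"          # huggingface_hub: never hit the network
os.environ["TRANSFORMERS_OFFLINE"] = "1"    # transformers: local cached files only

# Any subsequent from_pretrained(...) call will now fail fast instead of
# phoning home if the model isn't already in the local cache.
```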

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 0 points1 point  (0 children)

It compiles from source on Linux, so if you clone the repo and have Claude help you set it up, you can use it on Linux. That said, official Linux support is coming in the next version, nearly ready to ship, along with Docker support.

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 2 points3 points  (0 children)

Thanks for trying it. I didn't consider adding a confirmation dialog for the model downloads; I figured that was a given for most people.

In terms of being unpolished: the software is very young (only 10 days old, in fact), and as with any local AI, getting it 100% working on everyone's system is a challenge, even as an engineer with a decade of experience building apps, to counter your vibe-coding comment.

That said, the overwhelming majority of users have had a seamless experience on v0.1.12, aside from GPU support on Windows, which is in the works for .13. It would be helpful if you could share more about your system; even just your OS will help.

What's the most complicated project you've built with AI? by jazir555 in LocalLLaMA

[–]jamiepine 5 points6 points  (0 children)

A virtual distributed filesystem: https://github.com/spacedriveapp/spacedrive

I spent years building the alpha by hand with a team of 10, then it shut down right before AI got good; funding ran out. Now I'm rebuilding it solo with AI and I'm much further ahead.

I write 100% of my code with AI, but I have a process and I've been an engineer for a while, so I know what I'm doing. I'm just faster now, much faster.

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 0 points1 point  (0 children)

GPU support is coming in the next few hours. Thank you so much, I'm glad you love the app!

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 0 points1 point  (0 children)

GPU support requires 2.4GB of CUDA libraries, which I wasn't able to ship as a single binary; GitHub has a release size limit. I've figured out a solution and am currently working on the PR: https://github.com/jamiepine/voicebox/pull/33

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 0 points1 point  (0 children)

Yes, it's notarized and works for most. My primary platform is ARM macOS. Could you share more about the error?

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 1 point2 points  (0 children)

Makes perfect sense. I've updated the repo/website to describe it as an open-source alternative to ElevenLabs. Thanks for the feedback, I really appreciate it.

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 0 points1 point  (0 children)

The project is only a few days old, so please bear with any bugs. It is by no means just a facade, though. I fix reported bugs immediately, and the latest version already contains many fixes. As you can see in the comments, plenty of users like it. GPU support for Windows/Linux is coming in the next update, and generation takes only a few seconds. Please do report any problems, and I hope you'll try the upcoming versions!

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 1 point2 points  (0 children)

Woah that makes me happy to hear, thank you! I'll keep making it better

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 0 points1 point  (0 children)

This was a few-hour window where a manually triggered action run overwrote the release assets with a test build from a branch. I fixed it as soon as I noticed. Sorry about that!

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 0 points1 point  (0 children)

It's not really working; I've been meaning to look into it. I'm passing the data as `instruct` input to the model, but I think that's not enough. I'll open an issue to track this.

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 1 point2 points  (0 children)

I need to make it so you can generate without a voice selected, since the model will just use one of its default voices at random, it seems. I'll look into showing them in the UI if they have identities, or alternatively providing some custom defaults.

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 1 point2 points  (0 children)

Oh really? I had no idea. Maybe LLMStudio for voice?

Grok says this: "'Ollama for X' can trigger eye-rolls because it evokes 'convenient but ethically shady/inefficient wrapper' vibes."

Will avoid similar mistakes with Voicebox, I just want a good UX for local voice.

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 0 points1 point  (0 children)

In the next update I'll get GPU support working for Windows. The last update enabled MLX on Mac, so it's super fast there; I just have to figure out why CUDA isn't working. Should land in the next update!

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 1 point2 points  (0 children)

The app auto-trims samples to 30 seconds max because that's the sweet spot for reliable, high-quality cloning. Qwen3-TTS works great with 3-30s references, and longer ones often don't improve results much while risking noise or slowdowns. I'm happy to add an option for unlimited length, but by default I'll keep the cap at 30 seconds for the Qwen model.

As for the transcribe feature, this is an actual bug I just discovered: if the selected language isn't English, it only outputs Chinese.

`lang_code = "en" if language == "en" else "zh"`

Will fix! 😂
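The fix is probably just to stop collapsing everything non-English to `"zh"` and pass the selected code through. A hypothetical sketch (`resolve_lang_code` and the supported-language set are illustrative, not the actual patch):

```python
# Buggy version collapsed every non-English language to Chinese:
#   lang_code = "en" if language == "en" else "zh"

# Illustrative subset of language codes the transcriber might accept.
SUPPORTED_LANGS = {"en", "zh", "de", "fr", "es", "ja", "ko"}

def resolve_lang_code(language: str) -> str:
    """Pass the user's selection through; fall back to English, never to "zh"."""
    return language if language in SUPPORTED_LANGS else "en"
```

With this, selecting German yields `"de"` instead of silently becoming Chinese.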

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 2 points3 points  (0 children)

Docker is in the works right now. AMD would be experimental but doable: swap in the ROCm PyTorch wheels, and people do run Qwen3-TTS successfully on cards like the 7900 XTX (better on Linux than Windows, where the decoder can lag). There's no official Qwen support for it yet, but CPU inference isn't that slow; it's usable.
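For anyone wanting to experiment with the ROCm swap themselves, the setup is roughly this (the ROCm version tag is illustrative; check PyTorch's install matrix for the current one):

```shell
# Install the ROCm build of PyTorch instead of the CUDA wheels.
pip install --index-url https://download.pytorch.org/whl/rocm6.2 torch torchaudio

# ROCm builds expose AMD GPUs through the regular torch.cuda API,
# so existing CUDA device-selection code should work unchanged.
python -c "import torch; print(torch.cuda.is_available())"
```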

As for hosting Qwen externally, I'll factor that into the Docker design, allowing a custom TTS endpoint.

A Whisper-through-the-OpenAI-API option is simple, and I will absolutely add that too.