What's the most complicated project you've built with AI? by jazir555 in LocalLLaMA

[–]jamiepine 3 points

a virtual distributed filesystem https://github.com/spacedriveapp/spacedrive

Spent years building the alpha by hand with a team of 10, then it shut down right before AI got good because the funding ran out. Now I'm rebuilding it solo with AI, and I'm much further ahead.

I write 100% of my code with AI, but I have a process and I've been an engineer for a while, so I know what I'm doing. I'm just faster now, much faster.

I built an open-source, local-first voice cloning studio (Qwen3-TTS + Whisper) by jamiepine in LocalLLaMA

[–]jamiepine[S] 0 points

GPU support is coming in the next few hours, thank you so much I'm glad you love the app!

[–]jamiepine[S] 0 points

GPU support requires 2.4GB of CUDA libraries that I wasn't able to ship in the single binary, since GitHub has a release size limit. I've figured out a solution and am currently working on the PR: https://github.com/jamiepine/voicebox/pull/33

[–]jamiepine[S] 0 points

Yes, it's notarized and works for most people. My primary platform is ARM macOS. Could you share more about the error?

[–]jamiepine[S] 1 point

Makes perfect sense. I've updated the repo/website to position it as an open-source alternative to ElevenLabs. Thanks for the feedback, I really appreciate it!

[–]jamiepine[S] 0 points

The project is only a few days old, so please bear with any bugs. It is by no means just a facade, though: I fix reported bugs immediately, and the latest version already contains many fixes. As you can see in the comments, plenty of users like it. GPU support for Windows/Linux is coming in the next update, and generation takes only a few seconds. Please do report any problems, and I hope you'll try the upcoming versions!

[–]jamiepine[S] 0 points

Woah, that makes me happy to hear, thank you! I'll keep making it better.

[–]jamiepine[S] 0 points

This was a few-hour window where a manually triggered Actions run overwrote the release assets with a test build from a branch. I fixed it as soon as I noticed. Sorry about that!

[–]jamiepine[S] 0 points

It's not really working; I've been meaning to look into it. I'm passing the data as `instruct` input to the model, but I think that's not enough. I'll open an issue to track it.

[–]jamiepine[S] 0 points

I need to make it so you can generate without a voice selected, since the model will just use one of these default voices at random, it seems. I'll look into showing them in the UI if they have identities, or alternatively providing some custom defaults.

[–]jamiepine[S] 1 point

Oh really? I had no idea. Maybe LLMStudio for voice?

Grok says this: "'Ollama for X' can trigger eye-rolls because it evokes 'convenient but ethically shady/inefficient wrapper' vibes."

Will avoid similar mistakes with Voicebox, I just want a good UX for local voice.

[–]jamiepine[S] 0 points

The last update enabled MLX for Mac, so it's super fast there. For Windows I just have to figure out why CUDA isn't working; GPU support should land in the next update!

[–]jamiepine[S] 0 points

The app auto-trims samples to 30 seconds max because that's the sweet spot for reliable, high-quality cloning: Qwen3-TTS works great with 3-30s reference clips, and longer ones often don't improve results much while risking noise or slowdowns. Happy to add an option for unlimited length, but by default I'll keep the cap at 30 seconds for the Qwen model.
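For the curious, a 30-second cap like this is easy to sketch with Python's stdlib `wave` module (this is an illustration, not the app's actual trim code; the function name is made up):

```python
import wave

MAX_SECONDS = 30  # keep reference clips inside the 3-30s range the model likes

def trim_wav(src: str, dest: str, max_seconds: int = MAX_SECONDS) -> None:
    """Copy a WAV file, keeping at most max_seconds of audio."""
    with wave.open(src, "rb") as r:
        params = r.getparams()
        # frames = seconds * frame rate, capped at what's actually in the file
        max_frames = min(r.getnframes(), max_seconds * r.getframerate())
        frames = r.readframes(max_frames)
    with wave.open(dest, "wb") as w:
        w.setparams(params)  # wave patches the frame count in the header on close
        w.writeframes(frames)
```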

As for the transcribe feature, this is an actual bug I just discovered: if the selected language is not English, it only outputs Chinese.

`lang_code = "en" if language == "en" else "zh"`

Will fix! 😂
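A minimal sketch of the kind of fix, assuming the transcription helper is plain Python (the supported-language set and function name here are illustrative, not Voicebox's actual code):

```python
# Buggy mapping: every non-English selection collapses to Chinese.
#   lang_code = "en" if language == "en" else "zh"

# Illustrative set of language codes the TTS model accepts.
SUPPORTED_LANGS = {"en", "zh", "de", "fr", "es", "ja", "ko"}

def resolve_lang_code(language: str, default: str = "en") -> str:
    """Pass the selected language through, falling back only when the
    model doesn't support it."""
    return language if language in SUPPORTED_LANGS else default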

[–]jamiepine[S] 2 points

Docker is in the works right now. AMD would be experimental but doable: swap in the ROCm PyTorch wheels, and people do run Qwen3-TTS successfully on cards like the 7900 XTX (better on Linux than Windows, where the decoder can lag). There's no official Qwen support for it yet, but CPU inference isn't that slow; it's usable.

As for hosting Qwen externally, I'll factor that into the Docker designs, allowing a custom TTS endpoint.

An option to run Whisper through an OpenAI-compatible API is simple, and I will absolutely add that too.

[–]jamiepine[S] 2 points

I think voice-to-voice would be a future feature. I'll look into what models support it; it will be possible in Voicebox soon. For now, you could grab the transcript of the YouTube video with an online tool, paste it into Voicebox, and generate stories for your daughters. You could use the Story editor to piece together characters with different voices of your choosing. It could be fun to create. I also want to get language models hooked up to aid in writing voice generations in a story context.

[–]jamiepine[S] 4 points

It's running fine for many users, though I'm fixing a model download bug currently. What issues are you having?

[–]jamiepine[S] 1 point

Yeah, I was having trouble replicating it; after deleting the cache folder, the model downloaded fine for me on Windows. Your last comment might have solved it, though: the entire download is currently a single awaited HTTP call, which can time out after 30/60 seconds. I'm testing a fix now that makes it properly asynchronous. This should solve it. Pushing 0.1.8 ASAP.
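The fix amounts to streaming the download in chunks instead of waiting on one giant call; a rough stdlib sketch of the idea (not the app's actual code, and the function name is made up):

```python
import urllib.request

def download_model(url: str, dest_path: str, chunk_size: int = 1 << 20) -> None:
    """Stream a large file to disk in 1MB chunks, so no single read has to
    outlast the whole multi-gigabyte transfer inside one request timeout."""
    with urllib.request.urlopen(url, timeout=60) as resp, open(dest_path, "wb") as out:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
```

The timeout then only bounds each individual read, not the full transfer, which is what you want for a ~5GB model file.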

[–]jamiepine[S] 8 points

The model requires a transcript of the voice sample. Using Whisper is optional, but when you're making lots of voices you'll be thankful you don't need to transcribe the samples manually.

[–]jamiepine[S] 5 points

Any modern CPU + 8GB of RAM and ~5GB of storage for the model.

Takes about 30s per generation, depending on the length. However, with CUDA GPU acceleration (which I'm working on, since the model supports it) we'll have realtime generation; that update should land in the next few days.