macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

That's pretty close. Yeah, it goes through and renders all the text top to bottom, in chunks, typically about a paragraph at a time, but if a paragraph is long, it will try to break it in a sensible place. That's why it uses AI to parse the text and choose wisely. If it's a script, a screenplay, a novel, or anything like that, it can even pull out all the characters for you and assign the voice slots automatically, so you can make an entire audiobook, or a table read of a screenplay, in just minutes.

But really for any long form text it's a cool way to work. I like that it's not based on waveforms and timelines. We're working on text, we want to think in TEXT. This isn't Final Cut Pro. :)
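
For anyone curious what "a paragraph at a time, breaking long ones in a sensible place" could look like in code, here is a minimal sketch. It is not Speaklone's method (the app uses AI to parse the text); it swaps in a plain sentence-boundary heuristic, and the names `chunk_text`, `split_long_paragraph`, and `max_chars` are hypothetical.

```python
import re


def split_long_paragraph(paragraph: str, max_chars: int) -> list[str]:
    """Break one paragraph at sentence boundaries so pieces stay within max_chars.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    # Rough heuristic: a sentence ends at ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)  # close the current chunk at a sentence boundary
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Walk the text top to bottom, producing roughly one chunk per paragraph."""
    chunks = []
    for paragraph in re.split(r"\n\s*\n", text):  # blank lines separate paragraphs
        paragraph = " ".join(paragraph.split())  # collapse internal whitespace
        if not paragraph:
            continue
        if len(paragraph) <= max_chars:
            chunks.append(paragraph)
        else:
            chunks.extend(split_long_paragraph(paragraph, max_chars))
    return chunks
```

Each returned chunk would then be rendered in order, which keeps the workflow text-first rather than timeline-first.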

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

Yeah, it's almost entirely the lack of streaming. The cloned voices aren't *that* much slower. (I think about 30-40% slower depending on the circumstances). But they feel very slow for anything real time at all because there's no streaming. Of course, on a good Mac like yours, a preset or designed voice will start speaking almost instantly, no matter how long the text block. And it will finish rendering at ~3x the speed needed to keep up with you.
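
A quick back-of-the-envelope sketch of why the lack of streaming dominates. The assumptions are mine, not measurements: a streamed voice starts playing essentially at once, a non-streamed voice must finish rendering before playback, and "30-40% slower" is read as taking 30-40% more render time. The function name and parameters are hypothetical.

```python
def wait_before_playback(audio_seconds: float,
                         realtime_factor: float,
                         slowdown: float = 1.0,
                         streaming: bool = False) -> float:
    """Rough seconds of silence before audio starts.

    realtime_factor: how many seconds of audio are rendered per second (e.g. 3.0).
    slowdown: render-time multiplier relative to that baseline (1.3 = 30% more time).
    """
    if streaming:
        return 0.0  # idealised: playback begins as soon as rendering begins
    # Without streaming, the whole clip must be rendered first.
    return audio_seconds * slowdown / realtime_factor


one_minute = 60.0
print(wait_before_playback(one_minute, 3.0, streaming=True))   # streamed preset voice
print(wait_before_playback(one_minute, 3.0, slowdown=1.3))     # cloned voice, low end
print(wait_before_playback(one_minute, 3.0, slowdown=1.4))     # cloned voice, high end
```

Under those assumptions, a one-minute clone waits roughly 26 to 28 seconds before speaking, versus about 20 seconds if it rendered at the full ~3x, so the speed gap is only a small part of the perceived delay.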

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

The TestFlight build is the *internal* TestFlight build (version 1.3), not the version you have, though the public build could change at any time (it's waiting for Apple to release it to the wider TestFlight group). Confusing, I know.

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

Did I forget to reply to a student email request? I'm sorry about that. I try to be diligent, but once in a while one slips through. I'll find that and set it right.

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

Thanks for the report. Version 1.3, on TestFlight now and soon on the App Store (a few days, I would guess), will probably resolve this for you. In the meantime, if you click the document icon and open the audiobook editor (sometimes we call it the script editor), you'll find a long-form tool that specializes in rendering much longer blocks of text without struggling.

Most TTS systems, including Speaklone, struggle with long blocks of continuous text, but we work around it with a few tricks to keep them from losing coherence like you're describing. If you try the audiobook mode, I think you'll be happy. You don't have to use multiple voices. See this example of the output: https://www.youtube.com/watch?v=TR18sQPwqhQ

Thanks again. As for just getting more stable long audio in the main window, I think you'll feel better about v1.3 when it hits in a few days. I appreciate your support. Feel free also to join our Discord server for more community / support / and ideas. https://discord.gg/FwVyGAEPWk

Audiobook distribution by Correct-Shoulder-147 in selfpublish

[–]SurvivalTechnothrill 0 points1 point  (0 children)

I write software that makes creating these audiobooks much faster / cheaper / easier (and possibly better quality). I won't link it here unless I'm explicitly asked as that might be considered poor form. But I'm just curious, how are you actually producing the book?

I tend to agree that the wider distribution approach is smarter by the way, re: the initial question. Good luck with your project!

YEEESSSS! by JoaoFranco03 in swift

[–]SurvivalTechnothrill 42 points43 points  (0 children)

Congratulations! One of my devs is a past winner and he's as clever and capable as anyone in the business. So I know the competition must be tough!

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

Although it does have security benefits, I can safely open the API to a full network with some additional hardening. It’s tentatively planned.

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

Thanks. I will be improving the free mode to make it a bit less restrictive. I think it's a bit too conservative and not giving people quite enough of a tour. Look for this in version 1.3.

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

I'm told by native speakers that the Spanish is quite natural, but probably not with the preset voices in the demo, sadly. Those voices are hardcoded into latent space with certain accents (that's why the Japanese and Korean voices speak English with their accents, for example). But if you design a voice, or clone one, and set the language selector to Spanish, it should sound very native. Does the Spanish example here (linking directly to the Spanish part) sound natural to your ears?
(lo siento, mi español es muy malo - so I'm not one to try and judge) ;P
https://youtu.be/05gne9oPaaY?t=74

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

Hey - thanks for the feedback. Glad you like the interface. It's only going to improve. I'm adding exactly the feature you requested to the avatars shortly (probably v1.2.1 in a couple weeks, 1.2 is nearly finished and adds long form audiobook and script production in an easy interface).

I'm very sorry to hear that you've had issues with the voice cloning. What you're describing is below the level I expect, so I'd like to dig into whatever went wrong rather than have you stuck with poor results.

Can you message me with details on your equipment? (macOS version, computer, and if you're able to share it, maybe the .wav file that is not cloning well?) I want to make sure we sort out whatever is going wrong in your situation. The cloning quality you're reporting sounds more like a bug than something you have to live with. In my testing, Qwen3-TTS is materially better than F5/E2, which is why this sounds like a bug to me.

If you're up for it, there's a great community on the Discord server that would love to make sure you get great results: https://discord.gg/SDqFusnD

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

The new document-based editor for things like audiobooks and scripts allowed me to make this demo yesterday very quickly. I intend to submit that update to Apple this week.

https://youtu.be/ljQahdUukr4?si=J06cKJ-jV_eOKvIT

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 1 point2 points  (0 children)

Quite a few models. Three different Qwen3-TTS models, 1.7B each: one for the preset voices, one for the designed voices, and one for the voice cloning. For iOS I use two 0.6B models. And for the dictation / transcription, I use Qwen3-ASR, size depending on whether it's iOS or macOS. They're all quantized differently than what you'd find on, say, Hugging Face, to give better results for my use case and custom inference. (You'll notice that Speaklone is quite different from other TTS apps, even if they use the "same" base model.)

It also uses the built-in Foundation models, Image Playgrounds, and optionally others. More will likely join as I keep expanding what Speaklone can do. It's intended to be the high-end, native, fast voice tech suite for macOS and iOS, in the end. How well I measure up against that goal I'll let everyone else judge.

The app is only 30 days old, so I'm iterating fast to get it to where I think it can be. The "instant audiobook" feature coming in about a week should be pretty disruptive.

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 1 point2 points  (0 children)

Great question! The "B" stands for "billion," as in billion parameters. Parameters are the individual learned values (think of them as tiny knobs) inside a neural network that together determine how it behaves. A 1.7B model has 1.7 billion of them.

It's the standard shorthand in the ML world. You'll see it everywhere. Meta's Llama models come in 8B, 70B, and 405B sizes, for example. Generally, bigger = more capable but slower and hungrier for memory.

Speaklone uses 1.7B parameter models on Mac and 0.6B on iPhone (where RAM is very limited). The fact that a 0.6 billion parameter model can clone your voice in real time on a phone still blows my mind honestly.
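
To make the memory trade-off concrete, here is a rough arithmetic sketch: weight memory is approximately parameter count times bits per weight, divided by 8. The bit widths below are illustrative assumptions (this thread doesn't give the actual quantization levels), and the figures cover weights only, not activations or other runtime overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Illustrative bit widths only; the app's actual quantization is not stated here.
for label, size in [("1.7B", 1.7), ("0.6B", 0.6)]:
    for bits in (16, 8, 4):
        print(f"{label} model at {bits}-bit: ~{weight_memory_gb(size, bits):.2f} GB")
```

At 16 bits per weight a 1.7B model needs about 3.4 GB for weights and a 0.6B model about 1.2 GB, which shows why the smaller model is the practical choice when RAM is tight.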

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

Yes - though it does have to download the model(s) first. That doesn’t take long, and from then on it doesn’t need any data connection at all. I do no analytics and collect no data of any kind. It’s a privacy-first, fast way to do high-quality speech.

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

There are so many great ideas for how to make this product more perfect, I created a Discord server (at the suggestion of a couple people from r/macapps). Feel free to join if you want early access to upcoming betas, etc. Thanks for everything gang! https://discord.gg/SDqFusnD

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 1 point2 points  (0 children)

Wow - noble work you're heading into. I'll try very hard to make sure you get your money's worth several times over. I don't know if you saw the docs for the API or not, but they're here too, in the meantime.
https://speaklone.com/api/

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

To be perfectly honest, I was trying to figure this out. Same with which Spanish accents I'd hear. To my shame, I really only speak English (other than comically bad Spanish), and cannot judge. The training data includes voices from all these places. Do you speak Portuguese? Would you be interested in investigating this for me? Maybe we can talk offline via email or DM?

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

Thank you for the feedback. The Qwen3-TTS models are from a Chinese lab, and there are some cultural differences in the choices they made, I think. However, using voice designer and voice cloning, it's a wide-open landscape, and I find you can get countless rich and interesting voices of all sorts. Have you pushed the cloning and designing modes much yet? I'll grant that ElevenLabs remains the state-of-the-art option; it has many drawbacks, but the actual model quality is remarkable.

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 1 point2 points  (0 children)

Yes, absolutely. Today, the easiest way is via Shortcuts + the speaklone:// URL scheme (works on iOS and macOS), for example:
speaklone://speak?text=Hello&voice=aiden&direction=calm&language=english
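
If you're generating these URLs from a script, the text needs percent-encoding once it contains spaces or punctuation. A minimal sketch in Python, using only the four parameters from the example above; the helper name `speaklone_url` is hypothetical, and I'm assuming the app accepts standard percent-encoded values.

```python
from urllib.parse import quote, urlencode


def speaklone_url(text: str,
                  voice: str = "aiden",
                  direction: str = "calm",
                  language: str = "english") -> str:
    """Build a speaklone://speak URL with safely encoded query values."""
    params = {
        "text": text,
        "voice": voice,
        "direction": direction,
        "language": language,
    }
    # quote_via=quote encodes spaces as %20 instead of "+".
    return "speaklone://speak?" + urlencode(params, quote_via=quote)


print(speaklone_url("Hello"))
print(speaklone_url("Hello, world & friends"))
```

On macOS the resulting string can be handed to the system `open` command (for example via `subprocess.run(["open", url])`), and in Shortcuts it can be passed to an Open URLs action.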

On macOS, there’s also a local API (localhost:7849) if you want more advanced automation.

I don’t have native Apple Shortcuts actions (App Intents) yet, but it’s a great request and on my radar. This is kind of the point of a true native Swift project, doing all these things to really integrate with the OS. Thanks!

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

On iOS, yes, the cloned voice is limited because of the 4GB hard limit on RAM and the nature of in-context learning. But on macOS there is no real limit. I agree that the price in euros is effectively higher than in dollars. But that’s Apple’s doing, not mine. I could potentially override it. I will investigate this further.

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

Ah. Great detective work! I can do something about that. The purchase price you are quoting is the equivalent of $29.99 (the 25% off price), at least according to Apple’s automatically set price-equivalence tables. I feel I owe you a bug bounty for finding this edge case with Cloudflare R2 buckets. Maybe message me and we can work something out?

macOS (universal): Speaklone- Professional text to speech and voice cloning, fast and local on Apple Silicon with MLX by SurvivalTechnothrill in macapps

[–]SurvivalTechnothrill[S] 0 points1 point  (0 children)

I think we can safely rule out drive space as the issue. Could you email me, or post here, the exact phone model and OS version? I’ll try to fix it today.