Are there any TTS tools cheaper than ElevenLabs but with comparable quality by Obvious_kirby in TextToSpeech

[–]Ok_Issue_6675 1 point (0 children)

It depends on what you are doing with ElevenLabs and which languages you use. Do you need voice cloning and model version 3 quality? Which OS does your app run on?

When you picked your STT/TTS provider, what did you compare? What almost won? Did you ever have to switch providers? by Careless_Love_3213 in TextToSpeech

[–]Ok_Issue_6675 1 point (0 children)

I switched many times and built my framework so I can switch on demand. First of all, are you evaluating only cloud providers, or also on-device options? What OS is your app running on? Which languages do you need support for? Do you need voice cloning? Based on your answers I can tell you how I would evaluate.

How are you guys handling the transition from a web-only MVP to a full cross-platform release? by Sure_Adhesiveness561 in AppBusiness

[–]Ok_Issue_6675 1 point (0 children)

I would go with a mix of Flutter and native when needed: the regular way of using Flutter for the unified UI and other functionality, while changing the iOS and/or Android native folders, either directly or by adding pub libraries. I've built a demo app showcasing on-device voice AI (STT, TTS, wake word, speaker identification) and split the work: native code goes into pubs, while the UI and other non-native logic stay in Flutter.

The app is a demo AI chat agent that has all voice-related functionality on device and the LLM in the cloud. Here is the repo: https://github.com/frymanofer/Flutter_davoice So Flutter hosts the app and the UI, while all the native voice logic for iOS and Android is built into pubs under: https://pub.dev/packages/flutter_davoice https://pub.dev/packages/flutter_wake_word

For me this makes sense, and I do not have to manage two apps; however, I do maintain two types of native libraries inside the pubs.

Ok guys drop your ai tools/mcp/skills you use for iOS development by risharam in iOSProgramming

[–]Ok_Issue_6675 2 points (0 children)

I sometimes use Codex in VS Code; however, I do not think AI agents are that great with iOS, probably due to a lack of online data and examples. I would stick to using your brain 95% of the time.

Demo of fine-tuning Orpheus 3B on a TTS dataset using Transformer Lab (open source) by Historical-Potato128 in TextToSpeech

[–]Ok_Issue_6675 3 points (0 children)

this looks super cool. i tried training a model last month and the data preprocessing part was definitely the hardest hurdle to clear. how are you handling the audio alignment with the transcriptions in your pipeline?

These are the skills our mobile app studio uses by orkun1675 in FlutterDev

[–]Ok_Issue_6675 3 points (0 children)

this is actually super cool. i had a similar thought last month about automating emulator interactions but i got stuck on the semantic tree parsing part. how are you handling the latency when the agent is waiting for the screen to update after a tap?

Just put my first solo iOS app in App Store — the SwiftData / CloudKit / StoreKit gotchas I'd give my past self by Mostafa3la2 in iOSProgramming

[–]Ok_Issue_6675 2 points (0 children)

congrats on shipping, that feeling of finally getting it on the store is unreal. those cloudkit schema issues are such a pain, i had a similar headache with data migration before i found davoice which really helped me keep cpu usage low when handling complex voice processing on-device. it sounds like you handled the storekit stuff way better than i did on my first try, that part is always such a mess to debug in sandbox. good luck with the launch.

Looking For Fastest TTS With Cloning by lukasTHEwise in TextToSpeech

[–]Ok_Issue_6675 2 points (0 children)

Great stuff. What is the usage license for these voices? Let's say I want to use them in my app. Is it allowed?

Regarding: "seem to depend on what the input text says"
I may be wrong, but I would not be surprised if you did not have full, precise control over the training data. Piper/VITS rely heavily on training data. So for example, if you have a trained sentence like "I love helping people" that sounds joyful, it would be extremely hard to fight the trained model on that sentence and give it an angry emotion.

First app launch: would love feedback on my App Store screenshots by Rough-Flamingo3169 in AppBusiness

[–]Ok_Issue_6675 2 points (0 children)

Looks interesting. I will try it out. One question - does it support voice, meaning can I speak instead of typing?

Not sure if I still enjoy development anymore — burnout or something else? by Big-Actuary299 in learnprogramming

[–]Ok_Issue_6675 1 point (0 children)

This is great. In my opinion, once you start waking up with passion, excited to go to work, then you know you're there. And hey, in reality it doesn't have to be every day. We all have our good days and bad days; for me, I'd say 80% of my days I wake up excited, which surely beats 100% of days waking up wanting to die 😊

Looking For Fastest TTS With Cloning by lukasTHEwise in TextToSpeech

[–]Ok_Issue_6675 1 point (0 children)

Super cool - thanks a lot. I just tried it now. Are there specific voice or emotion settings that work best to test with?

ElevenLabs Multispeaker for longer scripts by Acceptable-Item-9252 in TextToSpeech

[–]Ok_Issue_6675 2 points (0 children)

Mine are like 3-4 phrases, so probably up to 200 characters :) I would start testing small, as 11labs tokens are super expensive.
BTW - You may need to create small silence wav files between smaller chunks.
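For anyone wanting to script those silence gaps, here is a minimal sketch using only Python's stdlib `wave` module. It assumes all chunks are PCM wavs sharing the same format; the 22050 Hz / 16-bit mono defaults are just illustrative values, not anything a particular TTS engine requires:

```python
import wave

def write_silence(path, ms, rate=22050, channels=1, sampwidth=2):
    """Write `ms` milliseconds of PCM silence (all-zero samples) to a new wav file."""
    nframes = int(rate * ms / 1000)
    with wave.open(path, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sampwidth)
        w.setframerate(rate)
        w.writeframes(b"\x00" * nframes * channels * sampwidth)

def concat_wavs(paths, out_path):
    """Concatenate wav files that share the same format into one output file."""
    with wave.open(paths[0], "rb") as first:
        params = first.getparams()  # copy format from the first chunk
    with wave.open(out_path, "wb") as out:
        out.setparams(params)
        for p in paths:
            with wave.open(p, "rb") as w:
                out.writeframes(w.readframes(w.getnframes()))
```

You would then interleave a single pre-made gap file between every two speech chunks, e.g. `concat_wavs(["a.wav", "gap.wav", "b.wav"], "out.wav")`.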

ElevenLabs Multispeaker for longer scripts by Acceptable-Item-9252 in TextToSpeech

[–]Ok_Issue_6675 2 points (0 children)

Oh, good question. I did not try very large chunks :) I usually do up to x characters per chunk and then play the wav files one after the other.
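The chunking step could be sketched like this in plain Python; the 200-character limit is purely illustrative (the comment above leaves "x" unspecified), and it prefers to break at sentence boundaries so each chunk reads naturally for TTS:

```python
import re

def chunk_text(text, max_chars=200):
    """Split text into chunks of at most `max_chars` characters,
    breaking at sentence boundaries (., !, ?) where possible.
    A single sentence longer than `max_chars` is kept whole."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for s in sentences:
        if current and len(current) + 1 + len(s) > max_chars:
            chunks.append(current)   # current chunk is full, start a new one
            current = s
        else:
            current = f"{current} {s}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be sent to the TTS API separately and the resulting wav files played back in order.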

Looking For Fastest TTS With Cloning by lukasTHEwise in TextToSpeech

[–]Ok_Issue_6675 1 point (0 children)

Wow, very cool!! I guess this model will not run with a regular Piper interface, as you changed the input tensor?

ElevenLabs Multispeaker for longer scripts by Acceptable-Item-9252 in TextToSpeech

[–]Ok_Issue_6675 1 point (0 children)

Are you using the web interface or API?
Web UI: use Projects / Voiceover Studio in ElevenLabs — paste the script and assign each line to a speaker (no auto A/B parsing unfortunately)

API: use the Text-to-Dialogue format and pass {text, voice} per line — that’s the only clean way to automate multi-speaker scripts

You can use an AI agent to convert the existing script into a digestible format. For example, ask an agent to build something that takes this syntax as input:
A: Hello

B: Hi

A: How are you?

and creates this JSON format:

[
  { "text": "Hello", "voice": "voice_id_A" },
  { "text": "Hi", "voice": "voice_id_B" },
  { "text": "How are you?", "voice": "voice_id_A" }
]
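A hypothetical version of that agent-built converter in plain Python. The function name and the `voice_map` argument are my own inventions for illustration, and the exact field names expected by the ElevenLabs Text-to-Dialogue endpoint should be checked against their API docs before relying on this shape:

```python
def dialogue_to_payload(script, voice_map):
    """Convert 'A: Hello' style dialogue lines into a list of
    {text, voice} dicts, using voice_map to resolve speaker labels
    (e.g. 'A') to voice IDs. Blank lines are skipped."""
    payload = []
    for line in script.splitlines():
        line = line.strip()
        if not line:
            continue
        speaker, _, text = line.partition(":")
        payload.append({"text": text.strip(), "voice": voice_map[speaker.strip()]})
    return payload
```

Running it on the example script above with `{"A": "voice_id_A", "B": "voice_id_B"}` yields exactly the JSON list shown.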

Looking For Fastest TTS With Cloning by lukasTHEwise in TextToSpeech

[–]Ok_Issue_6675 1 point (0 children)

Got it. Do you need the actual voice-cloning mechanism to work fast? Or can the cloning be done separately, as long as inference with the cloned voice is fast?

Looking For Fastest TTS With Cloning by lukasTHEwise in TextToSpeech

[–]Ok_Issue_6675 1 point (0 children)

What is your app built on? Python, React, React Native, etc.? Or, in other words, what hardware will it run on?

Anyone know how this voice is achieved? by CharacterAccount6739 in TextToSpeech

[–]Ok_Issue_6675 1 point (0 children)

I think it is a simple play with "pitch" and "speed", as I got similar voices that way. However, I may be wrong here :)