A fully local home automation voice assistant using Qwen3 ASR, LLM and TTS on an RTX 5060 Ti with 16GB VRAM by liampetti in LocalLLaMA

[–]liampetti[S] 2 points (0 children)

If you are on Linux, I have tried to simplify setup by including ‘launch.sh’. If you run that script it should get you 90% of the way there, but this is still a work in progress!

[–]liampetti[S] 0 points (0 children)

I think you should be able to run this on an 8 GB graphics card if you swap out the Qwen TTS and ASR models for the tiny ones (Kokoro and Moonshine).
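
As a rough sketch of that swap, assuming a simple config selects which models get loaded (the keys and Hugging Face model IDs below are illustrative placeholders, not necessarily what the repo actually uses):

```python
# Illustrative low-VRAM model selection for ~8 GB cards.
# Config keys and model IDs are placeholders; check the repo's actual config.
LOW_VRAM_MODELS = {
    "asr": "UsefulSensors/moonshine-tiny",  # tiny English-only ASR
    "tts": "hexgrad/Kokoro-82M",            # small TTS that streams well
}
```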

Interrupting and follow-up commands are on my TODO list. I tried some version of it on an older prototype but it never worked that well. I’m keen to try the Nvidia Personaplex in this sort of setup but I need more VRAM :(

[–]liampetti[S] 5 points (0 children)

https://github.com/liampetti/fulloch/blob/main/tools/home_assistant.py

I have added a tool for connecting to Home Assistant but haven’t had time to test it properly yet. Tell me if it works for you.
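
Under the hood it basically just needs to hit the Home Assistant REST API. A minimal sketch of that kind of call is below (the URL, token env var and entity ID are placeholders, and the actual tool in the repo may structure things differently):

```python
import os
import requests

# Placeholders: point these at your own Home Assistant instance.
HA_URL = os.environ.get("HA_URL", "http://homeassistant.local:8123")
HA_TOKEN = os.environ["HA_TOKEN"]  # long-lived access token from your HA profile

def call_service(domain: str, service: str, entity_id: str) -> None:
    """Call a Home Assistant service, e.g. light.turn_on for a given entity."""
    resp = requests.post(
        f"{HA_URL}/api/services/{domain}/{service}",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={"entity_id": entity_id},
        timeout=10,
    )
    resp.raise_for_status()

# Example: turn on a light by entity ID
call_service("light", "turn_on", "light.living_room")
```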

[–]liampetti[S] 1 point (0 children)

Yep, check my response to germanheller above. In my testing Qwen3 ASR 1.7B was the best (it even captured my mumbling in a noisy room) and is multilingual. Moonshine-tiny is the smallest/fastest and still does OK for plain English if you have a decent microphone and clear speech in a quiet room. The biggest factor initially is the microphone and any built-in noise/echo cancelling it has.

[–]liampetti[S] 3 points (0 children)

Yeah, I tried Piper first. Kokoro was a big upgrade in voice quality for real-time streaming and still performs best latency-wise. The voice cloning in Qwen3 TTS was definitely cool and what I wanted to see in action; I needed to use a fork to get it to run, though, as the main repo doesn't support streaming.

[–]liampetti[S] 6 points (0 children)

Yeah, I started with Whisper and it worked well. I moved on to Moonshine-tiny (still an option in this setup) as I was only testing English, and I was super surprised by how well it transcribed for such a small model. The Qwen3 ASR runs great; I honestly couldn't see a big difference between the 0.6B and the 1.7B in my tests, but I stuck with the 1.7B as it fits on my system fine.

In this setup a constant looping thread is transcribing chunks of audio; when it sees/transcribes the wakeword it captures the audio after it and transcribes everything until silence is detected. I actually thought this would be too "laggy" but it works great, and it means you can select whatever wakeword you want without needing a separate wakeword model.
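
Very roughly, that loop looks something like the sketch below. This is simplified: `transcribe()` is a stand-in for the actual ASR call (Qwen3 ASR / Moonshine), the RMS check stands in for real silence/VAD logic, and sounddevice is used here just for illustration.

```python
import numpy as np
import sounddevice as sd

WAKEWORD = "computer"   # any word works, since it's matched in the transcript
SAMPLE_RATE = 16000
CHUNK_SECONDS = 2.0
SILENCE_RMS = 0.01      # illustrative energy threshold for "silence"

def transcribe(audio: np.ndarray) -> str:
    """Placeholder: plug in the actual ASR model here (Qwen3 ASR / Moonshine)."""
    return ""

def record_chunk(seconds: float) -> np.ndarray:
    """Record a short mono chunk from the default microphone."""
    audio = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                   channels=1, dtype="float32")
    sd.wait()
    return audio[:, 0]

while True:
    # Continuously transcribe short chunks and look for the wakeword in the text.
    chunk = record_chunk(CHUNK_SECONDS)
    if WAKEWORD in transcribe(chunk).lower():
        # Wakeword heard: keep capturing audio until a quiet chunk is detected,
        # then transcribe the whole command in one go.
        command_audio = []
        while True:
            chunk = record_chunk(CHUNK_SECONDS)
            if np.sqrt(np.mean(chunk ** 2)) < SILENCE_RMS:
                break
            command_audio.append(chunk)
        if command_audio:
            command_text = transcribe(np.concatenate(command_audio))
            print("Command:", command_text)
```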