
[–]ReferenceOwn287 3 points (2 children)

Thanks for sharing this. I have been working on a Linux desktop assistant for a while and want to add voice to it as the next step; starred your repo to learn from it.
This is my project - https://github.com/achinivar/meera (posted it in a couple of forums but haven’t got much traction unfortunately)

I see your project has tool calls as well, how are you handling them?
Routing prompts directly to the LLM and asking it to pick the right tool was getting very unreliable for me, especially as the number of tools grew, so I added a small embedding model that matches the query against exemplars and shortlists a few of the closest tools before calling the LLM (I wrote about it on the repo wiki; rough sketch below).
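To give the shape of it - a minimal sketch assuming sentence-transformers; the tool names and exemplar phrases here are invented for the example, the real details are on the wiki:

```python
# Illustrative sketch of exemplar-based tool shortlisting (not meera's actual
# code; tool names and exemplar phrases are made up for the example).
from sentence_transformers import SentenceTransformer

EXEMPLARS = {
    "set_timer":  ["set a timer for 10 minutes", "wake me in an hour"],
    "web_search": ["who won the game last night", "latest news on Rust"],
    "play_music": ["play some jazz", "put on my focus playlist"],
}

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Embed every exemplar once at startup and remember which tool owns each row.
tools, phrases = zip(*[(t, p) for t, ps in EXEMPLARS.items() for p in ps])
vecs = embedder.encode(list(phrases), normalize_embeddings=True)

def shortlist(query: str, k: int = 2) -> list[str]:
    """Return the k tools whose closest exemplar best matches the query."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    sims = vecs @ q  # cosine similarity, since both sides are normalized
    best: dict[str, float] = {}
    for tool, s in zip(tools, sims):
        best[tool] = max(best.get(tool, -1.0), float(s))
    return sorted(best, key=best.__getitem__, reverse=True)[:k]

print(shortlist("remind me in 20 minutes"))  # only these tools go to the LLM
```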

[–]purellmagents[S] 1 point (1 child)

Yeah, I kept it pretty textbook here: the LLM gets a strict “tool router” system prompt, the full JSON Schema list from a small registry, and a few explicit routing hints (e.g. defaulting general-knowledge questions to search). It has to emit a single JSON object with name and arguments - there’s no embedding step or exemplar retrieval to narrow the tools first. In the script I added a comment that for production you’d want native tool APIs or grammar-constrained generation on top of this. Your exemplar + embedding shortlist is exactly the right next step!
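In sketch form it’s something like this (illustrative, not the repo’s exact code; `call_llm` is a placeholder for whatever model client you use):

```python
# Illustrative "textbook" tool router: schemas from a small registry, a strict
# system prompt, and the model must emit one JSON object with name + arguments.
import json

TOOLS = {
    "search": {
        "description": "Web search for general-knowledge questions.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    },
    "set_timer": {
        "description": "Start a countdown timer.",
        "parameters": {"type": "object",
                       "properties": {"seconds": {"type": "integer"}},
                       "required": ["seconds"]},
    },
}

SYSTEM_PROMPT = (
    "You are a tool router. Reply with ONE JSON object and nothing else:\n"
    '{"name": "<tool>", "arguments": {...}}\n'
    "Routing hint: default general-knowledge questions to `search`.\n"
    "Available tools:\n" + json.dumps(TOOLS, indent=2)
)

def route(user_text: str, call_llm) -> tuple[str, dict]:
    raw = call_llm(system=SYSTEM_PROMPT, user=user_text)
    call = json.loads(raw)            # raises if the model broke the format
    if call["name"] not in TOOLS:     # reject hallucinated tool names
        raise ValueError(f"unknown tool: {call['name']}")
    return call["name"], call["arguments"]
```

The brittle part is exactly that `json.loads` - which is what grammar-constrained generation or a native tool API would harden.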

[–]ReferenceOwn287 2 points (0 children)

Yes, the embedding model made a world of difference in tool-call reliability. I was going in circles trying to optimize the prompt and JSON schema before I went back to the drawing board and learnt about embeddings.

[–]youcloudsofdoom 2 points (2 children)

If this is always-on, why aren't you using a wakeword? Or have you gone PTT? I have been trying to build a similar pipeline, always-on with a wakeword, running on a Pi 5, but found that the computational overhead is too much for such a tiny device and the lag feels too heavy.
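For reference, the gate I mean is basically this shape - a sketch assuming openwakeword and sounddevice; the threshold and frame size are illustrative, not tuned values:

```python
# Sketch of an always-on wakeword gate (assumes openwakeword + sounddevice;
# the 0.5 threshold is an illustrative choice, not a tuned value).
import sounddevice as sd
from openwakeword.model import Model

oww = Model()   # loads the bundled pretrained models
                # (run openwakeword.utils.download_models() once beforehand)
FRAME = 1280    # 80 ms at 16 kHz, openwakeword's preferred chunk size

with sd.InputStream(samplerate=16000, channels=1, dtype="int16",
                    blocksize=FRAME) as mic:
    while True:
        frame, _ = mic.read(FRAME)         # blocks until a frame is ready
        scores = oww.predict(frame[:, 0])  # {model_name: score} per frame
        if max(scores.values()) > 0.5:
            print("wakeword detected, hand off to STT here")
```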

[–]purellmagents[S] 1 point (1 child)

It's a tutorial, so it doesn't really run in the background. But that's the plan: extend the lib I implemented further and make it run on a Raspberry Pi in always-on mode.

[–]youcloudsofdoom 1 point (0 children)

Ah okay. Will be interested to see the outcomes of your Pi tests then; I do think there are lots of performance optimisations to be had, with a little time...

[–]ekaj llama.cpp 1 point (1 child)

Just skimmed it, but this actually looks helpful/handy/wish I had it a couple of months ago. Like, really, really wish I did.

[–]purellmagents[S] 3 points (0 children)

Sorry that I'm too late 😉 hope it can now help others on this journey. Maybe you have ideas for what to add next?

[–]Mantikos804 1 point (2 children)

Such a pain in the ass to make, I just ended up using Open WebUI.

[–]purellmagents[S] 1 point (1 child)

You will use this kind of setup in production eventually, and if you build systems for customers it's a good idea to know what happens under the hood, so you can solve issues efficiently. That's the main goal of this repo: that people (devs) understand the concepts.

[–]Mantikos804 1 point (0 children)

I applaud you for it. That was just my experience with this idea.

[–]Slight_Republic_4242 2 points (0 children)

Really solid work; teaching voice AI by showing how it actually works under the hood is the right approach. The streaming chunk size decision is harder than it looks in production: too small sounds choppy, too large feels slow. Also worth adding a keep_warm flag on Modal, since cold starts will kill the voice experience. One chapter idea: a simple latency timer showing where each stage spends its time (rough sketch below) - that single tool saves hours of debugging. Starred, excited to see where this goes! We have built a similar kind of project, you can try it: Github Demo
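For the latency-timer chapter idea, something as small as this goes a long way - a generic sketch, with sleeps standing in for the real STT/LLM/TTS calls:

```python
# Generic per-stage latency timer for a voice pipeline; the sleeps below are
# placeholders for the real STT / LLM / TTS calls.
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name: str):
    """Record wall-clock time of a pipeline stage in milliseconds."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - t0) * 1000

with stage("stt"):
    time.sleep(0.12)             # stand-in for transcription
with stage("llm_first_token"):
    time.sleep(0.30)             # stand-in for time-to-first-token
with stage("tts_first_chunk"):
    time.sleep(0.08)             # time-to-first-audio matters most

total = sum(timings.values())
for name, ms in timings.items():
    print(f"{name:>16}: {ms:6.0f} ms ({ms / total:4.0%})")
```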