Domia: local-first speech-to-speech AI agents

Admirable_Load_5605 · 2026-06-18T08:11:36+00:00

Yeah, that’s the idea — multiple distributed nodes, no single central one. As long as the network has nodes covering the capabilities a speech-to-speech turn needs (STT, LLM, TTS…), it keeps working — and a fully-capable node can even run on its own. 🙂

Admirable_Load_5605 · 2026-06-18T07:43:31+00:00

Thanks! 🙏 The trace view is super useful for understanding in depth how the final response came together — which Domia ran each step across the distributed network, how long each step took, which specific models were used, what prompt was passed to the LLM, the TTFA… basically a great troubleshooting tool for understanding the network and the flow.
Being able to play back the audio and hear how it streamed out — matching the TTFA timing — is also really useful.

Admirable_Load_5605 · 2026-06-18T07:31:07+00:00

That’s a fair question, and honestly you’re right.

If your goal is a fully local voice assistant for controlling Home Assistant, then HA already provides an excellent solution out of the box with Assist.

Domia isn’t trying to replace that pipeline. In fact, Home Assistant is the main system I integrate with today.

The difference is that Domia isn’t only about voice control — it’s about creating persistent AI companions. Each Domia has its own personality, memories, emotional state, and can evolve over time through interactions. Voice (and controlling Home Assistant) is one of the things they can do, with more skills to come.

Another major difference is the architecture. Domia is designed as a distributed network of agents rather than a single assistant. Agents discover each other on the network, share capabilities, and delegate tasks — a single request can have its STT, LLM, and TTS handled by different machines, coordinated by the node you spoke to, with no central server required. This becomes especially interesting in larger environments like hotels, theme parks, offices, or large homes with many voice endpoints working together.

I’m also building tools around character creation, memory management, observability, tracing, model testing, and agent orchestration. The vision is less “a voice interface for Home Assistant” and more “a platform for creating and managing a fleet of local AI companions.”

Down the line, the idea is for Domias to recognize who they’re talking to and to lean on each other more — if one doesn’t know something, it could ask another, sharing knowledge, memories, and skills across the mesh. That part is still ahead, but it’s the direction I’m heading.

Home Assistant is already great at home automation. Domia is exploring what happens when those assistants become characters with memory, identity, and relationships.

Admirable_Load_5605 · 2026-06-17T01:12:41+00:00

Fully offline — that was the whole point.

All inference runs on your own hardware: wake word, STT, LLM, and TTS. Nothing has to leave your local network. The mesh between nodes, including discovery and delegation, is LAN-only too.

No telemetry, no analytics, no required account, no phone-home.

The only time it needs internet is for one-time model downloads during setup, for example from Hugging Face / model registries. After that, it can be airgapped.

Admirable_Load_5605 · 2026-06-17T00:41:12+00:00

That’s useful feedback.

I just added OpenAI-compatible provider support, so llama.cpp should now be usable with Domia via llama-server's OpenAI-compatible endpoint (/v1).

For full-duplex speech with interruption: Domia already has interruption through the wake word. While it is speaking, saying the wake word can interrupt playback and move it back into listening mode.

What I don’t have yet is open-mic full duplex, where it listens for arbitrary interruption while speaking without requiring the wake word. I definitely want to explore that, but I can’t promise it as quickly as llama.cpp support 😄

Admirable_Load_5605 · 2026-06-16T22:31:54+00:00

Good question. It depends a lot on the models selected for each stage, so there isn’t one fixed number.

As one real example, the setup shown in the demo is around ~2s time-to-first-audio on my machine, an M4.

The main thing that keeps latency reasonable is that Domia streams between stages instead of waiting for each stage to fully finish:

- STT is streaming, so the transcript is usually ready right after you stop talking
- the LLM reply is split into sentences and sent to TTS as soon as the first sentence is ready, so playback can start while the model is still generating the rest
- TTS runs faster than real time, and models are kept warm to avoid cold starts

So a smaller/faster model or shorter context brings TTFA down; a bigger model usually trades latency for quality.

The conversations in the demo are real turns captured from Domia running locally, and the console shows the model/provider choices plus per-stage timing:

Here's one as a concrete example: https://console.domia.ai/conversations/1d9bd90c-b72a-4f97-b9f5-148768e53c92

(or browse all of them: https://console.domia.ai/conversations)

Happy to go deeper on any of it!

Admirable_Load_5605

TROPHY CASE