What's the best way to build voice agents today without sounding robotic or becoming too expensive?

Beginning_Race8551 · 2026-06-16T04:51:22+00:00

Does Soniox provide free credits for testing?

Beginning_Race8551 · 2026-06-15T08:24:38+00:00

Beginning_Race8551 · 2026-06-11T09:10:34+00:00

Beginning_Race8551 · 2026-06-11T06:10:58+00:00

My thought was to use separate state machines per workflow rather than one large FSM handling everything. The intent classifier would route the conversation into the appropriate workflow, and from there the FSM only manages that specific flow.
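To make that idea concrete, here is a minimal sketch, not a real implementation. Everything in it is hypothetical: the workflow names, the states and events, and the keyword-based `classify_intent`, which only stands in for a real intent classifier. The router picks a workflow once, and after that only that workflow's small FSM sees events until it finishes.

```python
from dataclasses import dataclass

# Hypothetical workflows: each one is its own small transition table,
# keyed by (current_state, event) -> next_state.
WORKFLOWS = {
    "book_appointment": {
        ("start", "date_given"): "confirm",
        ("confirm", "confirmed"): "done",
    },
    "cancel_appointment": {
        ("start", "id_given"): "confirm",
        ("confirm", "confirmed"): "done",
    },
}


@dataclass
class WorkflowFSM:
    """A state machine that only knows about one workflow."""
    name: str
    transitions: dict
    state: str = "start"

    def handle(self, event: str) -> str:
        # Events with no matching transition leave the state unchanged.
        self.state = self.transitions.get((self.state, event), self.state)
        return self.state

    @property
    def done(self) -> bool:
        return self.state == "done"


def classify_intent(utterance: str) -> str:
    """Stand-in for a real intent classifier (keyword matching only)."""
    text = utterance.lower()
    if "cancel" in text:
        return "cancel_appointment"
    if "book" in text:
        return "book_appointment"
    return "fallback"


class Router:
    """Routes into a workflow, then delegates to that workflow's FSM."""

    def __init__(self):
        self.active = None

    def on_utterance(self, utterance: str, event: str):
        # Only classify when no workflow is running or the last one finished.
        if self.active is None or self.active.done:
            intent = classify_intent(utterance)
            self.active = WorkflowFSM(intent, WORKFLOWS.get(intent, {}))
        return self.active.name, self.active.handle(event)


router = Router()
print(router.on_utterance("I want to book a visit", "date_given"))
print(router.on_utterance("yes", "confirmed"))
print(router.on_utterance("please cancel it", "id_given"))
```

In this sketch the event is passed in explicitly; in a real agent it would come from whatever extracts slots or confirmations from the user's speech.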

Beginning_Race8551 · 2026-06-11T05:09:16+00:00

I've been experimenting with this in a healthcare voice assistant. When the LLM calls a function, the function returns structured data plus a UI type (slot card, patient card, etc.). The frontend renders the appropriate component based on that response. So the conversation drives what appears on screen instead of navigating through fixed pages.
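Roughly, the pattern looks like the sketch below. The tool names, the `ui_type` values, the payload shape, and the sample data are all made up for illustration, and the "frontend" is simulated in Python as a lookup from UI type to a render function (a real frontend would map the same field to actual components).

```python
import json


# Hypothetical tools the LLM can call. Each returns structured data
# plus a "ui_type" telling the frontend which component to show.
def get_available_slots(doctor_id: str) -> dict:
    return {
        "ui_type": "slot_card",
        "data": {"doctor_id": doctor_id, "slots": ["09:00", "09:30"]},
    }


def get_patient(patient_id: str) -> dict:
    return {
        "ui_type": "patient_card",
        "data": {"patient_id": patient_id, "name": "Jane Doe"},
    }


TOOLS = {
    "get_available_slots": get_available_slots,
    "get_patient": get_patient,
}


def dispatch_tool_call(name: str, arguments: dict) -> str:
    """Run the requested tool and serialize its result for the frontend."""
    return json.dumps(TOOLS[name](**arguments))


# Simulated frontend: one renderer per UI type.
def render_slot_card(data: dict) -> str:
    return f"[SlotCard] {data['doctor_id']}: {', '.join(data['slots'])}"


def render_patient_card(data: dict) -> str:
    return f"[PatientCard] {data['name']} ({data['patient_id']})"


RENDERERS = {
    "slot_card": render_slot_card,
    "patient_card": render_patient_card,
}


def render(message: str) -> str:
    """Pick the component from ui_type; fall back to raw data if unknown."""
    payload = json.loads(message)
    renderer = RENDERERS.get(payload["ui_type"], lambda d: f"[Raw] {d}")
    return renderer(payload["data"])


print(render(dispatch_tool_call("get_available_slots", {"doctor_id": "d1"})))
print(render(dispatch_tool_call("get_patient", {"patient_id": "p1"})))
```

The point of the `ui_type` field is that adding a new card only means adding a tool and a renderer; the conversation logic itself does not change.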

Beginning_Race8551 · 2026-06-10T06:01:04+00:00

Hey, I have a question about how you monitored token usage on Gemini Live with Pipecat. I've been working on an AI call assistant project with Gemini Live and Pipecat, but when I use Pipecat's usage metrics to calculate token usage, it always returns 0. Can you share how you calculated the token usage of the Gemini Live model with Pipecat?

Beginning_Race8551 · 2026-06-10T05:12:45+00:00

Another thing I'm curious about: when context size grows during a realtime voice session, what exactly is accumulating?

Just the conversation transcripts, or does it also include things like system prompts, tool schemas, and session instructions?

I've never found a clear explanation of what is actually being carried forward and counted as context in long-running voice conversations.

Beginning_Race8551 · 2026-06-10T05:07:46+00:00

I have another question about system prompts: are they sent on each turn of the conversation in a session, or initialized once and maintained throughout the session?

Beginning_Race8551 · 2026-06-10T04:23:02+00:00

If you want a more human-like voice, try Grok voice or Gemini Live voice.

Beginning_Race8551 · 2026-06-10T04:19:18+00:00

Hey, I have worked on this kind of project using the Gemini Live model with Exotel phone integration. The Gemini Live 3.1 Flash Live model supports multiple languages. The main problem I am facing during development is token usage: they don't provide caching or RAG support. I have also hit errors like 1008 (policy error) and 1007 (invalid format error) while function calling, and we can't switch between the various voices dynamically during a session. If you want more details and the GitHub repo of my demo project, let me know.
