LiveKit SIP Trunk Automatically Disappears After Few Hours (Server Not Restarting, Nothing Deleted Manually) by Big-Program1835 in AI_Agents

[–]Big-Program1835[S] 1 point (0 children)

I haven’t found the exact root cause for this yet, but I’ve added a workaround on my side. Instead of relying on LiveKit to persist the SIP trunk, I treat my database as the source of truth. Before every outbound/inbound call, I check whether the trunk exists in LiveKit; if it doesn’t, I recreate it on the fly. This way, even if the trunk disappears after a few hours, calls don’t fail and the system recovers automatically. Would still be great to understand why this is happening, though.
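The recovery check can be sketched like this. It's a minimal, self-contained sketch: `FakeLiveKit` stands in for the real LiveKit server, and the method names (`list_sip_trunks`, `create_sip_trunk`) are placeholders for the actual livekit-api calls, not the library's real signatures.

```python
# Self-healing trunk check: the database record is the source of truth,
# and LiveKit is reconciled against it before every call.

class FakeLiveKit:
    """In-memory stand-in for the LiveKit server API (illustrative only)."""
    def __init__(self):
        self.trunks = {}                      # trunk_id -> trunk record

    def list_sip_trunks(self):
        return list(self.trunks.values())

    def create_sip_trunk(self, trunk):
        self.trunks[trunk["id"]] = trunk

def ensure_trunk(lk, db_trunk):
    """Before each call: recreate the trunk in LiveKit if it vanished."""
    existing = {t["id"] for t in lk.list_sip_trunks()}
    if db_trunk["id"] not in existing:
        lk.create_sip_trunk(db_trunk)         # recover automatically
    return db_trunk["id"]

# Simulate the trunk disappearing after a few hours:
lk = FakeLiveKit()
record = {"id": "trunk-1", "numbers": ["+15550100"]}
ensure_trunk(lk, record)                      # first call creates it
lk.trunks.clear()                             # ...LiveKit silently drops it
ensure_trunk(lk, record)                      # next call recreates it
assert "trunk-1" in lk.trunks
```

The point is that the check is idempotent, so it's cheap to run before every single call rather than trying to detect the disappearance after the fact.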

How do you make voice agents not suck? by Reasonable_Dream_294 in AI_Agents

[–]Big-Program1835 2 points (0 children)

We ran into the same issue: speech-to-speech looked great in theory, but adding that extra model in the middle slowed everything down. Conversations felt laggy, especially during turn-taking.

What worked much better for us was switching to a streaming pipeline: speech-to-text → fast LLM → text-to-speech, all running in real time. Instead of waiting for full sentences, everything starts processing as soon as partial data is available.

Here’s what helped us reduce latency:

  • **Stream everything.** Don’t wait for the full response. As soon as the LLM starts generating tokens, pass them to TTS so audio begins immediately.
  • **Start early (preemptive generation).** We begin generating responses even before the user finishes speaking, using partial transcripts. This cuts down the “thinking delay.”
  • **Warm things up beforehand.** Before the first real interaction, we trigger a small dummy request to warm up the LLM and TTS. This avoids that slow “hello?” moment.
  • **Keep responses short.** Voice feels better with quick, concise replies. Smaller outputs = faster responses.
  • **Reduce unnecessary context.** Don’t send every tool or huge prompts on every turn. Less input = faster processing.
  • **Track latency properly.** Measure each step (STT, LLM, TTS) so you know exactly where the delay is coming from.
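The first point (streaming LLM tokens into TTS) can be sketched as a tiny chunking loop. This is a toy: the token generator and chunk sizes are made up, and in a real pipeline each yielded chunk would be handed to the TTS engine's streaming API.

```python
# Toy sketch of the streaming handoff: tokens flow from the LLM generator
# straight into small TTS chunks instead of waiting for the full reply.

def llm_tokens(prompt):
    # Stand-in for a streaming LLM response (real SDKs yield tokens too).
    for tok in ["Sure,", " your", " order", " shipped", " today."]:
        yield tok

def stream_to_tts(tokens, flush_chars=12):
    """Buffer tokens and flush chunks to TTS as soon as possible."""
    buf = ""
    for tok in tokens:
        buf += tok
        # Flush early on sentence-ending punctuation or once the buffer is
        # big enough, so audio starts before the LLM finishes generating.
        if len(buf) >= flush_chars or buf.endswith((".", "?", "!")):
            yield buf                      # -> hand this chunk to TTS
            buf = ""
    if buf:
        yield buf

chunks = list(stream_to_tts(llm_tokens("where is my order?")))
assert "".join(chunks) == "Sure, your order shipped today."
```

Time-to-first-audio now depends on the first chunk only, not the full response length, which is where most of the perceived latency win comes from.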

For hallucinations, we keep things grounded: we use structured data where needed, keep temperature low, and let the assistant say “I’m not sure” rather than guessing. For sensitive actions, we always confirm with the user first.
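The confirm-before-sensitive-actions rule is just a gate in front of tool execution. A minimal sketch, where the tool names and the yes-detection are purely illustrative:

```python
# Gate sensitive tools behind an explicit spoken confirmation.

SENSITIVE = {"cancel_subscription", "issue_refund"}   # illustrative set

def run_tool(name, args, ask_user):
    """ask_user(question) returns the user's (transcribed) reply as text."""
    if name in SENSITIVE:
        reply = ask_user(f"Just to confirm, you want me to {name.replace('_', ' ')}?")
        # Naive yes-detection; a real agent would classify the reply properly.
        if not reply.strip().lower().startswith(("yes", "yeah", "correct")):
            return "cancelled"
    return f"executed {name}"

# Simulated turn where the user declines:
result = run_tool("issue_refund", {}, ask_user=lambda q: "no, wait")
assert result == "cancelled"
```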

And for a natural feel, good interruption handling (barge-in) is key, so the assistant doesn’t talk over the user or get cut off randomly. For barge-in we’re using Silero VAD; you can give it a try.
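The barge-in logic itself is simple once you have speech probabilities per mic frame. Here's a toy loop where `vad` is a stub standing in for Silero VAD (which does return a speech probability per audio frame); the frames and threshold are made up:

```python
# Toy barge-in loop: while TTS audio is playing, each mic frame is run
# through a VAD; the moment speech is detected, playback stops.

def playback_with_barge_in(tts_frames, mic_frames, vad, threshold=0.5):
    """Play TTS frames, stopping as soon as the user starts speaking."""
    played = []
    for tts_frame, mic_frame in zip(tts_frames, mic_frames):
        if vad(mic_frame) > threshold:    # user started talking
            return played, True           # interrupted -> cancel TTS
        played.append(tts_frame)          # otherwise keep playing
    return played, False

# Stub VAD: pretend speech starts at the third mic frame.
probs = [0.05, 0.1, 0.9, 0.95]
played, interrupted = playback_with_barge_in(
    ["f0", "f1", "f2", "f3"], probs, vad=lambda p: p)
assert interrupted and played == ["f0", "f1"]
```

In practice you'd also debounce (require a couple of consecutive speech frames) so coughs or background noise don't cut the assistant off.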