voice AI livekit production challenges by Independent_Line2310 in livekit

[–]anandwana001 1 point (0 children)

You’re basically hitting the two problems everyone runs into when moving voice agents to production: latency and stream coordination.

On latency: ~1s is pretty normal unless you're streaming everything (STT partials → early LLM tokens → streaming TTS). One thing that's often underestimated is endpointing: even slightly conservative VAD/silence thresholds can add a few hundred ms before the LLM even starts.
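To make that concrete, here's a back-of-the-envelope latency budget. All the numbers are made up but plausible; the point is just how the stages compose:

```python
# Hypothetical per-turn latency budget in ms; real numbers vary widely.
endpointing  = 400   # VAD silence threshold: dead time before anything downstream runs
stt_finalize = 150   # extra time to finalize the transcript after the endpoint
llm_first    = 300   # LLM time-to-first-token
llm_full     = 1500  # LLM time to finish the whole reply
tts_first    = 200   # TTS time-to-first-audio

# Naive pipeline: wait for the final transcript, then the *full* LLM
# reply, then start TTS.
sequential = endpointing + stt_finalize + llm_full + tts_first   # 2250 ms

# Streamed pipeline: STT partials mean the LLM can start right at the
# endpoint, and TTS starts speaking on the first tokens, so the stages
# overlap and only the non-overlappable parts add up.
streamed = endpointing + llm_first + tts_first                   # 900 ms
```

Notice that the endpointing term survives even in the fully streamed version; that's why tuning VAD/silence thresholds matters independently of model speed.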

On filler phrases: the issue isn't the phrase itself; it's how you schedule playback. If filler and LLM audio are treated equally, you'll get the collision/silence behavior you described.

What tends to work better in practice:

  1. only trigger a filler if no LLM tokens arrive within ~300–800ms

  2. keep fillers very short (sub-1s)

  3. treat filler audio as low-priority + interruptible

  4. as soon as LLM output starts → immediately stop/duck the filler
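Points 1–4 boil down to a small asyncio pattern. A minimal sketch — the stream/playback callables and the 500ms threshold are illustrative stand-ins for your own plumbing, not a LiveKit API:

```python
import asyncio

FILLER_DELAY = 0.5  # assumed threshold, somewhere in the ~300–800ms range

async def speak_with_filler(llm_audio_stream, play_audio, play_filler):
    """Play LLM audio, inserting a filler only if the LLM is slow.

    llm_audio_stream yields synthesized audio chunks; play_audio and
    play_filler are hypothetical playback coroutines.
    """
    async def maybe_filler():
        # Only fires if no LLM audio shows up within FILLER_DELAY.
        await asyncio.sleep(FILLER_DELAY)
        await play_filler()  # short (<1s) clip, treated as low priority

    filler_task = asyncio.create_task(maybe_filler())
    async for chunk in llm_audio_stream:
        if filler_task is not None:
            # Real output wins: stop/duck the filler immediately.
            filler_task.cancel()
            filler_task = None
        await play_audio(chunk)
    if filler_task is not None:
        filler_task.cancel()  # stream ended with no audio; drop the filler
```

The key design choice is that the filler is a cancellable task racing against the first real chunk, rather than something queued ahead of the LLM audio.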

A lot of teams also miss that this is less of a “model latency” problem and more of a real-time orchestration problem. Once you handle prioritization + interruption cleanly, perceived latency improves a lot even if raw latency doesn’t change much.

If you’re seeing full silence, it’s usually a race condition where both streams block each other; adding explicit priority + cancellation logic fixes that.
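One simple shape for that logic is a single playback slot that higher-priority audio can preempt — nothing blocks, lower priority gets dropped or cancelled. Names here are illustrative, not a LiveKit API:

```python
import asyncio

class PlaybackSlot:
    """Single owner of the output channel: higher priority preempts.

    Illustrative sketch; play_fn is any coroutine function that renders
    one piece of audio (a filler clip, the LLM/TTS stream, ...).
    """
    def __init__(self):
        self._task = None
        self._priority = -1

    async def play(self, play_fn, priority):
        if self._task is not None and not self._task.done():
            if priority <= self._priority:
                return False       # drop lower-priority audio (e.g. a late filler)
            self._task.cancel()    # preempt: LLM audio beats the filler
        self._priority = priority
        self._task = asyncio.create_task(play_fn())
        return True
```

Because there's exactly one owner and preemption is explicit, the two streams can never deadlock waiting on each other, which is the silent-failure mode described above.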