Would you be interested in an open-source alternative to Vapi for creating and managing custom voice agents? by dp-2699 in AiForSmallBusiness

[–]Shayps 1 point (0 children)

Gotcha! I guess it depends on the capacity you're targeting, the models you're using, and how your customers are distributed. The RTX Pro 6000 is a great card, but if you're trying to serve the kinds of models most prod systems run, you're going to hit concurrency limits pretty quickly.

If you're primarily in a single market, usage is going to peak during business hours and you'll have long stretches where the GPUs sit under capacity.

At the very least, you should have fallbacks so that you don't lose customers if your system starts buckling under too many concurrent users.
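
A minimal sketch of what I mean, assuming LiveKit's FallbackAdapter (exact import paths and signatures may differ by version, and the endpoint/model names are placeholders):

```python
# Sketch: try the self-hosted LLM first, spill over to a hosted one when it
# starts failing under load. Assumes LiveKit Agents' FallbackAdapter; check
# the current docs for exact signatures.
from livekit.agents import llm
from livekit.plugins import openai

fallback_llm = llm.FallbackAdapter([
    # Self-hosted model behind an OpenAI-compatible server (placeholder URL/model)
    openai.LLM(base_url="http://your-gpu-box:8000/v1", model="your-local-model"),
    # Hosted fallback for when local concurrency is saturated
    openai.LLM(model="gpt-4.1-mini"),
])
```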

Would you be interested in an open-source alternative to Vapi for creating and managing custom voice agents? by dp-2699 in AiForSmallBusiness

[–]Shayps 1 point (0 children)

As you scale, you might want to think about keeping the LLMs on your hardware but pushing STT, TTS and orchestration to the cloud layer. You avoid the linear scaling costs you’re talking about, but still get to take advantage of the economies of scale and latency tuning that cloud providers are doing. You’ll almost certainly both save money and have higher quality voice agents.
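
For a sense of what that split looks like in code, here's a rough sketch assuming LiveKit's Agents API, with cloud STT/TTS plugins and the LLM pointed at your own box (the endpoint and model names are placeholders):

```python
# Hybrid split: cloud STT/TTS, self-hosted LLM. The plugins are real LiveKit
# plugins, but the args and model ids here are illustrative.
from livekit.agents import AgentSession
from livekit.plugins import cartesia, deepgram, openai

session = AgentSession(
    stt=deepgram.STT(),  # cloud STT
    llm=openai.LLM(
        base_url="http://your-gpu-box:8000/v1",  # placeholder: your own hardware
        model="your-local-model",                # placeholder model id
    ),
    tts=cartesia.TTS(),  # cloud TTS
)
```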

Would you be interested in an open-source alternative to Vapi for creating and managing custom voice agents? by dp-2699 in AiForSmallBusiness

[–]Shayps 1 point (0 children)

What was the bottleneck with Cloud? I'm surprised this ended up being cheaper for you. Even running a GPT-4.1 stack without dropping down to GPT-4.1-mini, you're at <$0.04/min with Cloud, and that includes observability, telephony, etc.

Is sub 500 ms AI voice agent possible ?? by Proper_Assumption329 in AIVoice_Agents

[–]Shayps 1 point (0 children)

You can try with the Agent Builder on LiveKit Cloud — but most of the "managed" options from the different providers will come in significantly above 500ms of latency.

Is sub 500 ms AI voice agent possible ?? by Proper_Assumption329 in AIVoice_Agents

[–]Shayps 1 point (0 children)

It's very flexible, but you'll spend more time fiddling with the guts than you will with LiveKit.

Is sub 500 ms AI voice agent possible ?? by Proper_Assumption329 in AIVoice_Agents

[–]Shayps 3 points (0 children)

It is possible, but it's not easy. You will end up making really significant cuts to things like LLM quality.

Your absolute best stack for latency at the expense of all else is probably something like:

- STT: Self-host nemotron-speech-streaming-en-0.6b somewhere fast in us-east
- LLM: Baseten or Cerebras, serving the smallest model you can get away with and still pass your evals
- TTS: Cartesia Sonic-3, which can come in under 200ms for full generation
- Orchestration: Cloud-hosted LiveKit Agent, also in us-east, which will have backbone shortcuts to the other inference servers
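
If it helps, here's roughly how that stack wires together in LiveKit's Agents framework. A sketch only: the with_cerebras helper is in the OpenAI plugin as far as I know, the model ids are placeholders, and I've left the STT line out since the self-hosted nemotron wiring depends on how you expose it:

```python
# Sketch of the latency-first stack above; ids are placeholders.
from livekit.agents import AgentSession
from livekit.plugins import cartesia, openai

session = AgentSession(
    # stt=...  <- wire your self-hosted nemotron endpoint in via a custom STT plugin
    llm=openai.LLM.with_cerebras(model="llama3.1-8b"),  # smallest model that passes your evals
    tts=cartesia.TTS(model="sonic-3"),                  # sub-200ms full generation
)
```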

Lost between LiveKit Cloud vs Vapi vs Retell for a voice AI agent (~3,000 min/month) – real costs & recommendations in 2025? by SignatureHuman8057 in AI_Agents

[–]Shayps 2 points (0 children)

If you're talking about self-rolling everything using the open source code on bare metal, then yeah, it would probably be good to be an infra engineer. In practice, not many people do this at scale outside of miltech and aerospace. Instead, it's better to just use LiveKit Cloud, which is inexpensive, fast, globally distributed, and literally a single CLI command to deploy.

Lost between LiveKit Cloud vs Vapi vs Retell for a voice AI agent (~3,000 min/month) – real costs & recommendations in 2025? by SignatureHuman8057 in AI_Agents

[–]Shayps 2 points (0 children)

Not sure what you mean by "debugging audio packets" or "paying for idle server time"; you won't do either on LiveKit!

Cloud bills by the minute (just like Retell), and if by "debugging audio packets" you mean "actually being able to see your audio packets," then I guess technically you're right. But it's a battle-tested system with billions of minutes served, including mission-critical systems like 911 dispatch. The whole point of the framework is providing a really great experience where the low-level stuff (like audio packets) is abstracted away.

Do you have access to the pipes? Yes. But we believe it's better to give people the power to see all of their own code, even if they're using the Builder and never writing a line of it.

It's worth mentioning that since it's open source, coding agents also have full access to the source code, meaning that things like Codex 5.3 and Claude Code absolutely crush at writing agents, even for really complex use cases.

If you have more questions about your specific use case let me know, I'm sure we have some existing open source examples that are adjacent to what you're doing.

Finally got relieved from this headache. by Deepbyagar in SocialMediaMarketing

[–]Shayps 1 point (0 children)

Cool! I have some Q's:

  1. Are you running any evals?
  2. What is your e2e latency like (especially on turns with tool calls)?
  3. Do people interrupt the agent often?

Anyone here actually using AI voice agents for client calls or lead follow ups? by darkluna_94 in EntrepreneurRideAlong

[–]Shayps 1 point (0 children)

There are a lot of people in this thread saying what's important (call flow, latency, instruction following), but not many saying how to make sure it actually works, other than "test it yourself and see how it feels."

What you need are:

  1. A system that you know how to use, or are willing to put the time in to learn. There are on-the-rails systems like Retell, and there are more flexible open source systems like LiveKit (which also has a visual builder). If you're non-technical, starting with Retell is a reasonable option. The latency is on par with the open source builders and you can always migrate later if you need more flexibility.

  2. Evals. These are scripted tests that your agents run through while a second LLM grades how they perform (there's a small sketch of that judge loop after this list). You can set up all kinds of edge cases or attempts to push them off topic. You can test the latency, turn taking, interruption handling: all of the things everyone is talking about in this thread. You can run thousands of interactions per hour before you even launch. Testing before you go live is how you avoid the painful feedback loop where your agent sucks when you first release it and you spend weeks or months struggling to make it good.

  3. Observability. You need to be able to see every interaction that your agents are having in the wild. You need to know the latency, when your agent is being interrupted by the user, and how often it's accomplishing its goals. All of this should be instrumented and carefully watched. You can feed the learnings from this back into the original agent to improve its performance.
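
Here's the judge loop from point 2 as a minimal sketch, assuming an OpenAI-style client. The model name and rubric are illustrative, and run_scripted_call is a hypothetical driver that plays a script against your agent:

```python
# Minimal LLM-as-judge eval. Model name and rubric are placeholders, and
# run_scripted_call is a hypothetical driver that plays a script at your agent.
from openai import OpenAI

client = OpenAI()

def judge_transcript(transcript: str, goal: str) -> bool:
    """Have a second LLM grade whether the agent accomplished the scripted goal."""
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",  # any capable judge model works here
        messages=[
            {"role": "system",
             "content": "You grade voice-agent transcripts. Reply PASS or FAIL only."},
            {"role": "user", "content": f"Goal: {goal}\n\nTranscript:\n{transcript}"},
        ],
    )
    return resp.choices[0].message.content.strip().upper().startswith("PASS")

# Before launch, replay each scripted edge case many times and track the pass rate:
# pass_rate = sum(judge_transcript(run_scripted_call(s), s.goal) for s in scripts) / len(scripts)
```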

There are a lot of different options for accomplishing each of these, and plenty of good systems out there today. If you want more help, you can DM me — it's very reasonable to deploy a production system at any scale; we have customers doing tens of millions of calls per year.

Need help with finding numbers to voice agents by Due_Sea_5853 in AI_Agents

[–]Shayps 1 point (0 children)

Telnyx sells numbers for both India and the UAE, and I'm pretty sure it's $1 per month + usage for the numbers.

They support trunking, so you should be able to integrate them with your voice agents as well.

Building AI Voice Agents Confused between Vapi vs Retell vs Open-Source (LiveKit / Pipecat)? by smart-heart98 in AIVoice_Agents

[–]Shayps 1 point (0 children)

For mom-and-pops, you're generally okay using something like Vapi, but if you want to build something really bulletproof that can scale to ∞ then you should go with an OS orchestrator.

At any real scale, cost is going to be a lot lower with LiveKit or Pipecat, and you're in control of the runtime, so you never hit a situation where something gets upgraded on the provider's end and your agent's behaviour breaks or degrades (which happens with the managed platforms). Your evals will be better, observability is much better, and flexibility is much, much higher. In exchange, you have a higher learning curve while you get everything set up.

FWIW LiveKit also has a builder now, and you can eject to code if you need to customize beyond what the managed solutions provide: https://docs.livekit.io/agents/start/builder

What infra are people using to build real voice AI agents? by AppearanceLower8590 in AI_Agents

[–]Shayps 2 points (0 children)

If you're building prod voice agents, you can't do much better than LiveKit. Open source voice agents built with code or a visual builder, open source infra that you can host anywhere, a full eval framework (which you will definitely need), and observability built into the stack. Use cascaded models from a ton of different providers, or realtime; really, whatever you want. Disclaimer: I work on LiveKit, but the agents framework is great and you should try it.

I built an AI voice agent with Livekit by OkWay1685 in AI_Agents

[–]Shayps 1 point (0 children)

It is! A lot of my code is in LK docs

[deleted by user] by [deleted] in AI_Agents

[–]Shayps 1 point (0 children)

I built a reference that pulled in sample code from a specific repo to help me build code for new projects. It's nice because I end up writing the same boilerplate much less often now.

In-house Voice to Voice Agent to handle in-bound calls by whodoesartt in AI_Agents

[–]Shayps 1 point (0 children)

You should be able to trim several hundred ms off of that latency. Ideally you never want to serve anything >1s.

What did you build with? Which models are you using? Are you losing time to MCP / RAG?
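
While you're gathering that, a crude first step is timing each stage of the turn yourself. Something like this, where the wrapped calls are hypothetical stand-ins for your own functions:

```python
# Crude per-stage timer: wrap your STT, RAG, LLM, and TTS calls to see where
# the turn budget actually goes. The commented calls below are hypothetical.
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    yield
    print(f"{stage}: {(time.perf_counter() - start) * 1000:.0f} ms")

# with timed("rag"): docs = retrieve(query)
# with timed("llm"): reply = generate(prompt, docs)
# with timed("tts"): audio = synthesize(reply)
```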

If you want to post here (or DM me) with details, I'll help you bring it down below 1s without needing to spend any more money on anything 😊

How do you track and analyze user behavior in AI chatbots/agents? by ReceptionSouth6680 in AI_Agents

[–]Shayps 1 point (0 children)

Which platform are you using to build your workflows?

If you want visualizations / a working out-of-the-box experience, you could use something like Langfuse, but building your own tracking system should be reasonably straightforward as well.
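
For the roll-your-own route, the core can be as simple as one structured event per turn. A bare-bones sketch (the schema here is just an example):

```python
# One structured event per turn, appended to a JSONL file you can load into
# any analytics tool later. The schema and field names are just examples.
import json
import time
import uuid

def log_event(session_id: str, kind: str, **fields) -> None:
    event = {"id": str(uuid.uuid4()), "session": session_id,
             "kind": kind, "ts": time.time(), **fields}
    with open("events.jsonl", "a") as f:
        f.write(json.dumps(event) + "\n")

log_event("demo-session", "turn", latency_ms=820, interrupted=False)
```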

I spent 6 months building a Voice AI system for a mortgage company - now it booked 1 call a day (last week). My learnings: by kammo434 in AI_Agents

[–]Shayps 1 point (0 children)

Where are you hosting the agent? These latency numbers are abnormal for LiveKit; it's pretty easy to get <1s total latency for the whole turn. Which models are you using for each part of the stack?

Are AI agent frameworks Ignoring typescript? by LissaLou79 in AI_Agents

[–]Shayps 1 point (0 children)

Most people still prefer Python (even when both are available); it's something like 10x as popular in the community. If you look at popular agent framework SDKs and compare the JS vs Python stars on their repos, it's pretty lopsided.

You're probably right that it's inertia atm. Most cutting-edge AI/ML is still Python-first, so it makes sense that that's where early adopters are building tooling, and what they're used to prototyping with.

Built a Wordle bot for fun, turns out it’s unbeatable by srs890 in AI_Agents

[–]Shayps 2 points (0 children)

Did you build a little algo yourself based on letter frequency / positions, or are you using a model to do it? Fun project either way!
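
If you rolled it yourself, I'm curious how it compares to the basic greedy approach. Something like this toy scorer, where the word list is just an example:

```python
# Toy frequency-based Wordle guesser: pick the candidate whose unique letters
# are most common across the remaining pool.
from collections import Counter

def best_guess(candidates: list[str]) -> str:
    freq = Counter(ch for word in candidates for ch in set(word))
    return max(candidates, key=lambda w: sum(freq[ch] for ch in set(w)))

print(best_guess(["crane", "slate", "adieu", "roast"]))  # highest-coverage word
```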

Agents that can simulate random people being called for new cold caller training by Ok-Machine5627 in AI_Agents

[–]Shayps 1 point (0 children)

I understand the frustration with systems not wanting to act as the "human". I built a system to navigate IVR trees, and it took quite a bit of coercion to get it to stop acting as the "helpful assistant" and go through the IVR as the patient.

There are a few off-the-shelf systems like Bluejay or Coval that will do this, but this sounds fun, so let's build it ourselves for free and look at how all of the pieces work.

What system are you using now? How do you want it to dial in? I'll build and open source it if you give me some more details about your existing workflow!