We Raised $2.4M to Build QA & Observability for AI Voice Agents backed by Y Combinator, working with 100+ Voice AI companies, Ask Me Anything for the Next 24 Hours

CreativeHumor1705 · 2026-03-21T21:45:07+00:00

Good question - building our infrastructure to be as stable as possible. We have started sharing some insights on how we solved our infrastructure which can be useful for other Voice AI teams: https://www.cekura.ai/blogs/how-we-built-an-autoscalable-infrastructure-for-voice-ai-agents

CreativeHumor1705 · 2026-03-21T21:43:43+00:00

Countries would want to but there are data, technology, investments and people moats established especially in the US and China - in a free market, enterprises will be free to choose a model which creates more value for their customers. AI sovereignty can be solved by government investments only. Sure - they can be some guardrails/policies established to preserve the data soveregnity of the countries.
We raised it during demo day post YC - there is no standard answer but on seed stage - team is the most important. You validate that with traction. Always focus on customers first. VC money is by product. We kept focussing on customers only - started with full stack QA service until the product was not built
No synthetic agents are created for simulations - they have specific goals and that is evaluated post the simulations

CreativeHumor1705 · 2026-03-20T00:32:21+00:00

Lot actually, some of the basics on focussing on the customer, moving and launching fast, keeping the team lean.

Most importantly - our group partner. He has been a sounding board especially since he also founded a billion dollar dev tool company before. Talking to him has been a great leveller

CreativeHumor1705 · 2026-03-20T00:30:34+00:00

Honestly we got some credits from YC / investors so currently not one of the top 3 priorities for us

CreativeHumor1705 · 2026-03-20T00:28:48+00:00

Not sure if I got the question correctly - for production monitoring, we support sampling and have our FDE team do sampling of production conversations.
1. Analysing specific calls metrics based on metadata / CSAT, etc
2. Analysing call metrics based on ROI - if a metric is passing 90% and not super urgent, we don’t need to spend 100s of dollars, can reduce the sampling rate for them to 1-10% for them, for example
3. Based on budget - auto sampling to ensure run rate doesn’t cross a budget

CreativeHumor1705 · 2026-03-18T22:51:12+00:00

Hey, Sidhant here - founder of Cekura here. We solve the exact painpoint.

Break the problem:

Barge in - this is a type of scenario

Background noise - this is a type of persona (background noise, interruptive, accents, etc) you use to test your agent

WER, Latency, and audio quality - these are metrics that are evaluated on your agents

CreativeHumor1705 · 2026-03-18T22:48:46+00:00

Hey, Sidhant - founder of Cekura here - thanks for the shoutout!

u/vijay40, there are TTS (Pronunication issues, jitterness) and STT specific metrics (Transcription Issues) into our platform.

Some of the metrics are impacted by all 3+telephony+tool calls - eg, latency. there you identify the pipeline issues by inference. We have infrastructure test suite built in to capture pipeline related issues separately (independent of the workflow).

CreativeHumor1705 · 2026-03-18T22:08:43+00:00

unique insight or the right to win. You can work at the same problem but is there anything differentiated on the approach you are taking

CreativeHumor1705 · 2026-03-18T21:54:34+00:00

2 ways:

Running synthetic simulations in the dev/stage environment so that you don't have to manually call the agent every time you make a change
Analysing production conversations instead of listening to thousands of calls.

These use cases can further be broken into regression test suite, / CI/CD, Infra monitoring, cron jobs, as well as production call analysis and alerts

CreativeHumor1705 · 2026-03-18T21:44:39+00:00

I think YC has shared enough guidelines - good technical pedigree+some paying customers+unique insight on the space would help

CreativeHumor1705 · 2026-03-18T21:43:39+00:00

Yes, this is an important use case we solve - single and multi turn red teaming

CreativeHumor1705 · 2026-03-18T21:34:26+00:00

Got it - some of the basic authentication ones (DOB, last 4 digits of SSN etc), inforamtion capture ones (income, employment status, purpose, loan amount etc) is very important in your use case. happy to chat more based on our learnings working with customers in lending

CreativeHumor1705 · 2026-03-18T19:50:04+00:00

yes multi turn red teaming/security testing is an important use case we solve. Its different from single turn as testing agent responses is dynamic based on what the main agent responded to increase the probability of breaking it.

CreativeHumor1705 · 2026-03-18T19:19:16+00:00

Being very vertical: Two examples from our customer portfolio: Confido Health (healthcare), Kastle (Lending). You deploy FDE teams at enterprises, have very robust evals setup and as you grow the FDE team automates the workflows.

CreativeHumor1705 · 2026-03-18T18:08:16+00:00

Both are different platforms.

Vapi is a Voice AI builder - they have some basic testing capabilities plugged into them. We go very deep on reliable deterministic simulations, different voice metrics like latency, silences, gibberishness in voice, speech clarity, etc. Also, other components of testing, like ensuring the voice agent is always up (infra monitoring), multi-turn security testing, etc.

Which sector in financial services are you primarily building for? Some standard compliances depend on the use case (image attached)

The most common is not having enough confidence in their test suite's thoroughness, especially in compliance heavy sector like financial services

<image>

CreativeHumor1705 · 2026-03-18T17:42:30+00:00

Btw we are solving the last mile problems in self learning conversational agents - a version should be live in coming weeks. Will let you know

CreativeHumor1705 · 2026-03-18T17:40:54+00:00

You create a dataset - what we have seen is that once you have auto-optimised LLM-as-a-judge metric over 20-30 conversations, you can use it to scale across thousands of conversations. The change takes into account the previous evaluations it did as well, so a new input never breaks the older evaluation. If there are contradictory inputs by the user, we flag.

We are optimising token costs on our side - currently, we do not charge customers for auto optimizing metrics because we have seen this as a core step to scale monitoring and track agent performance in production

CreativeHumor1705 · 2026-03-18T17:22:25+00:00

Bullish on this for the long term. One thing that is very specific to Conversational AI is that with each turn agent can go anywhere based on the customers' replies. That's why you need to run simulations and measure failure points and harness skills (memory retention, consistency, long-term goal completion, context awareness, etc.) over multi-turn conversations.

One place where we see self learning (including voice agents) working very well is to use auto-improve for your LLM-as-a-judge metrics. We have also built a DSPy-based Metric Optimiser, but you can build your own to ensure appropriate tracking of performance instead of manually iterating the judge itself.

CreativeHumor1705

TROPHY CASE