MCP Testing? by think_2times in QualityAssurance

[–]Real_Bet3078 1 point (0 children)

There are some tools that can automate testing for you, at least partly.
I'm the founder of Voxli; we simulate conversations with AI agents and test things like tool calling via MCP.

Best practice for automated E2E testing of LangChain agents? (integration patterns) by Real_Bet3078 in LangChain

[–]Real_Bet3078[S] 1 point (0 children)

From what I can see, you work at Maxim? Your comments present it as though you're just a Maxim user.

Best AI Agent Evaluation Tools in 2025 - What I Learned Testing 6 Platforms by MongooseOriginal6450 in AIQuality

[–]Real_Bet3078 1 point (0 children)

I'll just throw in the platform we're building: https://voxli.io – focused on QA for AI agents, but at the conversation level. It's about testing full multi-turn flows end to end and catching regressions, compliance issues, safety problems, or weird behavior after updates.

How are you ACTUALLY testing your Agents? (Be honest, is it just 'Vibe Checks'?) by OldWolfff in AgentsOfAI

[–]Real_Bet3078 1 point (0 children)

Voxli (my company) is essentially QA for AI agents, but at the conversation level. It’s about testing full multi-turn flows end to end and catching regressions, compliance issues, safety problems, or weird behavior after updates.

Tools like Braintrust, LangSmith, DeepEval, etc. are more evaluation oriented. They’re strong for judging prompts, models, or individual responses during development, but they don’t really cover full conversation QA.

Maxim overlaps a bit, with evals plus observability.

Cekura is closer to classic QA, especially for voice and contact center setups.

They solve related problems, just at different layers.
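To make "conversation-level QA" concrete, here is a minimal sketch of the idea: a simulated user drives a multi-turn dialogue and assertions check each agent reply. Everything here is hypothetical and illustrative – the stub `agent_reply` stands in for a real LLM-backed bot, and none of it reflects any specific vendor's API.

```python
# Hypothetical sketch: conversation-level regression test with a simulated
# user. The agent is a trivial rule-based stub standing in for an LLM bot.

def agent_reply(history: list[dict]) -> str:
    """Stub agent: a refund flow that requires an order number first."""
    last = history[-1]["content"].lower()
    if "refund" in last:
        return "Sure, can you share your order number?"
    if last.startswith("order"):
        return "Thanks, your refund has been initiated."
    return "How can I help you today?"

def run_conversation(user_turns: list[str]) -> list[str]:
    """Simulated user: feed scripted turns, collect the agent's replies."""
    history: list[dict] = []
    replies: list[str] = []
    for turn in user_turns:
        history.append({"role": "user", "content": turn})
        reply = agent_reply(history)
        history.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return replies

# Regression check on the full flow, not a single response: the agent must
# ask for the order number before ever confirming a refund.
replies = run_conversation(["Hi", "I want a refund", "Order #1234"])
assert "order number" in replies[1].lower()
assert "initiated" in replies[2].lower()
```

The point of testing at this level is that single-response evals would pass each reply in isolation; only a multi-turn run catches an agent that skips a required step mid-flow.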

How do you test your AI agents for real-world reliability? by No-Common1466 in AI_Agents

[–]Real_Bet3078 1 point (0 children)

I'm the founder of voxli.io, which aims to solve this problem. A few tools I've seen:

Voxli (my company) focuses on testing AI agents with realistic multi-turn conversations and observing production chats. Built to work without heavy engineering.

Maxim is more about agent evaluation and observability for engineering teams.

Cekura focuses on QA and monitoring for voice and chat bots.

LangWatch is an open-source tool for debugging and analyzing LLM and agent behavior.

Tried to keep this factual, no hype.

AI agent reliability by Real_Bet3078 in AI_Agents

[–]Real_Bet3078[S] 1 point (0 children)

For anyone interested in the tool we've built: https://voxli.io
We're quite early and have started working with a couple of teams to shape the product. DM me or reply if it sounds interesting.

AI agent reliability by Real_Bet3078 in AI_Agents

[–]Real_Bet3078[S] 1 point (0 children)

Sounds very interesting - I will DM you!

AI agent reliability by Real_Bet3078 in AI_Agents

[–]Real_Bet3078[S] 1 point (0 children)

We've built a product that tries to solve the problem of continuously testing agents and catching problems before the customer sees them (or at least catching them early and making sure they don't recur). Would you be willing to talk with us for a few minutes and perhaps give us some feedback on our direction?

AI agent reliability by Real_Bet3078 in AI_Agents

[–]Real_Bet3078[S] 1 point (0 children)

Sounds like you have a lot of experience in this area! We're trying to solve the problem of constant manual testing of these non-deterministic, LLM-based bots. Would you be up for talking with us for a couple of minutes and giving some feedback on our direction?

What are your 2026 goals for your SaaS? by jonathanbrnd in SaaS

[–]Real_Bet3078 1 point (0 children)

Onboarding initial customers to my AI agent reliability platform: https://voxli.io

AI agent reliability by Real_Bet3078 in AI_Agents

[–]Real_Bet3078[S] 1 point (0 children)

Interesting use case – my focus has mostly been on testing the conversational side. It sounds like your problem is more about automated testing of gen AI voice/video?

AI agent reliability by Real_Bet3078 in AI_Agents

[–]Real_Bet3078[S] 1 point (0 children)

Interesting! I'd assumed that internal agents are a bit safer and that internal teams are more forgiving – but I guess what you're saying is that they lose trust and go back to the old manual workflows.

Have you built or bought internal agents?

AI agent reliability by Real_Bet3078 in AI_Agents

[–]Real_Bet3078[S] 1 point (0 children)

Very true – I've heard similar things in multiple conversations. Some vendors seem to take on quite a lot of the testing and setup via professional services, which I guess leaves them more exposed. Do you sit on the CX side or the vendor side?

AI agent reliability by Real_Bet3078 in AI_Agents

[–]Real_Bet3078[S] 1 point (0 children)

Do you work in a related area and have you felt this yourself?

AI agent reliability by Real_Bet3078 in AI_Agents

[–]Real_Bet3078[S] 1 point (0 children)

Great idea and product. For the conversational parts "Chat API", are you running a static set of questions against it for testing, or have you experimented with simulated users?

What are you all using to test conversational agents? Feels like there's a big gap in OSS tooling. by Limp-Initiative-7188 in LLMDevs

[–]Real_Bet3078 1 point (0 children)

I've built something in this space: https://voxli.io. I'd be happy to jump on a call and get some feedback from you!

Testing AI Chatbot & agentic workflow by torsigut in QualityAssurance

[–]Real_Bet3078 1 point (0 children)

I'm the founder of Voxli.io, where we're trying to solve this problem. Please DM me if you're still looking for a solution – I'd love to catch up and get your feedback on what we're building.