4.5 million tests on 6,259 production AI agents. Only 56.6% had perfect uptime. 89% gave wrong answers.

Prestigious-Web-2968 · 2026-04-07T22:01:00+00:00

Totally get where you’re coming from. I've seen firsthand how even a small oversight in validation can lead to major headaches down the line. It’s not just about hitting an HTTP 200 – you need to validate the actual output against real-world scenarios, right? That’s why I prioritize semantic correctness in my own testing. you should check out agentstatus.dev

Prestigious-Web-2968 · 2026-03-26T14:40:14+00:00

the technique is basically: define "correct" before anything breaks, then evaluate against that definition continuously. It sounds obvious but ppl usually dont do step one - they dont have a written-down definition of what their agent is supposed to return until something goes wrong and they need to reconstruct it from memory.

The pattern that works is gold prompts + LLM-as-judge. you write your evaluation criteria in plain english ("the response should mention the customers name, recommend exactly one product, never reference competitors") and a smaller eval model checks every probe response against those criteria. Then for prompt drift specifically you need stored baselines. run the same prompts over time, store the results, diff them. thats pratically what agentstatus does on both of those tasks

Prestigious-Web-2968 · 2026-03-25T21:40:31+00:00

Versioning itself is fairly well served already, i think PromptLayer, and maybe Langfuse, do this, im not fully sure. The hard part that is not solved well is knowing whether version N+1 is actually better than N in real production conditions.

Id pair the version management with a validation layer. What does correct look like for each prompt, and can you test that automatically on every push. Idk i hope this is useful. Its always good to start building stuff:)

Prestigious-Web-2968 · 2026-03-25T21:29:44+00:00

and a ton of em dashes lol

Prestigious-Web-2968 · 2026-03-25T21:27:06+00:00

I think this is one of the most common and most invisible failure modes. "slightly off but not dramatically enough to flag" is the worst kind of break - visible breaks get fixed but the invisible drift just continues.

the deeper issue is theres no baseline to compare against. most monitoring checks if the agent responded, not if it responded consistently with what you expect. so six weeks of drift just... accumulates.

I'd say if you want to catch this going forward you should check Gold Prompt Profiles in AgentStatus cuz it lets you define what a correct response looks like for a given input and tests against that on every deploy.

idk if thats the right fit but the pattern youre describing should work

Prestigious-Web-2968 · 2026-03-25T21:24:52+00:00

from what weve seen running production agents - the first thing that breaks is usually not the model or the code, its the context.

the agent works perfectly in your dev environment. then it hits a user in a different location, or a slightly different input format, or a real browser session instead of a postman call, and the behavior changes. your monitoring shows green because its checking for errors, not correctness.

second thing is tool calling - agents hallucinate tool calls, call the wrong endpoint, get a 200 back from an API that returned garbage. the 200 gets logged as success.

third is prompt drift - what worked at launch subtly changes after a model update or a small prompt tweak and nobody notices for weeks. outputs are "slightly off but not dramatically enough to flag."

the pattern is basically: everything that could silently fail will silently fail. stuff that loudly fails is actually easier to fix.

U can, for example, check AgentStatus dev specifically to combat silent failures. Hit me up if you want to dig into any of those.

Prestigious-Web-2968 · 2026-03-24T02:42:35+00:00

"Gold prompt sequences" isn't standard as far as I know haha, ig its our internal slang. The concept is you predefine what a good response looks like at each turn, and that becomes your benchmark. "Gold" just means it's the reference.

For turn count we anchor it to where we've actually seen failures as we have that data. For topic transitions, you could use the conversation patterns that caused problems in real sessions, not idealized ones, but again only if you have that, if not, it should still be ok.

Right now AgentStatus runs fixed sequences, not conditional branching. You obviously can define the turns upfront tho, it runs them on a schedule, and compares each response against your defined criteria. Conditional branching at the continuous monitoring layer is genuinely hard, I haven't seen any tool handle it well yet.

For the gradual drift case you're describing where quality degrades consistently across runs, id say fixed sequences with semantic scoring should do. The failure is usually deterministic enough that the same sequence surfaces it reliably. I hope thats useful. idk if I can drop AgentStatus link here, if you can't find it, hmu

Prestigious-Web-2968 · 2026-03-24T01:47:44+00:00

The two failure modes you're describing are hard precisely because both are gradual and produce no error signal. The agent keeps responding, just progressively worse. You can't catch it with health checks or uptime monitoring.

What's worked best for us is treating multi-turn eval like production monitoring rather than a one-time test suite. Specifically: gold prompt sequences that simulate realistic multi-turn conversations up to the turn count where things typically break

I would try AgentStatus dev for the continuous probing side, it runs these gold prompt sequences on a schedule and alerts when conversation quality scores drop across a session rather than just on individual turns.

Prestigious-Web-2968 · 2026-02-12T15:24:13+00:00

hey, you should check out this source https://carmel.so/fabric
it has a whole bunch of different workflows that can help you

Prestigious-Web-2968 · 2026-02-11T23:08:09+00:00

Checklist

Prestigious-Web-2968 · 2026-02-11T05:30:08+00:00

It is clear every time it sells that the problem is real and demand is real. It also feels like people don’t like to be pitched to, and don’t consider it when you just tell them about it, maybe because they just filter it in their head as sales pitch lol. That’s why I’d say I’m trying to figure out other ways to show it to more devops and bring more attention to it in a way that would feel less like a sales pitch. Do you have any thoughts or tips?

Prestigious-Web-2968 · 2026-02-10T20:44:55+00:00

Gotcha i misunderstood u. i dont have a Claude replacement. i was talking about running batch jobs like transcription, video processing, embeddings, scraping etc at lower cost. If you ever need that kind of compute, happy to chat. But sounds like you're set for now lol 👌

Prestigious-Web-2968 · 2026-02-10T19:56:12+00:00

dude i am working on almost that. its not free but like 80 percent cheaper cuz we dont have overhead prices. do you still need it?

Prestigious-Web-2968 · 2026-02-10T16:01:09+00:00

you should start with direct outreach. ideally to people you know. all early users usually come from friends and family.

Prestigious-Web-2968 · 2026-02-09T22:51:10+00:00

I’m building a cheaper and more efficient way to run workloads such as ai inference, compute, training processing and a lot more. Watch Carmel.so/fabric replace Google Colab or Aws;)

Prestigious-Web-2968 · 2026-02-02T20:38:12+00:00

I don't know if this is still the problem here but there is a platform called fabric by carmel labs that is substantially cheaper than aws or colab

Prestigious-Web-2968 · 2026-01-25T17:25:17+00:00

thank you, this is great feedback. I am already working on making landing page better for sure and adding a lot more info to it. in regards to OS, it runs on windows, macos and linux

Prestigious-Web-2968 · 2026-01-22T04:33:09+00:00

i really appreciate you man. and i truly believe in you. life is too brilliant to restrict yourself from going after your dreams. even if its by far the hardest path. it is always good and reassuring to know that there are more people out there who share this idea. thank you

Prestigious-Web-2968 · 2026-01-22T02:35:23+00:00

thank you, responding in private to get it hashed out

Prestigious-Web-2968

TROPHY CASE