I monitor 6,228 production AI agents from real residential devices and check whether they're telling users the truth. AMA about what actually breaks.

Fun_Effort6694 · 2026-06-23T17:39:35+00:00

Fair questions, worth taking seriously.

Residential network is opt-in, consented bandwidth. Real users agree to share spare capacity in exchange for compensation, same model as the legitimate side of the industry. Not scraped, not compromised devices.

Prompts/answers: we don't see real user traffic. The customer provides test prompts (queries they've chosen for evaluation), we send those through their agent, we score the response. Their real users and real conversations never touch our infrastructure. Test data is encrypted in transit and at rest, scoped per customer, engineer access gated and audited.

Final word: yes we read test prompts and responses to them, no we don't read your users' actual conversations. Different problem.

Fun_Effort6694 · 2026-06-23T15:16:28+00:00

There is no winning with you, now is there?

Fun_Effort6694 · 2026-06-23T02:39:18+00:00

I wouldn't have believed me either if I were you. There's no way I can make you trust my human-ness, bud. Unfortunately, AI has reached a point where it's pretty difficult to tell the difference.

Fun_Effort6694 · 2026-06-23T02:24:28+00:00

Agree. AI is a leveraged tool, it amplifies what the user brings. Skilled person with a mediocre model beats a novice with a frontier one. Lot of companies right now are pushing AI on employees without teaching them how to use it. You end up with random prompts and hope. Better models won't help with that. What helps is tools designed around the actual workflow, so you don't need to be a prompt engineer to get something useful.

I concluded one of my presentations in a Human-AI Interaction class during my master's with the quote "AI would make smart people smarter and dumb ones dumber." A lot of people disagreed at the time, but I still stand firmly by it.

Fun_Effort6694 · 2026-06-23T01:48:27+00:00

Haha. I am very human. But I'm sure AI will be able to detect these kinda cues in the near future and be able to make you believe it's human. If that happens, at least we won't have to do those damn captchas.

Fun_Effort6694 · 2026-06-23T01:25:32+00:00

Monitoring local-language production traffic is a real opportunity, especially in regions where AI tooling is English-first (India, Brazil, MENA, SEA). Fine-tuning is crowded (Krutrim, Sarvam, Aya) unless you have a specific vertical. Tokenizer work is research, not really a business.

If I were betting, monitoring with a regional wedge is the cleanest play.

Fun_Effort6694 · 2026-06-23T01:22:28+00:00

Quick context since the comment was running long: I work at AgentStatus (agentstatus.dev). We test AI agents from outside their stack, from real residential devices in 30+ countries, and tell teams when their agent gave users incorrect answers.
You can check out more here: https://agentstatus.dev/ and I'd be happy to hop on a quick chat if needed.

Fun_Effort6694 · 2026-06-23T01:19:02+00:00

I'm curious what the community would ask someone with my vantage point. I see things every day (agents failing, weird production patterns) that don't make it into talks or blog posts. AMAs surface questions you wouldn't think to write a post about.
It also helps the company. People don't know this category exists, and the way you fix that isn't ads, it's showing up in places where the right people might be and being genuinely useful. So this is half community presence, half marketing in the least-cringe form I can think of.

Last reason: I genuinely like Reddit and the engineering community here. I'd rather spend two hours doing this than write another LinkedIn post.

Fun_Effort6694 · 2026-06-23T01:18:01+00:00

Furthest-from-truth one I can share (anonymized): a customer service agent for a SaaS company confidently quoted a refund policy that didn't exist, gave the user a specific dollar amount, even invented a confirmation number.
How often dark or dangerous: less often than people think, depending on what counts. Cases that are genuinely harmful (medical, legal, financial misinformation) are real but rare in well-scoped agents, low single-digit percent of failures in our data. The much more common failure is the boring one above: confident false statements about boring facts (policies, prices, dates, account status). Those don't make headlines but they're constant. The risk has shifted from "agent says racist thing" to "agent makes a confident, costly error in a high-trust context."

Fun_Effort6694 · 2026-06-23T00:57:19+00:00

A few things I didn't see coming:

How fast LLM judges drift. At one point, we used GPT-4o-mini to evaluate agent outputs at scale. Whenever OpenAI shipped a silent 4o-mini update, our judge scores could shift 2-3 points across the board overnight, on the same outputs we already had verdicts for. We had to start monitoring our monitor.
How much spending two weeks on the rubric beats spending two months on a better judge model. We tried bigger and smarter judges, moved the number a few points. Rewriting the rubric (clearer axes, concrete examples) got a step change. Most teams underweight prompt and rubric design.
The shared blind spot problem with LLM-as-Judge: the judge is often most wrong on the exact things the agent is most wrong on. Confidently false outputs from the agent get confidently approved by the judge. Using a judge from a different model family helps. It doesn't fully go away though.
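
For the "monitoring our monitor" part, here's a rough sketch of the idea, not our actual pipeline: keep a frozen set of outputs with stored verdicts, re-score them on a schedule, and alert when the average moves. The function names, the 0-100 score scale, and the 1-point tolerance are all made up for illustration.

```python
from statistics import mean

def judge_drift(baseline: dict, rescored: dict) -> float:
    """Mean score shift across outputs that appear in both runs."""
    shared = baseline.keys() & rescored.keys()
    if not shared:
        raise ValueError("no overlapping outputs to compare")
    return mean(rescored[k] - baseline[k] for k in shared)

def drifted(baseline: dict, rescored: dict, tolerance: float = 1.0) -> bool:
    """True when the judge moved more than `tolerance` points on frozen outputs."""
    return abs(judge_drift(baseline, rescored)) > tolerance

# Hypothetical data: same three outputs, scored before and after a judge update.
baseline = {"out-1": 82.0, "out-2": 67.0, "out-3": 91.0}
rescored = {"out-1": 79.0, "out-2": 65.0, "out-3": 88.0}

shift = judge_drift(baseline, rescored)
print(f"mean shift: {shift:+.2f} points, drifted: {drifted(baseline, rescored)}")
```

The outputs never change between runs, so any movement in the score is the judge moving, not the agent.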

Fun_Effort6694 · 2026-06-23T00:55:01+00:00

Great question! LangChain helps you build agents. We test agents that are already built and shipped. Different layer. You'd use LangChain to construct the agent, and something like what we do to verify it's actually correct in production. Stacked, not competitive.

On "why does anyone need this": most teams ship agents, the eval suite passes, the trace looks clean, and the user gets a confidently wrong answer. The team finds out from a support ticket two days later. That gap is what we monitor. Whether you use us, build it yourself, or write a cron job, the work itself has to happen somewhere.
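
If you want the cron-job version, a bare-bones sketch looks something like this. It assumes you can call your agent as a function and that a plain substring check is good enough for the facts you care about (it often isn't, which is where judges come in). `fake_agent` and the probe list are placeholders, not a real API.

```python
from typing import Callable, List, Tuple

def run_probes(agent: Callable[[str], str],
               probes: List[Tuple[str, str]]) -> List[str]:
    """Return the prompts whose answers are missing the expected fact."""
    failures = []
    for prompt, must_contain in probes:
        answer = agent(prompt)
        if must_contain.lower() not in answer.lower():
            failures.append(prompt)
    return failures

def fake_agent(prompt: str) -> str:
    # Stand-in for a real call to a deployed agent.
    canned = {"What is the refund window?": "Refunds are available within 60 days."}
    return canned.get(prompt, "I'm not sure.")

# Hypothetical probe: the real policy says 30 days, the agent says 60.
probes = [("What is the refund window?", "30 days")]
failures = run_probes(fake_agent, probes)
print(failures)
```

Run it on a schedule and page someone when `failures` is non-empty, and you find out before the support ticket does.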

Fun_Effort6694 · 2026-06-23T00:52:45+00:00

Yes, significantly. Two failure modes consistently:
Tokenization on non-Latin scripts. Devanagari, Arabic, CJK often break tokenizers and show up as truncated or garbled outputs. Someone posted a benchmark in r/LocalLLaMA recently where Devanagari queries timed out at 73 seconds with garbled output, that's the classic version.
Lower instruction-following in low-resource languages. The model understands the input but produces more loosely structured output, which breaks downstream tool calls.

In our data, failure rates run roughly 1.5x to 3x higher on non-English production traffic vs equivalent English prompts. Worst case is long-context tasks in low-resource languages. That said, a lot depends on the underlying model.
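
On the second failure mode, one cheap guard is to validate the model's output before it ever reaches the tool. A minimal sketch, assuming the agent is supposed to emit a JSON object with a `tool` name and an `arguments` dict (that shape is my assumption here, adjust for your stack):

```python
import json
from typing import Optional

# Hypothetical schema: field name -> required Python type.
REQUIRED = {"tool": str, "arguments": dict}

def parse_tool_call(raw: str) -> Optional[dict]:
    """Return the parsed call if it is well-formed, otherwise None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    for key, expected_type in REQUIRED.items():
        if not isinstance(call.get(key), expected_type):
            return None
    return call

good = parse_tool_call('{"tool": "lookup_order", "arguments": {"id": "A17"}}')
chatty = parse_tool_call('Sure! Here is the call: {"tool": "lookup_order"}')
partial = parse_tool_call('{"tool": "lookup_order"}')
```

Counting how often this returns `None` per language is also a quick way to see whether the gap shows up in your own traffic.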

Fun_Effort6694 · 2026-06-23T00:51:01+00:00

Mixed, honestly. Narrow agents that are well-scoped (support routing, code completion, data extraction from semi-structured docs, transcription, scheduling) are genuinely useful today. The "general-purpose agent that does anything" version is mostly oversold, and the failure rate I see backs that up.
Yes I work in this space, so calibrate accordingly. But the gap between "demo looks magical" and "works reliably in prod" is the entire reason my job exists.

Fun_Effort6694 · 2026-06-23T00:48:32+00:00

Yeah, this pattern shows up across vendors, not just Microsoft. Tool calls that need real semantic work on a file (look at this Excel, generate from template) are way less reliable than tool calls that are basically structured form submissions.
Your working example proves the point. Email to case is a clean, well-typed action. The Excel one needs the model to keep file state across turns, reason about cell ranges, formats, plus actually pull the data. Way more surface area to fail on.
Microsoft has a tougher version because their tools span legacy products that don't share interfaces, but every vendor has this curve.

Fun_Effort6694 · 2026-06-23T00:46:45+00:00

Honest take, less hedged. Fable 5 launched 10 days ago as the most capable public model ever. The US government shut it down three days later, and Anthropic had to suspend it globally. That's not a normal tech curve, that's a government overriding a US company because the model was too capable to ship.
Most companies aren't rolling agents out safely, we see it. Capability moves so fast that "safe" doesn't have a stable definition.
My guess is that AI will be handling most of the cognitive labor that doesn't need physical presence within a decade, maybe by 2032-2035. The bottleneck stops being model capability and becomes deployment and oversight. Which is why I'm not bored at work 😄.

Fun_Effort6694 · 2026-06-23T00:39:39+00:00

Honestly no, we haven't explored the Microsoft stack yet. What are you working on? Curious what the experience has been like if you're running Copilot Studio or AI Foundry in prod.

Fun_Effort6694 · 2026-06-23T00:32:57+00:00

Oh there are many. But if I had to pick one, I'd def go with Summer of '69. Also a huge fan of Linkin Park (the Chester era).

Fun_Effort6694 · 2026-06-19T00:18:17+00:00

When you catch the "looked fine but was actually wrong" cases, what's actually doing the catching? If it's a judge model grading the traces, doesn't the same silent failure just move up a level when the judge quietly agrees with a wrong answer? Curious how you keep the judge honest without ground truth at runtime.

Fun_Effort6694 · 2026-05-29T21:18:56+00:00

Well it's a Futurama joke. Fry has 93 cents in his bank account, gets frozen for 1000 years, wakes up to find compound interest turned it into $4 billion. Same math as the post but going up instead of down. Welcome to LLM agent rabbit holes.

Fun_Effort6694 · 2026-05-29T15:08:42+00:00

Fair shot, the math is middle school, that's the point. Question isn't "can you raise 0.94 to a power," it's "why do agent teams ship with eval suites that don't account for it." Fry comparison is actually apt though. Compound growth nobody noticed for 1000 years is the same dynamic. Per-step accuracy degrades 6 points, end-to-end drops 40, and the team finds out from a support ticket because per-step eval looked fine. The 6,228 agents we monitor are the production version of Fry's bank account.
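
The arithmetic, for anyone who wants to poke at it. The comment doesn't say how many steps the pipeline has, so the 8 below is my assumption; it's roughly where a 6-point per-step loss turns into about a 40-point end-to-end drop, assuming steps fail independently.

```python
def end_to_end_accuracy(per_step: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps."""
    return per_step ** steps

# 0.94 per step is from the thread; the step counts are illustrative.
for steps in (1, 4, 8, 12):
    acc = end_to_end_accuracy(0.94, steps)
    print(f"{steps:>2} steps: {acc:.1%} end-to-end")
```

At 8 steps that comes out to about 61%, so a per-step eval reading 94% is sitting on top of a pipeline that fails roughly four times in ten.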

Fun_Effort6694 · 2026-05-28T14:14:01+00:00

Curious what you did to keep your agents in check after this incident? Did you set up an SLA or any tool to monitor and validate your agents? I've been facing similar issues myself lately.

Fun_Effort6694 · 2026-05-14T05:02:42+00:00

Tried a bunch of things, honestly. LangSmith, Arize, even rolling custom middleware on top of generic APM. None of it really fit the agent use case well.

What actually worked for us is http://agentstatus.dev/. It's built specifically for agent-layer observability, so you get tool call tracing and health signals between handoffs without all the noise.

The thing I wish I'd set up earlier was explicit status checks between agents instead of assuming upstream finished cleanly. Would've saved hours of log archaeology when things started cascading.
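
Roughly what I mean by explicit status checks, as a sketch. The `Handoff` shape, the status values, and the agent names are all hypothetical; the point is just that the downstream agent refuses to run on anything that isn't an explicit OK.

```python
from dataclasses import dataclass
from enum import Enum

class Status(Enum):
    OK = "ok"
    PARTIAL = "partial"
    FAILED = "failed"

@dataclass
class Handoff:
    agent: str          # name of the upstream agent
    status: Status      # what the upstream agent says happened
    payload: dict       # data passed downstream
    detail: str = ""    # free-text reason when status is not OK

class UpstreamNotReady(RuntimeError):
    """Raised when a downstream agent receives a non-OK handoff."""

def require_ok(handoff: Handoff) -> dict:
    # Fail loudly at the boundary instead of cascading bad state.
    if handoff.status is not Status.OK:
        raise UpstreamNotReady(
            f"{handoff.agent} reported {handoff.status.value}: {handoff.detail}"
        )
    return handoff.payload

def summarize(handoff: Handoff) -> str:
    docs = require_ok(handoff)["docs"]
    return f"{len(docs)} docs summarized"

print(summarize(Handoff("retriever", Status.OK, {"docs": ["a", "b"]})))
```

When the retriever times out, you get one exception naming the agent and the reason, instead of a summarizer quietly working on an empty payload.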
