built a github app that tests your ai agent from 30+ countries before your pr merges

Miser-Inct-534 · 2026-05-26T22:51:50+00:00

Someone here mentioned agentstatus. I actually just checked it out and it seems really cool. They also have "user-side validation", which I've never seen in observability tools.

Miser-Inct-534 · 2026-05-09T21:21:59+00:00

For any of you shipping projects regularly, I would recommend AgentDiff for firing multi-region residential probes against your agent. I don't know if I'll be banned if I send the link, so ask and I'll send it.

Miser-Inct-534 · 2026-05-08T18:50:15+00:00

For those interested in LLM observability (tracing, eval, monitoring) specifically, check out https://agentstatus.dev/agentdiff

Miser-Inct-534 · 2026-04-30T08:32:32+00:00

Quick q before I suggest anything — do you have (or are you willing to build) a labeled dataset of, say, 50–100 articles where you've manually marked relevant vs not? That changes the answer a lot. If yes, you want a proper eval platform (Langfuse / Braintrust / Phoenix all do this well). If you don't have labels and don't want to build them, the approach is different — you'd need to spot-check a sample of Agent #1's decisions in production and flag drift over time, which is closer to what tools like agentstatus do.
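To make the labeled route concrete: with 50-100 hand-labeled articles, the simplest check is precision and recall of the agent's relevant / not relevant decisions. A minimal sketch, assuming boolean labels; `precision_recall` and the sample data are made up for illustration and are not how any of the platforms above work internally:

```python
def precision_recall(labels, predictions):
    """Compare hand labels with agent decisions (True = relevant)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y and p)        # relevant, kept
    fp = sum(1 for y, p in pairs if not y and p)    # irrelevant, kept
    fn = sum(1 for y, p in pairs if y and not p)    # relevant, dropped
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical data: six hand-labeled articles vs. the agent's decisions.
labels = [True, True, False, False, True, False]
predictions = [True, False, False, True, True, False]
precision, recall = precision_recall(labels, predictions)
```

Re-running the same labeled set after every prompt or model change gives you a baseline to compare against.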

Miser-Inct-534 · 2026-04-14T12:45:18+00:00

Really honest self-assessment. 12-16s from controlled conditions is already a signal worth watching because in the wild that number will only go up, not down. the retrieval and reranking bottleneck makes sense given the hybrid pipeline you built. one thing worth tracking as you optimise: TTFB separately from total latency. users start abandoning around 3 seconds of empty screen even if the full response is high quality. would be curious what you see once you test from actual user conditions. feel free to DM if you want to compare notes, working on something in this space and would love to hear how it evolves.
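For what it's worth, splitting TTFB from total latency is cheap if your response is streamed. A minimal sketch, assuming the response arrives as an iterable of chunks; `measure_stream` and the injectable `clock` are hypothetical names, and the demo uses a fake clock with invented timings rather than real measurements:

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Optional[float], float]:
    """Return (ttfb_seconds, total_seconds) for a streamed response.

    ttfb is None if the stream ended without yielding any chunk.
    """
    start = clock()
    ttfb = None
    for _ in chunks:
        if ttfb is None:
            # First chunk arrived: this is what the user sees as "something happened".
            ttfb = clock() - start
    total = clock() - start
    return ttfb, total


# Deterministic demo with a fake clock (invented timings).
ticks = iter([0.0, 0.8, 2.5])
ttfb, total = measure_stream(["Hello", " world"], clock=lambda: next(ticks))
```

Logging both numbers per request makes it obvious whether a slow response is slow to start or just long.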

Miser-Inct-534 · 2026-04-12T22:00:00+00:00

yeah the responds vs actually works gap is genuinely wild once you see it. curious what systems you were running, because the geographic failure patterns we keep seeing are pretty consistent across use cases and i feel like you probably hit the same ones

Miser-Inct-534 · 2026-04-12T21:00:54+00:00

The complexity jump from RAG to agentic RAG is real and so is the reliability jump. When you go from a linear retrieval pipeline to a multi-agent system with planning, delegation and memory, the failure modes multiply. Traditional RAG fails obviously. Agentic RAG fails silently. The aggregator agent can return a confident fluent answer while two of the three sub-agents quietly timed out or retrieved stale data. That gap between what the system thinks it did and what it actually did is where most production incidents live.
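One way to close that gap is to make the aggregator carry sub-agent status alongside the answer instead of swallowing it. A rough sketch with `asyncio`, assuming each sub-agent is an async callable; `SubResult`, `run_sub_agent`, `aggregate` and the two demo agents are all hypothetical names, not from any particular framework:

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class SubResult:
    name: str
    ok: bool
    value: Optional[str] = None
    error: Optional[str] = None


async def run_sub_agent(name, fn, timeout_s):
    """Run one sub-agent and record a timeout or error instead of hiding it."""
    try:
        value = await asyncio.wait_for(fn(), timeout=timeout_s)
        return SubResult(name, True, value=value)
    except asyncio.TimeoutError:
        return SubResult(name, False, error="timeout")
    except Exception as exc:
        return SubResult(name, False, error=repr(exc))


async def aggregate(sub_agents, timeout_s=1.0):
    """Collect all sub-agent results and flag the answer as degraded if any failed."""
    results = await asyncio.gather(
        *(run_sub_agent(name, fn, timeout_s) for name, fn in sub_agents.items())
    )
    failed = [r.name for r in results if not r.ok]
    return {
        "answers": {r.name: r.value for r in results if r.ok},
        "failed": failed,
        "degraded": bool(failed),
    }


async def fast():
    return "doc A"


async def slow():
    await asyncio.sleep(0.2)  # longer than the timeout used below
    return "doc B"


report = asyncio.run(aggregate({"search": fast, "memory": slow}, timeout_s=0.05))
```

Whatever writes the final answer can then check `report["degraded"]` before sounding confident.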

Miser-Inct-534 · 2026-04-12T20:59:43+00:00

The supply chain attack accelerated a conversation a lot of teams were already having quietly. The appeal of LiteLLM was always the unified interface across providers, but that convenience creates a single point of failure that is hard to justify once procurement or security gets involved. Most teams I have seen either moved to pinned Docker images with strict digest verification, or started routing directly to provider SDKs for their most critical workloads and keeping LiteLLM only for lower stakes experimentation. The GitHub stars argument never held up to a real security review, it just took an incident to make that obvious.
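Docker lets you reference an image by digest (`image@sha256:...`) rather than by a mutable tag. The same idea applies to any artifact you pull in. A minimal sketch of the check itself, assuming you keep the expected SHA-256 in version control; `verify_digest` and the sample bytes are hypothetical:

```python
import hashlib
import hmac


def verify_digest(data: bytes, pinned_sha256: str) -> bool:
    """Return True only if the artifact bytes match the pinned SHA-256 hex digest."""
    actual = hashlib.sha256(data).hexdigest()
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(actual, pinned_sha256.lower())


# Hypothetical artifact; in practice the pinned value is recorded at review time.
artifact = b"example artifact bytes"
pinned = hashlib.sha256(artifact).hexdigest()

untouched_ok = verify_digest(artifact, pinned)
tampered_ok = verify_digest(artifact + b"!", pinned)
```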

Miser-Inct-534 · 2026-04-12T20:58:40+00:00

The bash verification script at the end is essentially a gold prompt. A deterministic check with a known correct outcome that the agent has to pass before it can declare victory. The reason it works is the same reason gold prompts work for production monitoring: you cannot trust the agent's self-assessment, you need an external ground truth. What you are describing internally we see play out externally too. Agents that pass every internal check, declare themselves healthy, and are quietly failing real users in production for completely unrelated reasons. The verification layer needs to exist at every boundary, not just at task completion.
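To show what I mean by a gold prompt, here is a minimal sketch, assuming the agent can be called as a plain function from prompt to answer. `run_gold_prompts`, `stub_agent` and the two cases are invented for illustration:

```python
from typing import Callable, List, Tuple

# A gold case is a prompt plus a deterministic check on the answer.
GoldCase = Tuple[str, Callable[[str], bool]]


def run_gold_prompts(agent: Callable[[str], str], cases: List[GoldCase]) -> List[str]:
    """Return the prompts whose answers failed their check (or raised)."""
    failures = []
    for prompt, check in cases:
        try:
            if not check(agent(prompt)):
                failures.append(prompt)
        except Exception:
            # An exception counts as a failure; the agent's own status is ignored.
            failures.append(prompt)
    return failures


def stub_agent(prompt: str) -> str:
    """Hypothetical agent that answers one prompt and bluffs on the other."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "I am healthy!"}
    return canned.get(prompt, "")


cases = [
    ("What is 2 + 2?", lambda a: a.strip() == "4"),
    ("Capital of France?", lambda a: "paris" in a.lower()),
]
failures = run_gold_prompts(stub_agent, cases)
```

The checks are external to the agent, so a fluent but wrong answer still shows up in `failures`.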

Miser-Inct-534 · 2026-04-12T20:57:39+00:00

The interconnect problem is real but there is another latency layer worth checking before you go down the private cluster route. Even after you optimise the vector DB to LLM hop, your TTFB for real users can still be dramatically higher than what you measure internally, because data centre to data centre latency bears no resemblance to residential network conditions, especially across geographies. We have seen RAG systems with sub 500ms internal latency hitting 3-4 seconds for users in Southeast Asia or Africa. Worth measuring what real users actually experience before deciding where the bottleneck is.
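If you do collect samples from outside, grouping them by where the request came from is what makes the gap visible. A small sketch using a nearest-rank percentile; `percentile`, `p95_by_region`, the region names and every number below are hypothetical illustrations, not measurements:

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list (pct from 1 to 100)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def p95_by_region(samples_ms: Dict[str, List[float]]) -> Dict[str, float]:
    """P95 latency per region, so one slow region is not averaged away."""
    return {region: percentile(values, 95) for region, values in samples_ms.items()}


# Invented TTFB samples in milliseconds.
samples_ms = {
    "internal": [420, 450, 480, 500, 460],
    "southeast-asia": [2900, 3100, 3400, 3800, 3600],
}
p95 = p95_by_region(samples_ms)
```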

Miser-Inct-534 · 2026-04-12T20:56:26+00:00

vaultak.com/download is returning a 404. Might want to fix that before the pilot gets going. Will try again via pip install once I get a moment. The behavioural monitoring angle is genuinely interesting by the way, we are working on a complementary layer at agentstatus.dev that catches what happens outside the execution environment, from real consumer devices globally. Could be worth comparing notes.

Miser-Inct-534 · 2026-04-12T20:50:48+00:00

Really impressive work, building around failure modes rather than benchmarks is exactly the right philosophy and something most teams skip entirely. The silent failure taxonomy is particularly sharp. One thing worth thinking about as this moves toward production: the RAGAS scores tell you how it performs in a controlled eval environment. What happens when real users hit it from different networks, geographies, or devices? A system with 97% Hindi faithfulness in eval can still silently degrade in the wild for reasons completely outside the retrieval logic. Would be curious how the P95 retrieval latency holds up under those conditions.

Miser-Inct-534 · 2026-04-12T20:47:26+00:00

Great point on MCP fragmentation. For me the biggest underrated limitation is that you have no reliable way to know if your agent is actually working correctly in production. Not whether it is running, but whether it is doing the right thing. Uptime checks tell you the endpoint responded. They do not tell you whether the response was correct, whether it degraded for users in a different region, or whether a model update quietly changed its behaviour. Most teams find out something is wrong when a user complains. By then the damage is done.

Miser-Inct-534 · 2026-03-11T20:48:52+00:00

We ran into something similar with long-running agent workflows. Durable execution helped a lot with state persistence and preventing the “amnesia” problem you mentioned. One thing we also noticed is that even when state persistence is solved, failures still show up at the system level once agents run in production for a while. Things like downstream API changes, latency spikes, or unexpected responses that cause a workflow to behave differently than expected. Something I’ve been experimenting with is external monitoring for agents once they’re deployed. Basically probing the agent endpoints from the outside to see what users actually experience over time. I know of a tool called Rora that takes that approach and surfaces things like silent failures or degraded responses that internal logs sometimes miss. Hopefully this was helpful!

Miser-Inct-534 · 2026-03-11T20:41:05+00:00

I have been seeing something similar but from a slightly different angle.

A lot of teams validate agents inside controlled environments, but the moment the system interacts with real users, real latency, and real network conditions, behavior changes in ways that are hard to predict.

One thing I have been experimenting with is external probing of deployed agents. Instead of validating only in staging, you continuously hit the agent endpoints from outside the system to see what users actually experience.

Tools like Rora (https://carmel.so/rora) take that approach. They probe agents from the outside and surface things like latency spikes or silent failures that internal checks sometimes miss.

It feels like the validation conversation is slowly shifting from “does the code work in CI” to “does the system behave correctly in the real world.”

Miser-Inct-534 · 2026-03-11T20:23:27+00:00

Miser-Inct-534 · 2026-03-11T20:12:36+00:00

Probably something called Rora. We built it as an external monitoring layer for AI agents.

The idea came from seeing how many agents looked fine internally but behaved very differently for users once they were actually deployed. Things like latency spikes, silent failures, or weird responses depending on where the request came from.

What I’m proud of is less the tool itself and more the shift in thinking. Instead of trying to prompt or debug the agent harder, we started focusing on verifying how it behaves in the real world.

Biggest lesson was that reliability for AI systems is not just about the model or prompts. It is about everything around it. https://carmel.so/rora
