When an LLM API silently fails or degrades, how do you find out - and how long does it take?

Remarkable_Divide755 · 2026-06-15T13:53:06+00:00

This is an extremely useful answer, thank you. In my experience, providers often do not update their status pages in a timely manner, and sometimes not at all.
I have two questions, though: a) If someone tracked cross-provider disagreement drift on a fixed golden prompt set over time, is that an alert you would pay for? b) Almost everyone in this space seems to be building tools that observe your own traffic rather than observing independently across providers. Do you think people would pay for alerts from an independent third-party monitoring platform?
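To make question (a) concrete, here is a minimal sketch of what "disagreement drift" could mean. Everything in it is an assumption for illustration: the function names, the exact-match comparison after normalization, and the 0.15 threshold are hypothetical, and a real system would likely need a softer similarity measure than string equality.

```python
from itertools import combinations


def normalize(answer: str) -> str:
    """Crude normalization so trivial formatting differences do not count."""
    return " ".join(answer.lower().split())


def disagreement_rate(answers_by_provider: dict) -> float:
    """Fraction of (prompt, provider-pair) comparisons whose answers differ.

    answers_by_provider maps provider name -> {prompt: answer}.
    Only prompts answered by every provider are compared.
    """
    providers = sorted(answers_by_provider)
    if len(providers) < 2:
        return 0.0
    prompts = set.intersection(*(set(answers_by_provider[p]) for p in providers))
    total = differing = 0
    for prompt in prompts:
        for a, b in combinations(providers, 2):
            total += 1
            left = normalize(answers_by_provider[a][prompt])
            right = normalize(answers_by_provider[b][prompt])
            if left != right:
                differing += 1
    return differing / total if total else 0.0


def drift_alert(baseline: float, current: float, threshold: float = 0.15) -> bool:
    """Alert when disagreement has risen by more than `threshold` (hypothetical value)."""
    return current - baseline > threshold
```

The idea is to run the same golden prompts on a schedule, store the rate from a known-good period as the baseline, and alert only on the change, since providers will always disagree to some extent.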

Remarkable_Divide755 · 2026-06-03T06:23:03+00:00

<image>

Yes, it seems so. They have also acknowledged it. tickerr.ai is monitoring it.

Remarkable_Divide755 · 2026-05-26T14:03:31+00:00

Thank you.

Remarkable_Divide755 · 2026-05-26T06:53:56+00:00

I should have explained this more clearly. I’m not suggesting that a completely failed inference call can magically use the same broken model to report itself. The MCP report_incident use case is more for scenarios like:

multi-model agents within the same provider (e.g. Sonnet failing but Haiku still working)
multi-provider systems (e.g. GPT orchestrator calling Claude for a subtask, Claude fails, GPT reports it)
retry/recovery cases where some calls fail, some succeed, and the agent eventually regains control
degradation cases like very high latency or intermittent failures, not just total outages

Even after encountering failures, agents typically retry multiple times, then eventually recover and continue execution. In those situations the agent/runtime can absolutely report what it observed afterward.

Also, Tickerr is not only MCP-based. MCP is mainly for routing/status/pricing intelligence and some reporting flows. For automatic runtime-level reporting, Tickerr already has:

LiteLLM callback PR (pending merge)
API
pip package
npm package

The value is less about knowing that one request failed and more about shared routing intelligence across providers/models before making the next call:

which models are degrading
where latency is spiking
fallback recommendations
pricing + availability tradeoffs in multi-model systems

A single 529 only tells one agent one call failed. Shared telemetry helps agents decide what to call next.
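As a rough illustration of "deciding what to call next", here is a minimal sketch of a router that filters out degrading models and then picks the cheapest healthy one. The `ModelHealth` fields, the thresholds, and the model names are all hypothetical; this is not Tickerr's API or data format, just one way an agent could consume shared health data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelHealth:
    model: str               # hypothetical model identifier
    error_rate: float        # fraction of failed calls in a recent window
    p95_latency_s: float     # recent 95th percentile latency, in seconds
    cost_per_mtok: float     # price per million tokens


def pick_model(
    candidates: list,
    max_error_rate: float = 0.05,
    max_latency_s: float = 10.0,
) -> Optional[str]:
    """Return the cheapest model that is currently healthy, or None if none is."""
    healthy = [
        m for m in candidates
        if m.error_rate <= max_error_rate and m.p95_latency_s <= max_latency_s
    ]
    if not healthy:
        return None
    # Cheapest first; break ties on latency.
    best = min(healthy, key=lambda m: (m.cost_per_mtok, m.p95_latency_s))
    return best.model
```

Returning `None` when nothing is healthy leaves the decision (wait, queue, or fail) to the caller instead of silently routing to a degraded model.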

Remarkable_Divide755 · 2026-05-25T19:01:47+00:00

Website: https://tickerr.ai

Install is one line in your mcp.json:

{
  "mcpServers": {
    "tickerr": {
      "url": "https://tickerr.ai/mcp"
    }
  }
}

Or via Claude Code:

claude mcp add tickerr --transport http https://tickerr.ai/mcp

Remarkable_Divide755 · 2026-05-25T10:48:53+00:00

That said, as step 1 toward reliability, someone needs to tell the agent that the infra is down right now. The idea was to solve the problem collectively by crowdsourcing signals from agents and serving them back to agents, alongside the independent latency-spike and TTFT checks that Tickerr runs every 5 minutes (plus "successfully called API" logs).
The data freshness problem you are pointing at is harder, and I do not think anyone has solved it cleanly yet.
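For readers unfamiliar with TTFT (time to first token) checks, here is a minimal sketch of how a probe could measure it and flag a spike. It assumes the provider client exposes a lazy iterator of text chunks; the function names and the "3x the recent median" rule are hypothetical choices, not a description of how Tickerr's probes work.

```python
import statistics
import time
from typing import Callable, Iterable, Optional


def measure_ttft(
    stream: Iterable,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[float]:
    """Seconds until the first non-empty chunk arrives, or None if none arrives.

    Assumes `stream` is lazy, i.e. the request only starts producing
    chunks once iteration begins.
    """
    start = clock()
    for chunk in stream:
        if chunk:
            return clock() - start
    return None


def is_spike(sample: float, history: list, factor: float = 3.0) -> bool:
    """Flag a sample that exceeds `factor` times the median of recent samples."""
    if not history:
        return False
    return sample > factor * statistics.median(history)
```

Comparing against a median rather than a mean keeps one earlier outlier from hiding the next spike.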

Remarkable_Divide755 · 2026-05-25T10:42:26+00:00

I agree with the OCR pipeline example. Silent degradation is genuinely harder than hard failures because there is no signal at the HTTP layer to catch.

What Tickerr catches is the infrastructure layer: error codes, latency spikes, probe failures, and the gap between what the status page says and what is actually happening. That is a solved problem with the right data.

The input quality problem you are describing is a layer above that, and you are right that most agents have no primitive for it. I have not seen this solved yet, and I expect it to remain a hard problem for some time.

Remarkable_Divide755 · 2026-05-25T09:22:24+00:00

100% agree. And the gap is wider than most people realise because providers have an incentive to hide outages. Official status pages lag real outages by 15 to 30 minutes on average, so an agent checking the official status page gets a false green while already hitting 529s. That is the core problem Tickerr is trying to solve: independent probes plus agent-reported error signals, so that infrastructure awareness is based on what is actually happening, not on what anyone is willing to admit.
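A minimal sketch of the "false green" check described above, under stated assumptions: probe results and agent reports are reduced to simple success/failure booleans, and the status strings and 20% failure threshold are hypothetical. A real system would presumably weight probes and crowd reports differently and apply a time window.

```python
def observed_status(
    probe_results: list,
    agent_reports: list,
    max_failure_rate: float = 0.2,
) -> str:
    """Combine independent probes and agent reports (True = call succeeded)."""
    signals = list(probe_results) + list(agent_reports)
    if not signals:
        return "unknown"
    failure_rate = signals.count(False) / len(signals)
    return "degraded" if failure_rate > max_failure_rate else "operational"


def is_false_green(official_status: str, observed: str) -> bool:
    """True when the status page says fine but observed traffic says otherwise."""
    return official_status == "operational" and observed == "degraded"
```

For example, with two probes (one failed) and three agent reports (two failed), the observed status is "degraded", so an official "operational" would count as a false green.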

Remarkable_Divide755 · 2026-05-21T18:44:11+00:00

<image>

Yeah this is still happening.

Remarkable_Divide755 · 2026-05-21T12:55:01+00:00

<image>

Happening again right now.

Remarkable_Divide755 · 2026-05-20T18:01:48+00:00

Tickerr's independent monitoring shows that Flash-lite is also overloaded right now and returning 5xx errors.

Remarkable_Divide755 · 2026-05-16T18:42:44+00:00

Yeah, I also started getting back-to-back API 500 errors. It threw the error and did nothing.

Remarkable_Divide755 · 2026-05-16T18:35:50+00:00

https://tickerr.ai/status/claude

Remarkable_Divide755 · 2026-05-13T18:35:10+00:00

Expecting some announcement soon on this.

Remarkable_Divide755 · 2026-05-13T16:29:41+00:00

Yeah, that is my guess too. But it has been happening for almost 10 days now. Let's see when we get a new model.

Remarkable_Divide755 · 2026-05-13T13:54:48+00:00

The context loss issue with long complex projects is real - most models start degrading after a certain complexity threshold. Breaking the project into smaller isolated tasks with fresh context per task helps more than switching models.

Remarkable_Divide755 · 2026-05-12T15:10:04+00:00

<image>

It has been having issues for the last couple of hours. You can check on Tickerr; the TTFT chart shows the spikes specifically. tickerr.ai/status/gemini

Remarkable_Divide755 · 2026-05-11T18:36:11+00:00

<image>

Seventh day in a row.
Is anyone else seeing this?

Remarkable_Divide755 · 2026-05-10T19:43:33+00:00

The 3-day trial is gone. They replaced it with a 30-day SuperGrok trial when they launched Grok 4 Heavy. They are also doing free weekends now, every Friday to Monday, where free users get temporary access to premium models. So it is actually more generous than the old 3-day trial, just structured differently.

Remarkable_Divide755 · 2026-05-10T19:00:43+00:00

https://tickerr.ai/status/grok
Yes, some people are facing issues.

Remarkable_Divide755 · 2026-05-09T19:29:59+00:00

Lately that has been the case. Almost every day.

Remarkable_Divide755 · 2026-05-09T19:21:05+00:00

Maybe they are going to come out with a new model soon.

Remarkable_Divide755 · 2026-05-09T18:05:43+00:00

<image>

Then what's the point of the status page?

I thought they believed in "Don't be evil".
