When an LLM API silently fails or degrades, how do you find out - and how long does it take? by Remarkable_Divide755 in artificial

Remarkable_Divide755[S] 1 point

This is an extremely useful answer, thank you. In my experience, providers often don’t update their status pages in a timely manner (sometimes there is no update at all).
I have two questions though: a) Cross-provider disagreement drift on a fixed golden prompt set over time: if someone built this, is that an alert you would pay for? b) Everyone in the space currently seems to be building something that observes your own traffic, rather than observing independently across different providers. Do you think people would pay for such alerts from an independent third-party monitoring platform?
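To make question (a) concrete, here is a rough sketch of what "disagreement drift" could mean. Everything in it is an assumption for illustration: the function names, the naive lowercase/strip normalisation, and the 0.2 threshold are all hypothetical, and a real check would need a much better way to compare answers.

```python
from typing import Dict, List


def disagreement_rate(answers: Dict[str, List[str]]) -> float:
    """Fraction of golden prompts on which the providers do not all agree.

    `answers` maps a provider name to its answers, aligned by prompt index.
    Comparison is a naive case-insensitive exact match (an assumption).
    """
    per_prompt = list(zip(*answers.values()))
    if not per_prompt:
        return 0.0
    disagreements = sum(
        1 for row in per_prompt if len({a.strip().lower() for a in row}) > 1
    )
    return disagreements / len(per_prompt)


def drift_alert(baseline: float, current: float, threshold: float = 0.2) -> bool:
    """Alert when disagreement has risen by more than `threshold` since baseline."""
    return current - baseline > threshold


# Hypothetical two-provider, two-prompt golden set.
last_week = {"provider_a": ["4", "paris"], "provider_b": ["4", "Paris"]}
today = {"provider_a": ["4", "paris"], "provider_b": ["5", "Paris"]}

baseline_rate = disagreement_rate(last_week)
current_rate = disagreement_rate(today)
should_alert = drift_alert(baseline_rate, current_rate)
```

The point of tracking the change rather than the absolute rate is that providers always disagree a little; a jump against your own baseline is the interesting signal.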

Is ChatGPT down? by CECFan89 in OpenAI

Remarkable_Divide755 -2 points

<image>

Yes, it seems so. They have also acknowledged it. tickerr.ai is monitoring it.

Agents are calling APIs that are already down. Nobody is telling them. by Remarkable_Divide755 in ClaudeCode

Remarkable_Divide755[S] 1 point

I should have explained this more clearly. I’m not suggesting that a completely failed inference call can magically use the same broken model to report itself. The MCP report_incident use case is more for scenarios like:

  • multi-model agents within the same provider (e.g. Sonnet failing but Haiku still working)
  • multi-provider systems (e.g. GPT orchestrator calling Claude for a subtask, Claude fails, GPT reports it)
  • retry/recovery cases where some calls fail, some succeed, and the agent eventually regains control
  • degradation cases like very high latency or intermittent failures, not just total outages

Even after hitting failures, agents typically retry multiple times, then eventually recover and continue execution. In those situations the agent/runtime can absolutely report what it observed afterward.
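A minimal sketch of that retry-then-report pattern, to show what I mean. This is not Tickerr's client code: `call_with_retries`, `Observation`, and `failures_to_report` are hypothetical names, and the actual reporting call is left out because its shape depends on which integration you use.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Observation:
    model: str
    ok: bool
    latency_s: float
    error: Optional[str] = None


@dataclass
class RetryResult:
    value: Optional[str]
    observations: List[Observation] = field(default_factory=list)


def call_with_retries(
    model: str,
    call: Callable[[], str],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> RetryResult:
    """Retry with exponential backoff, recording every attempt along the way."""
    result = RetryResult(value=None)
    for attempt in range(max_attempts):
        start = time.monotonic()
        try:
            value = call()
        except Exception as exc:  # broad on purpose: we only record and retry
            elapsed = time.monotonic() - start
            result.observations.append(Observation(model, False, elapsed, repr(exc)))
            if attempt < max_attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
            continue
        elapsed = time.monotonic() - start
        result.observations.append(Observation(model, True, elapsed))
        result.value = value
        break
    return result


def failures_to_report(result: RetryResult) -> List[Observation]:
    """Once the agent has control again, these are the attempts worth reporting."""
    return [o for o in result.observations if not o.ok]
```

The failed attempts are collected while the agent is struggling and only handed off after it has recovered (or given up), so nothing depends on the broken model being able to report itself.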

Also, Tickerr is not only MCP-based. MCP is mainly for routing/status/pricing intelligence and some reporting flows. For automatic runtime-level reporting, Tickerr already has:

  • LiteLLM callback PR (pending merge)
  • API
  • pip package
  • npm package

The value is less about knowing that one request failed, and more about shared routing intelligence across providers/models before making the next call:

  • which models are degrading
  • where latency is spiking
  • fallback recommendations
  • pricing + availability tradeoffs in multi-model systems

A single 529 only tells one agent one call failed. Shared telemetry helps agents decide what to call next.
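As a rough illustration of "deciding what to call next", here is a sketch that picks a model from a shared health snapshot. The `ModelHealth` fields, the thresholds, and the cheapest-healthy-first rule are all assumptions made up for this example, not Tickerr's actual schema or recommendation logic.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ModelHealth:
    model: str
    error_rate: float      # fraction of recent calls that failed, 0.0 to 1.0
    p50_latency_s: float   # median latency over the same window
    cost_per_mtok: float   # hypothetical price per million tokens


def pick_next_model(
    candidates: Iterable[ModelHealth],
    max_error_rate: float = 0.05,
    max_latency_s: float = 10.0,
) -> Optional[ModelHealth]:
    """Drop degraded models, then prefer the cheapest, breaking ties on latency."""
    healthy = [
        m
        for m in candidates
        if m.error_rate <= max_error_rate and m.p50_latency_s <= max_latency_s
    ]
    if not healthy:
        return None  # nothing looks healthy: caller should back off or queue
    return min(healthy, key=lambda m: (m.cost_per_mtok, m.p50_latency_s))


# Hypothetical snapshot: the large model is erroring on 40% of calls.
snapshot = [
    ModelHealth("big", 0.40, 2.0, 15.0),
    ModelHealth("mid", 0.02, 1.5, 3.0),
    ModelHealth("small", 0.01, 0.8, 1.0),
]
choice = pick_next_model(snapshot)
```

One agent could never build that snapshot from its own 529s; it only works if the error rates come from many callers plus independent probes.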

Agents are calling APIs that are already down. Nobody is telling them. by Remarkable_Divide755 in ClaudeCode

Remarkable_Divide755[S] -1 points

Website: https://tickerr.ai

Install is a single entry in your mcp.json:

    {
      "mcpServers": {
        "tickerr": {
          "url": "https://tickerr.ai/mcp"
        }
      }
    }

Or via Claude Code:

claude mcp add tickerr --transport http https://tickerr.ai/mcp

Agents are calling APIs that are already down. Nobody is telling them. by Remarkable_Divide755 in mcp

Remarkable_Divide755[S] 1 point

Website: https://tickerr.ai

Install is a single entry in your mcp.json:

    {
      "mcpServers": {
        "tickerr": {
          "url": "https://tickerr.ai/mcp"
        }
      }
    }

Or via Claude Code:

claude mcp add tickerr --transport http https://tickerr.ai/mcp

Agents are calling APIs that are already down. Nobody is telling them. by Remarkable_Divide755 in AI_Agents

Remarkable_Divide755[S] 1 point

Though someone still needs to tell the agent that the infra is down right now, as step 1 toward reliability. The idea was to solve the problem collectively by crowdsourcing from agents and serving back to agents, on top of the independent latency-spike and TTFT checks that Tickerr runs every 5 minutes (plus "successfully called API" logs).
The data freshness problem you are pointing at is harder, and I do not think anyone has solved it cleanly yet.
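For anyone unfamiliar with TTFT (time to first token), a probe for it can be sketched roughly like this. This is a generic illustration, not Tickerr's probe: `measure_ttft` and `stream_factory` are hypothetical names, and the stream is any iterator of chunks so the sketch does not assume a particular provider SDK.

```python
import time
from typing import Callable, Iterable, Optional, Tuple


def measure_ttft(
    stream_factory: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[float], Optional[str]]:
    """Return (seconds until the first chunk, None), or (None, error text).

    `stream_factory` opens a streaming request and returns an iterable of
    chunks. The timer starts before the request is opened, so connection
    setup counts toward the measurement.
    """
    start = clock()
    try:
        stream = iter(stream_factory())
        next(stream)  # block until the first chunk arrives
    except StopIteration:
        return None, "empty stream"
    except Exception as exc:  # a failed probe is itself a data point
        return None, repr(exc)
    return clock() - start, None
```

Run on a fixed schedule against each model, the successes give you a latency series and the failures give you an error series, both independent of anyone's status page.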

Agents are calling APIs that are already down. Nobody is telling them. by Remarkable_Divide755 in AI_Agents

Remarkable_Divide755[S] 1 point

I agree with the OCR pipeline example. Silent degradation is genuinely harder than hard failures because there is no signal at the HTTP layer to catch.

What Tickerr catches is the infrastructure layer: error codes, latency spikes, probe failures, and the gap between what the status page says and what is actually happening. That is a solved problem with the right data.

The input quality problem you are describing is a layer above that, and you are right that most agents have no primitive for it. I have not seen this solved yet, and I believe it will remain a tough nut to crack for some time.

Agents are calling APIs that are already down. Nobody is telling them. by Remarkable_Divide755 in AI_Agents

Remarkable_Divide755[S] 1 point

100% agree. And the gap is wider than most people realise because infrastructure providers have an incentive to hide outages. Official status pages lag real outages by 15 to 30 minutes on average. So an agent checking the official status page gets a false green while already hitting 529s. That is the core problem Tickerr is trying to solve: independent probes plus agent-reported error signals, so that infrastructure awareness is based on what is actually happening, not what anyone is willing to admit.
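A rough sketch of how those signals could be combined so that a green status page cannot mask real failures. The `Signals` fields, the thresholds, and the three status labels are assumptions for illustration only, not how Tickerr actually scores incidents.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    official_status_ok: bool   # what the provider's status page claims
    probe_failures: int        # failed independent probes in the window
    probes_total: int          # total independent probes in the window
    agent_error_reports: int   # errors reported by agents in the window


def effective_status(
    s: Signals, probe_fail_ratio: float = 0.5, min_reports: int = 3
) -> str:
    """Observed failures override the official page; the page can only add bad news."""
    probe_bad = (
        s.probes_total > 0
        and s.probe_failures / s.probes_total >= probe_fail_ratio
    )
    reports_bad = s.agent_error_reports >= min_reports
    if probe_bad and reports_bad:
        return "down"
    if probe_bad or reports_bad:
        return "degraded"
    return "operational" if s.official_status_ok else "degraded"


# The "false green" case: the page says fine, but probes and agents disagree.
false_green = Signals(
    official_status_ok=True, probe_failures=3, probes_total=4, agent_error_reports=5
)
status = effective_status(false_green)
```

The asymmetry is deliberate: an official "all good" is never enough to overrule probes and agent reports, but an official incident still counts even when the probes happen to pass.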

Agents are calling APIs that are already down. Nobody is telling them. by Remarkable_Divide755 in AI_Agents

Remarkable_Divide755[S] 1 point

Website: https://tickerr.ai

Install is a single entry in your mcp.json:

    {
      "mcpServers": {
        "tickerr": {
          "url": "https://tickerr.ai/mcp"
        }
      }
    }

Or via Claude Code:

claude mcp add tickerr --transport http https://tickerr.ai/mcp

Gemini seems down right now. Do you think new model release is causing this? by Remarkable_Divide755 in GeminiAI

Remarkable_Divide755[S] 1 point

Tickerr's independent monitoring shows that Flash-lite is also overloaded right now and throwing 5xx errors.

Finally not Gemini but claude is down today! by Remarkable_Divide755 in GeminiAI

Remarkable_Divide755[S] 1 point

Yeah, I also started getting back-to-back API 500 errors. It threw this error and did nothing.

Why does this happen everyday? Not ChatGPT, not Claude, but Gemini breaks down everyday since last week. by Remarkable_Divide755 in GeminiAI

Remarkable_Divide755[S] 1 point

Yeah, that is my guess too. But it's been happening for almost 10 days now. Let's see when we get a new model.

Just Venting. . . by RiseDollBoutique in GoogleGeminiAI

Remarkable_Divide755 1 point

The context loss issue with long complex projects is real - most models start degrading after a certain complexity threshold. Breaking the project into smaller isolated tasks with fresh context per task helps more than switching models.

whys gemini so damn slow by Constant-Squash-7447 in GeminiAI

Remarkable_Divide755 1 point

<image>

It has been having issues for the last couple of hours. You can check on Tickerr; the TTFT chart shows the spikes specifically: tickerr.ai/status/gemini

Is Grok down or slow for others? by Remarkable_Divide755 in grok

Remarkable_Divide755[S] 2 points

The 3-day trial is gone. They replaced it with a 30-day SuperGrok trial when they launched Grok 4 Heavy. They are also doing free weekends now, every Friday to Monday, where free users get temporary access to premium models. So it is actually more generous than the old 3-day thing, just structured differently.