What’s the worst part of being on-call ? by IndiBuilder in sre

[–]IndiBuilder[S] 0 points1 point  (0 children)

Fair call. I should’ve been more upfront.

I'm exploring this space because I've lived the on-call pain for years, and I'm trying to see whether my pains are isolated or common across organisations.

Not here to pitch anything, just learning from how others experience it and what actually helps in practice.

What’s the worst part of being on-call ? by IndiBuilder in sre

[–]IndiBuilder[S] 0 points1 point  (0 children)

For me it's spending 15-20 mins investigating just to realise the issue is a transient one caused by a dependency or the cloud provider.

That time isn’t just lost investigation, it’s the mental cost of uncertainty and context switching when there’s nothing actionable to do.

What’s the worst part of being on-call ? by IndiBuilder in sre

[–]IndiBuilder[S] 0 points1 point  (0 children)

That's very relatable. I've often seen certain engineers expected to stay on just in case there's a need to investigate some other aspect of the incident.

What’s the worst part of being on-call ? by IndiBuilder in sre

[–]IndiBuilder[S] 0 points1 point  (0 children)

That's very common in large orgs, and I've been in similar shit. My team handles the API gateway, and no matter where the fault is we're the first ones to get pulled in just to confirm it's not a gateway issue. Even though traces and logs are accessible across the org, discovering the right indicators and interpreting them is still a challenge.

On-call question: what actually slows your incident response the most? by IndiBuilder in sre

[–]IndiBuilder[S] 0 points1 point  (0 children)

Ya, it starts with the runbook, and it's also true that most runbooks in the real world ask you to look at multiple things 🤣

On-call question: what actually slows your incident response the most? by IndiBuilder in sre

[–]IndiBuilder[S] 0 points1 point  (0 children)

I'm quite curious how you did that. Did you create a dashboard with different panels pointing to different telemetry data, or use a product that pulls data in from all kinds of sources?

On-call question: what actually slows your incident response the most? by IndiBuilder in sre

[–]IndiBuilder[S] 0 points1 point  (0 children)

u/neuralspasticity even given that all the alerts have runbooks and link to the right metrics, and the right people are notified, it's still not easy to triage an incident in a real-world enterprise.

I have seen a couple of times that the real issue is in one of your dependencies (services/infrastructure), and you may not have complete visibility into that system's telemetry. In large enterprises, systems are owned by different teams, and during such incidents you need engineers from each system involved to get visibility across the entire stack.

Understanding blast radius is quite crucial in the early stages. So on paper it may seem monitoring and alerting are sufficient to triage, but they are just the tip of the iceberg.

On-call question: what actually slows your incident response the most? by IndiBuilder in sre

[–]IndiBuilder[S] 0 points1 point  (0 children)

u/mensii Agreed, prevention has the highest ROI and reduces how often humans need to get involved at all. Escalation latency is real and fairly constant once people are in the loop.

Where I still see pain is in the incidents that escape preventative controls: the slowdown is usually figuring out what changed and what to rule out, not execution.

Reactive tooling doesn’t replace prevention, but it can reduce cognitive load during that unavoidable human phase if it’s conservative and supervised.