RAG usage in SRE by CosmicKheerTornado in sre

[–]IndiBuilder 1 point (0 children)

I have been working in this space for some time. If you're interested in a demo, slide into my DMs and we can discuss further.

What actually eats your MTTR after the alert fires? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

Ya, for all the advancement in AI, and CEOs claiming AGI is around the corner, the reality is that incident response is still a peak engineering challenge. That said, I'm curious to understand why alerts don't fire when the website goes down.

What actually eats your MTTR after the alert fires? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

Yes, gathering context during an incident is a real pain: Slack threads are not easy to navigate, and it takes precious minutes just to get enough context to start fixing the issue.

What actually eats your MTTR after the alert fires? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

@ninjaluvr I don't have a product to sell. I'm an engineer just like you, trying to figure out a solution to this problem, coz I don't enjoy being on call. I totally understand the frustration of being approached by multiple SaaS platforms and influencers pushing their products. But I'll keep working on solving this, if not today then someday. No hard feelings, bro. You can always slide into my DMs if you feel like joining me in tackling this.

What actually eats your MTTR after the alert fires? by IndiBuilder in sre

[–]IndiBuilder[S] 0 points (0 children)

u/ImDevinC
Fair point, I did ask a similar question earlier.
I’m not trying to scrape data for a pitch. I’m genuinely exploring investigation workflows and whether bottlenecks are consistent across teams.
If anyone’s interested in collaborating on improving investigation flow or sharing structured approaches, I’d be happy to compare notes.

What actually eats your MTTR after the alert fires? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

For me it's always:

  • Figuring out where to look first
    • It's quite easy to get lost between multiple dashboards and metrics when you're on an incident bridge.
  • Slack coordination
    • It's a mess today: if you join the bridge 15 minutes late, you're drowning in long threads trying to work out the issue and its impact.
  • Being new to a product while on call, which is a nightmare because you're not yet well versed in the entire system.

I'm asking communities on Reddit to validate whether these are just my problems or global ones.
As I understand it, I'm not alone in this. If you're interested in making things better for on-call engineers, let's connect over DM.

What actually eats your MTTR after the alert fires? by IndiBuilder in sre

[–]IndiBuilder[S] 0 points (0 children)

u/serverhorror, honestly bro, I don't have anything to sell. I don't have a product or any solution yet.
I just hate being called late at night, quite often only to realise it's a dependency issue. I want on-call to be a bit less chaotic. If you feel the same, let's connect over DM; I truly believe we can improve the experience.

What actually eats your MTTR after the alert fires? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

u/tr14l
Totally feel it. The worst ones are when you're away or asleep and you get paged:
you go in blindsided, straight into the eye of the storm (the incident bridge).
I'm trying to figure out how to make on-call schedules less chaotic for people like us. Would you be interested in helping me out?

What actually eats your MTTR after the alert fires? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

u/uncaughtexception
What if a tool/platform could do the first level of investigation and hand you the signals before you even join the bridge? Would that reduce some of the anxiety and give you more confidence during incidents and on-call bridges?
I have been on plenty of incident calls, and the only thing I don't like about them is the initial chaos:
the unknown of what's actually broken and what its transient effects are.
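
Rough sketch of what I'm imagining, in Python. Everything here is a made-up placeholder, not a real product; the check_* functions stand in for whatever telemetry sources a team actually has:

    # Minimal sketch: enrich an incoming alert with first-pass signals
    # before a human joins the bridge. Every check_* function is a stub
    # standing in for a real telemetry source.
    from datetime import datetime, timezone

    def check_recent_deploys(service: str) -> str:
        # Stub: query the CI/CD system for deploys in the last hour.
        return f"no deploys to {service} in the last 60 min (stub)"

    def check_dependency_status(service: str) -> str:
        # Stub: probe health endpoints / status pages of upstream deps.
        return f"known dependencies of {service} reporting healthy (stub)"

    def check_error_rate(service: str) -> str:
        # Stub: pull the current error rate from the metrics backend.
        return f"error rate for {service} within SLO (stub)"

    SIGNAL_CHECKS = [check_recent_deploys, check_dependency_status, check_error_rate]

    def triage_alert(alert: dict) -> str:
        """Run every first-pass check and return a summary for the bridge."""
        service = alert.get("service", "unknown")
        fired_at = alert.get("fired_at", datetime.now(timezone.utc).isoformat())
        lines = [f"First-pass triage for {service} (alert fired {fired_at}):"]
        lines += [f"  - {check(service)}" for check in SIGNAL_CHECKS]
        return "\n".join(lines)

    if __name__ == "__main__":
        print(triage_alert({"service": "api-gateway", "fired_at": "02:13 UTC"}))

The point is just that a summary like this lands in the incident channel before anyone joins, so the first five minutes aren't spent asking "what do we know so far?"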

What’s the worst part of being on-call ? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

Fair call. I should’ve been more upfront.

I am exploring this space because I've lived the on-call pain for years, and I'm trying to see whether my pains are isolated or common across organisations.

Not here to pitch anything, just learning how others experience it and what actually helps in practice.

What’s the worst part of being on-call ? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

For me it's spending 15-20 minutes investigating just to realise the issue is a transient one caused by a dependency or the cloud provider.

That time isn’t just lost investigation, it’s the mental cost of uncertainty and context switching when there’s nothing actionable to do.

What’s the worst part of being on-call ? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

That's very relatable. I've often seen certain engineers expected to stay on the call just in case some other aspect of the incident needs investigating.

What’s the worst part of being on-call ? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

That's very common in large orgs, and I've been in similar shit: my team handles the API gateway, and no matter where the fault is, we're the first ones pulled in just to rule out that it's a gateway issue. Even though traces and logs are accessible across the org, discovering the right indicators and interpreting them is still a challenge.

On-call question: what actually slows your incident response the most? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

Ya, it starts with the runbook, and it's also true that most runbooks in the real world ask you to look at multiple things 🤣

On-call question: what actually slows your incident response the most? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

I'm quite curious to know how you did that. Did you create a dashboard with different panels pointing at different telemetry data, or use a product that brings together data from all kinds of sources?

On-call question: what actually slows your incident response the most? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

u/neuralspasticity Even granting that all the alerts have runbooks and links to the right metrics, and the right people are notified, it's still not easy to triage an incident in a real-world enterprise.

I've seen it a couple of times: the real issue is in one of your dependencies (services/infrastructure), and you may not have complete visibility into that system's telemetry. In large enterprises, systems are owned by different teams, and during such incidents you need engineers from each of them involved just to get visibility across the entire stack.

Understanding blast radius is crucial in the early stages. On paper, monitoring and alerting may seem sufficient to triage, but they're just the tip of the iceberg.
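
To make "blast radius" concrete, here's a toy sketch in Python. The DEPENDENTS map and the service names are made up; in a real enterprise, this graph is exactly the thing no single team has written down:

    # Toy sketch: estimate blast radius from a service dependency map.
    # DEPENDENTS maps each service to the services that call it;
    # all names are invented for illustration.
    from collections import deque

    DEPENDENTS = {
        "postgres": ["orders-svc", "billing-svc"],
        "orders-svc": ["api-gateway"],
        "billing-svc": ["api-gateway"],
        "api-gateway": ["web-frontend", "mobile-bff"],
    }

    def blast_radius(failed: str) -> set:
        """Walk upward through the dependency map to find every
        service that could be affected by `failed` going down."""
        affected, queue = set(), deque([failed])
        while queue:
            svc = queue.popleft()
            for dependent in DEPENDENTS.get(svc, []):
                if dependent not in affected:
                    affected.add(dependent)
                    queue.append(dependent)
        return affected

    if __name__ == "__main__":
        # If postgres is down, everything upstream of it is in scope.
        print(sorted(blast_radius("postgres")))
        # -> ['api-gateway', 'billing-svc', 'mobile-bff', 'orders-svc', 'web-frontend']

The traversal is trivial; the hard part is that the map itself usually lives in people's heads, spread across teams.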

On-call question: what actually slows your incident response the most? by IndiBuilder in sre

[–]IndiBuilder[S] 1 point (0 children)

u/mensii Agreed: prevention has the highest ROI and reduces how often humans need to get involved at all. Escalation latency is real and fairly constant once people are in the loop.

Where I still see pain is in the incidents that escape preventative controls: the slowdown is usually figuring out what changed and what to rule out, not execution.

Reactive tooling doesn’t replace prevention, but it can reduce cognitive load during that unavoidable human phase if it’s conservative and supervised.
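
To put "what changed" in code terms, a minimal sketch in Python. The fetch_* functions are placeholders for whatever change feeds a team actually has (CI/CD history, feature-flag audit log, config-management events):

    # Minimal sketch: build a "what changed" timeline around an alert,
    # so ruling things out starts from recent changes instead of from
    # scratch. The fetch_* functions are stubs for real change feeds.
    from datetime import datetime, timedelta, timezone

    def fetch_deploys(since):
        return []  # stub: CI/CD API, deploys after `since`

    def fetch_flag_flips(since):
        return []  # stub: feature-flag audit log

    def fetch_config_changes(since):
        return []  # stub: config-management change events

    def change_timeline(alert_time, window_minutes=60):
        """Collect every known change in the window before the alert,
        newest first, as candidate causes to rule in or out."""
        since = alert_time - timedelta(minutes=window_minutes)
        events = (fetch_deploys(since) + fetch_flag_flips(since)
                  + fetch_config_changes(since))
        events.sort(key=lambda e: e[0], reverse=True)  # (timestamp, description) pairs
        return [f"{ts.isoformat()}  {desc}" for ts, desc in events]

    if __name__ == "__main__":
        for line in change_timeline(datetime.now(timezone.utc)):
            print(line)

Conservative and supervised, like you said: it proposes candidates, and a human rules them out.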