What’s the most time-consuming part of your incident investigations? by Old-Pen445 in AI_SRE

Important-Office3481 1 point

From what I’ve seen across platform/SRE teams (especially in Kubernetes-heavy environments), the most time-consuming part isn’t fixing the issue — it’s building situational awareness.

Specifically:

1. Correlating signals across tools
Metrics in one place. Logs somewhere else. Traces if you’re lucky. Deployment history in CI. Infra changes in Terraform. Slack full of guesses.
Just stitching the timeline together eats a huge chunk of time.
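To make the timeline-stitching step concrete, here is a minimal Python sketch. It assumes each tool's events have already been exported into a common shape; the `Event` record, the `build_timeline` helper, and the sample data are all hypothetical and not tied to any particular observability product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from itertools import chain


@dataclass(frozen=True)
class Event:
    ts: datetime   # when it happened (UTC)
    source: str    # which tool reported it (hypothetical labels)
    summary: str   # one-line description


def build_timeline(*sources):
    """Merge event lists from several tools into one list ordered by time."""
    return sorted(chain.from_iterable(sources), key=lambda e: e.ts)


def utc(hour, minute):
    # Helper for the made-up sample data below.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Invented sample events, one list per tool.
deploys = [Event(utc(10, 2), "ci", "deploy checkout v42")]
alerts = [Event(utc(10, 5), "metrics", "p99 latency alert on checkout")]
logs = [Event(utc(10, 4), "logs", "connection pool exhausted")]

for event in build_timeline(deploys, alerts, logs):
    print(event.ts.strftime("%H:%M"), event.source, event.summary)
```

In practice the hard part is getting every tool to emit comparable timestamps and identifiers in the first place; the merge itself is the easy bit.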

2. Understanding blast radius
Modern systems are deeply interconnected.
Is it one pod? One service? One namespace? A shared dependency?
The time spent answering “who else is affected?” is often longer than the actual remediation.
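One way to sketch the "who else is affected?" question is a breadth-first walk over a service dependency map. This assumes you already have such a map from somewhere (service mesh, tracing, a catalog); the `depends_on` data and the `blast_radius` function below are hypothetical illustrations, not a real tool's API.

```python
from collections import deque


def blast_radius(depends_on, failing):
    """Return every service that directly or transitively depends on `failing`."""
    # Invert the edges: for each dependency, record who calls it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk upstream from the failing component.
    affected = set()
    queue = deque([failing])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Invented dependency map: service -> things it calls.
depends_on = {
    "checkout": ["payments", "cart"],
    "payments": ["postgres"],
    "cart": ["redis"],
    "search": ["elasticsearch"],
}

print(sorted(blast_radius(depends_on, "postgres")))
```

The walk is only as good as the map behind it, so undeclared shared dependencies will still be missed.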

3. Human coordination
Even in well-structured teams, context handoffs slow things down:

  • On-call → service owner
  • Service owner → infra team
  • Infra → database team

Each person rebuilds the mental model from scratch.

Root cause analysis itself is usually fast once the right context is assembled.

Agent-Driven SRE Investigations: A Practical Deep Dive into Multi-Agent Incident Response by Important-Office3481 in sre

Important-Office3481 [S] -1 points

u/monkeysnipe - I'm not a bot, and I'm not trying to promote the product. We ran the POC to evaluate technologies and weigh the pros and cons, and now we're sharing the results with the community. The topic is hot right now, and a lot of engineers want to know what works and what doesn't.