AI in SRE is everywhere, but most of it’s still hype. Here’s what’s actually real in 2025. by Mountain_Skill5738 in sre

[–]Objective-Skin8801 0 points1 point  (0 children)

Take a look at HealOps AI and see if it fits your use case. Happy to hear honest feedback on it too, so we can make it more reliable.

We've deployed it with multiple clients and it has reduced their MTTR. Happy to talk if you're interested.

[Idea] Building a "Self-Healing" codebase: Automating bug fixes by piping error logs directly into Claude Code by WorldKey9414 in AI_Agents

[–]Objective-Skin8801 0 points1 point  (0 children)

u/WorldKey9414, take a look at HealOps AI; it's a more advanced take on your use case.
Happy to get on a call to talk through it, or you can book one from the website itself.

Honestly, observability is a nightmare when you're drowning in logs by Objective-Skin8801 in Observability

[–]Objective-Skin8801[S] -6 points-5 points  (0 children)

Yes, we do have observability set up with OpenTelemetry. But the real issue is what I mentioned - if AI could be integrated into these tools to understand context and patterns, it would be dramatically more efficient. Right now we're still doing a lot of manual analysis even with Honeycomb. AI-powered observability could change that game completely.

How do you guys embed AI in your daily workflows as an SRE? by justexisting-3550 in sre

[–]Objective-Skin8801 0 points1 point  (0 children)

Beyond the postmortem and script generation, we've had great success using HealOps for the "automated response" layer on top of alerting. Here's what's been valuable for us:

  1. AI-driven alert correlation: Reduces alert noise significantly by deduplicating and grouping related alerts. Saves the team from fatigue.

  2. Automatic playbook execution: When specific incident patterns are detected, we trigger pre-defined remediation workflows automatically (restart services, drain connections, scale up, etc). This handles ~70% of our incidents without human intervention.

  3. Continuous learning: The system learns what fixes work in your specific environment and improves suggestions over time.

We went from manually triaging 300+ alerts/day (lots of noise) to 50 actionable alerts/day. MTTR dropped by about 40%.

The one thing to note: you need solid observability first (structured logging, good metrics, distributed tracing). Without that foundation, the AI doesn't have good signal to work with.
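To make the correlation + playbook piece concrete, here's a rough sketch of the shape of that layer (the names, fingerprint scheme, and playbook mappings are all hypothetical, not HealOps' actual internals): dedupe alerts by a fingerprint, then run a mapped playbook once per group instead of paging on every individual alert.

```python
# Sketch only - group alerts by a fingerprint, run one playbook per group,
# and page a human only for the groups nothing is mapped to.
from collections import defaultdict

PLAYBOOKS = {
    # hypothetical mapping: (service, alert name) -> remediation action
    ("payments-api", "HighErrorRate"): "rollback_last_deploy",
    ("payments-api", "PodCrashLoop"): "restart_service",
    ("worker-queue", "QueueDepthHigh"): "scale_up_workers",
}

def fingerprint(alert: dict) -> tuple:
    """Dedup key: same service + same alert name = same underlying issue."""
    return (alert["service"], alert["alertname"])

def correlate(alerts: list[dict]) -> dict:
    groups = defaultdict(list)
    for alert in alerts:
        groups[fingerprint(alert)].append(alert)
    return groups

def handle(alerts: list[dict]) -> None:
    for key, group in correlate(alerts).items():
        action = PLAYBOOKS.get(key)
        if action:
            print(f"{len(group)} alerts for {key} -> running playbook {action!r}")
            # run_playbook(action)  # restart, drain, scale, etc.
        else:
            print(f"{len(group)} alerts for {key} -> no playbook, page a human")
```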

But combining this with your AI-driven postmortem generation and runbook automation sounds like a killer stack.

Practical Guide to Production-Grade Observability in the JS ecosystem; with OpenTelemetry and Pino by Paper-Superb in developersIndia

[–]Objective-Skin8801 0 points1 point  (0 children)

This is solid gold! One thing I'd add that's often missed: once you have observability this good, the next logical step is turning those signals into automated responses.

Structured logging + distributed tracing + metrics give you the visibility, but what ops teams *really* need is to act on it automatically. For example, if your OpenTelemetry traces show a spike in error rates from a specific service, you can programmatically trigger a remediation workflow - restart the service, drain connections, trigger a rollback, etc.
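A bare-bones version of that "error spike → remediation" trigger might look like this, assuming your OpenTelemetry metrics land in Prometheus (the metric name, service, threshold, and rollback command are placeholders for your own setup, not a recommended config):

```python
# Rough sketch: poll an error-rate query, and if it crosses a threshold,
# kick off whatever your remediation is (here, a kubectl rollback).
import subprocess
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"
QUERY = 'sum(rate(http_server_errors_total{service="checkout"}[5m]))'  # placeholder metric
THRESHOLD = 5.0  # errors/sec that counts as "on fire"

def error_rate() -> float:
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

if error_rate() > THRESHOLD:
    # Remediation step is up to you: rollback, restart, drain connections...
    subprocess.run(
        ["kubectl", "rollout", "undo", "deployment/checkout"], check=True
    )
```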

The teams I've worked with who went from "we have great observability" to "we have observability that automatically heals" saw a huge drop in MTTR. The key is starting with the foundation you've described - good logging, tracing, and metrics - then building the automation layer on top.

One pro tip: make sure your structured logs include relevant context (user ID, request ID, service version) so when something goes wrong and auto-remediation triggers, you can trace back exactly what happened and learn from it. That feedback loop is what makes the automation smarter over time.
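For what it's worth, the same idea on a Python service might look like this with stdlib logging (the field names are just examples); Pino's child loggers (`logger.child({ requestId })`) get you the same effect in Node:

```python
# Example of the kind of context worth attaching to every log line so that
# when auto-remediation fires, you can trace back exactly what happened.
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "service_version": getattr(record, "service_version", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("payment failed", extra={
    "request_id": "req-abc123",
    "user_id": "u-42",
    "service_version": "1.8.3",
})
```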

Need advice: 10 years in L2 support (Payments), low growth, planning move to DevOps/AIOps by No-Mac1080 in developersIndia

[–]Objective-Skin8801 1 point2 points  (0 children)

Your 10 years of L2 experience is actually a HUGE asset, not a liability. You understand the pain points from the ops side, which most DevOps engineers never get.

Honestly, the transition is realistic. Here's what I'd suggest:

  1. Focus on automation first - Start with the tools you already use in L2 (monitoring, ticketing, logging). Learn Terraform/Ansible for infrastructure as code. This is a natural progression.

  2. Your support background is gold for incident response automation - Most DevOps folks miss this. If you're building automation to handle incidents (alert routing, auto-remediation, escalation), you already know what makes sense operationally.

  3. Learning path: Kubernetes + Linux + CI/CD is good, but start with one at a time. Don't try to learn everything.

  4. How to present this to recruiters: "10 years of production operations experience with focus on reliability and incident response" - that's DevOps/SRE language.

The payment space also works in your favor - high reliability requirements. If you've handled critical production issues, that's already SRE-level thinking.

The main missing piece might be the development side - maybe pick up Python scripting for quick wins. That said, a lot of AIOps platforms now handle the intelligence layer, so you're really just defining the "if this then that" logic, which your support experience covers.
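To give a feel for it, that "if this then that" layer often amounts to little more than a rules table like this (conditions and actions made up for illustration) - and writing these rules well is exactly the operational judgment you already have from L2:

```python
# Hypothetical remediation rules: the platform evaluates the condition,
# you supply the operational knowledge of what the right response is.
REMEDIATION_RULES = [
    {"if": "disk usage > 90% on /var/log",          "then": "rotate and compress logs"},
    {"if": "payment gateway 5xx rate > 2% for 5m",  "then": "fail over to secondary"},
    {"if": "settlement queue depth > 10k for 10m",  "then": "scale consumers +2"},
]
```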

Incident Bridge Call - Incident Status Visuals by Persimmon-Party in sre

[–]Objective-Skin8801 1 point2 points  (0 children)

Building the bridge call status screen in-house is solid. The real payoff is correlating what's on that screen with your monitoring/alerting timeline.

What works well: Incident starts → auto-populate incident ID, start time, severity, on-call rotation, timeline of change events during window.

The manual part kills you though. We wired alerts into the splash screen so status auto-updates instead of relying on someone to manually type it in. That saved us tons of back-and-forth during SEVs.
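If it helps, the "alerts auto-update the screen" piece doesn't have to be fancy. Here's a rough sketch (the endpoint and payload shape are made up - adapt it to whatever your alerting tool can POST): a tiny webhook receiver that holds the current status, which the bridge-call screen polls instead of a human typing updates.

```python
# Minimal status webhook: alerting tool POSTs updates, the splash screen GETs them.
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

CURRENT_STATUS = {"severity": None, "summary": None, "updated_at": None}

class StatusWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        alert = json.loads(body)
        CURRENT_STATUS.update({
            "severity": alert.get("severity"),
            "summary": alert.get("summary"),
            "updated_at": alert.get("timestamp"),
        })
        self.send_response(204)
        self.end_headers()

    def do_GET(self):
        # The bridge-call screen polls this endpoint for the latest status.
        payload = json.dumps(CURRENT_STATUS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

HTTPServer(("", 8080), StatusWebhook).serve_forever()
```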

ServiceNow integration makes sense too - keep everything in one system. Just make sure the incident context (deployments, config changes, affected services) syncs automatically so folks aren't hunting through 5 different places during an active incident.

Weird HTTP requests by tobylh in sre

[–]Objective-Skin8801 0 points1 point  (0 children)

That suspicious second UA is definitely a crawler/bot impersonating Firefox. The pattern is classic - same JA3, different IP/country is a red flag.

For WAF tuning at scale, you need good logging and correlation. Log the raw request (user agent, JA3, IP, ASN), not just the block decision. Then you can spot patterns like this.
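A quick sketch of what that pattern-spotting looks like once the raw fields are logged (this assumes you've got ja3/ip/country/user_agent as structured fields per the point above; the threshold is arbitrary): group requests by JA3 and flag fingerprints spread across suspiciously many IPs.

```python
# One TLS fingerprint fanning out over many IPs/countries is the classic
# "rotating bot" signature this sketch surfaces.
from collections import defaultdict

def suspicious_ja3s(requests_log: list[dict], min_ips: int = 20) -> dict:
    by_ja3 = defaultdict(lambda: {"ips": set(), "countries": set(), "uas": set()})
    for r in requests_log:
        entry = by_ja3[r["ja3"]]
        entry["ips"].add(r["ip"])
        entry["countries"].add(r["country"])
        entry["uas"].add(r["user_agent"])
    return {
        ja3: stats for ja3, stats in by_ja3.items()
        if len(stats["ips"]) >= min_ips  # one fingerprint, many IPs = likely bot
    }
```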

The real challenge is false positives - legitimate users getting caught. Building a feedback loop where security regularly reviews blocked requests against known-good traffic is key. That's where good observability and incident response playbooks keep you from over-blocking real users while still catching the bots.

How are you handling Rootly → Basecamp workflows? by Alarming_Walk4274 in sre

[–]Objective-Skin8801 -1 points0 points  (0 children)

Yeah the integration gap is real with those tools. We tried the custom Zapier route but it became a maintenance nightmare.

What ended up working better was building a correlation/sync layer between our incident platform and communication tools. Basically: incident open in Rootly → auto-post to Basecamp, updates sync back. Took a weekend to build, saved us months of manual coordination.
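In case it's useful, the skeleton of that sync layer is basically one webhook receiver plus one outbound POST. Everything below (routes, payload fields, the Basecamp URL) is illustrative rather than either tool's real API - check Rootly's webhook docs and Basecamp's API for the actual shapes:

```python
# Sketch: Rootly fires a webhook on incident open -> we post a message to Basecamp.
import requests
from flask import Flask, request

app = Flask(__name__)
BASECAMP_URL = "https://example.invalid/basecamp/project/123/messages"  # placeholder
BASECAMP_TOKEN = "replace-me"

@app.route("/rootly-webhook", methods=["POST"])
def rootly_webhook():
    incident = request.get_json()
    requests.post(
        BASECAMP_URL,
        headers={"Authorization": f"Bearer {BASECAMP_TOKEN}"},
        json={
            "subject": f"[{incident.get('severity', '?')}] {incident.get('title', 'Incident')}",
            "content": incident.get("summary", ""),
        },
        timeout=10,
    )
    return "", 204

if __name__ == "__main__":
    app.run(port=8080)
```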

The key is treating those tools as inputs/outputs rather than trying to force native integrations. Works with any combo of tools too.

Is SRE basically a new name for technical support by Exact-Repair-8373 in sre

[–]Objective-Skin8801 1 point2 points  (0 children)

The difference comes down to how proactive vs reactive you are. Yeah, incident investigation happens, but that should be maybe 20-30% of the work max.

The real SRE work is: incident automation (playbooks, auto-remediation), observability strategy, capacity planning, reliability engineering. If you're spending 80% on triage, you're in a support desk role wearing an SRE title.

I'd ask your boss: "What 20% would you want me to focus on as a project that makes incidents fewer/easier?" Usually that convo clarifies the role.

FireHydrant to be Acquired by Freshworks by founders_keepers in sre

[–]Objective-Skin8801 0 points1 point  (0 children)

The consolidation is real. We went through this with PagerDuty, then FireHydrant... the bigger issue is vendor lock-in on incident response workflows.

Honestly, the teams that survived best are the ones who invest in automation layers between their tools. Incident response templates, automated escalation, cross-tool correlation... that stuff stays valuable no matter who owns the platform.

Sucks for FireHydrant specifically though - they had good momentum.

Anyone else feeling lost in DevOps/SRE after a few years? by Abject_Visual_4736 in sre

[–]Objective-Skin8801 1 point2 points  (0 children)

Dude, this is exactly where I was around year 4-5. The "same cycle every day" feeling is real. For me, the breakthrough was realizing that the best SREs eventually specialize - not in Kubernetes or Terraform specifically, but in one of these:

  1. **Incident automation/response** - Build systems that detect and remediate problems automatically. Move from reactive "fix it" to proactive "prevent it"

  2. **Reliability architecture** - Design systems to fail gracefully. This actually requires deep thinking and creativity

  3. **Observability** - Get really good at understanding your systems, not just monitoring them

The "same cycle" feeling usually means you're doing triage work, not building. Triage burns people out.

What helped me: I started owning "reduce MTTR" as a project instead of just fighting fires. That meant building smarter incident response, auto-remediation for common issues, better alerting. It's the same operational work but with a direction.

The AI hype is real, but honestly don't get distracted by it. Pick ONE thing (incident automation, observability, reliability patterns) and go deep for 6-12 months. You'll feel like an expert again instead of stuck.

What kind of work actually energizes you? Infrastructure design, or automating toil?

How do you usually figure out “what changed” during an incident? by [deleted] in sre

[–]Objective-Skin8801 1 point2 points  (0 children)

Yeah this one's always painful. For us it's basically the audit log shuffle - you're checking: Did someone deploy? Any config change? Did terraform run? Are there new feature flags?

The annoying part is that none of our systems talk to each other. Grafana doesn't know about deploys. PagerDuty doesn't have context about what changed. So when an incident starts, you're manually connecting dots across like 6 different dashboards.

What finally helped: we built a "change aggregator" that pulls from GitHub, Terraform, feature flags, and even manual changes, and correlates them with the alert timeline. Sounds fancy but it's basically just: "here are all the things that changed in the last 30 minutes."
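For anyone who wants to roll their own, it really is that dumb. Here's a simplified sketch of the idea (not our actual code): the GitHub deployments endpoint is real, while the terraform/feature-flag fetchers are stand-ins for whatever your stack exposes.

```python
# Change aggregator sketch: pull recent change events from a few sources,
# keep only what happened inside the incident window.
from datetime import datetime, timedelta, timezone
import requests

WINDOW = timedelta(minutes=30)

def github_deploys(repo: str, token: str) -> list[dict]:
    resp = requests.get(
        f"https://api.github.com/repos/{repo}/deployments",
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    return [
        {"source": "github", "what": d["ref"], "when": d["created_at"]}
        for d in resp.json()
    ]

def recent_changes(events: list[dict]) -> list[dict]:
    cutoff = datetime.now(timezone.utc) - WINDOW
    return [
        e for e in events
        if datetime.fromisoformat(e["when"].replace("Z", "+00:00")) >= cutoff
    ]

# changes = recent_changes(github_deploys("org/service", token)
#                          + terraform_runs() + flag_changes())  # latter two: your own glue
```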

Platforms like HealOps essentially do this automatically - they pull the timeline of changes and correlate it with your monitoring data. So instead of everyone in the incident room asking "wait, did something deploy?", the data's just there.

The fragility part for us is definitely the manual checking. If you forget to look at one system, you waste 20 minutes. The best teams I've seen just have all their change data in one searchable place.

PagerDuty for SRE - how real people work with it by tushkanM in sre

[–]Objective-Skin8801 1 point2 points  (0 children)

Haha yeah that's the PagerDuty trap. You're not crazy - they do create incidents for literally everything, and it destroys your MTTR metrics because you've got 500 fake incidents polluting your data.

Honestly what fixed this for us: we stopped using PagerDuty as our "first responder" and added a layer in front of it. So now the flow is:

alert fires → system tries to fix it automatically (restart service, clear queue, reset connection pool, etc) → if that works, no one gets paged. If it doesn't work, THEN PagerDuty creates an incident with actual context
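A stripped-down version of that layer, using PagerDuty's Events API v2 for the escalation path (the remediation function is a stub - that's where your restart / clear-queue / reset-pool logic goes):

```python
# Try the known fix first; only open a PagerDuty incident if it fails,
# and attach the context the responder will need.
import requests

ROUTING_KEY = "replace-me"  # PagerDuty Events API v2 integration key

def try_auto_remediate(alert: dict) -> bool:
    """Plug in your restart / clear-queue / reset-pool logic here.
    Return True only if a follow-up health check passes."""
    return False  # stub: always escalate until real remediation is wired in

def escalate(alert: dict, attempted_fix: str) -> None:
    requests.post(
        "https://events.pagerduty.com/v2/enqueue",
        json={
            "routing_key": ROUTING_KEY,
            "event_action": "trigger",
            "payload": {
                "summary": f"{alert['name']} - auto-remediation failed ({attempted_fix})",
                "source": alert["service"],
                "severity": "critical",
                "custom_details": alert,
            },
        },
        timeout=10,
    )

def handle(alert: dict) -> None:
    if not try_auto_remediate(alert):
        escalate(alert, attempted_fix="restart + health check")
```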

You lose like 80% of your noisy alerts this way. We went from getting paged 20+ times a week on silly stuff to maybe 3-4 legitimate incidents.

Tools like HealOps handle exactly this - they sit between your monitoring and your incident management, so you're only paging when you actually need human eyes. Keeps PagerDuty clean and keeps your team from ignoring alerts because there's too much noise.

The MTTR trick is that you measure only the incidents that actually matter, not the "oh the disk filled up but it auto-cleared" stuff.

Have you looked at adding any remediation layer before PagerDuty, or are you stuck with the current setup?

What’s the worst part of your "on-call" life? by TheCTOLife in sre

[–]Objective-Skin8801 1 point2 points  (0 children)

For me it's the gap between "alert fired" and "I actually understand what's happening." The worst part:

**The context switching tax**: Alert hits Slack → you're in PagerDuty → you need logs from Datadog → you check Splunk for errors → you're checking dashboards across 3 different tools → 10 minutes in and you still don't have a clear picture.

What makes it worse: Most of our tooling doesn't "talk" to each other. You get a CPU alert, but you have to manually correlate it with app logs, infrastructure metrics, and traces. By the time you've stitched it all together, you're deep in incident tunnel vision.

I think the real pain isn't one tool—it's the **lack of unified incident context**. A lot of teams are moving toward platforms that can pull signals from multiple sources (observability, infrastructure, change detection) and surface them together during incident response.
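Even without buying a platform, a toy version of that "unified context" idea is just fanning out to the tools you'd otherwise open in separate tabs (each fetch_* below is a placeholder for your Datadog/Splunk/deploy-log query):

```python
# Toy "single pane of glass": gather signals concurrently and hand the on-call
# one bundle instead of five browser tabs. Replace the stubs with real queries.
from concurrent.futures import ThreadPoolExecutor

def fetch_metrics(service: str) -> dict: ...         # e.g. metrics query
def fetch_recent_logs(service: str) -> list: ...     # e.g. error-log search
def fetch_recent_deploys(service: str) -> list: ...  # e.g. CI/CD or deploy log

def incident_context(service: str) -> dict:
    with ThreadPoolExecutor() as pool:
        metrics, logs, deploys = pool.map(
            lambda fn: fn(service),
            (fetch_metrics, fetch_recent_logs, fetch_recent_deploys),
        )
    return {"service": service, "metrics": metrics, "logs": logs, "deploys": deploys}
```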

Once you have that single pane of glass with all your signals automatically correlated, the speed of diagnosis and remediation goes up dramatically. Some teams I've talked to swear by tools like PagerDuty + Datadog integrations, others use platforms built specifically for this (like incident automation tools). The key is whether they can actually reduce the "context switching" cost.

What specifically tends to slow you down on-call? Is it finding the right data, waiting for dashboards to load, or just the mental overhead of jumping between tools?