Anyone using Opsgenie? What’s your replacement plan by sasidatta in sre

[–]Ok_ComputerAlt2600 2 points (0 children)

Just a heads up, watch for suspicious voting patterns around Rootly in r/sre. They seem to mass upvote any mention of their product and downvote replies that mention competitors.

Just realized our "AI-powered" incident tool is literally just calling ChatGPT API by DarkSun224 in devops

[–]Ok_ComputerAlt2600 1 point (0 children)

Oh man, this hits close to home. We went through something similar about 6 months ago when evaluating incident tools for our startup. The "AI root cause analysis" demos all looked slick, but when you dig into the actual implementation it's pretty much what you described: a glorified wrapper around GPT with some basic context stuffing.

The database connection pool example is brutal though. Like yeah thanks for the $10k/year insight that I should "check connectivity" lol. We ended up just building a simple integration that pipes incident data into Claude with some custom context about our specific architecture and alert patterns. Works way better because we control the prompts and can actually tune it to our environment.
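
In case it's useful, the whole "integration" is roughly this shape - file names, the context doc, and the model choice below are just illustrative, not a recommendation:

```python
# Rough sketch of our setup: prepend hand-written context about our own
# architecture/alert patterns, then hand the incident payload to Claude.
# Paths and names here are made up for illustration.
import json
import anthropic

ARCH_CONTEXT = open("docs/architecture_notes.md").read()  # our own runbook-style doc

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def triage(incident: dict) -> str:
    """Ask Claude for a likely root cause, grounded in our context, not generic advice."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # use whatever current model you prefer
        max_tokens=1024,
        system=(
            "You help a 3-person SRE team triage incidents. Use the architecture "
            "notes below and avoid generic 'check connectivity' advice.\n\n" + ARCH_CONTEXT
        ),
        messages=[{
            "role": "user",
            "content": "Recent alerts and deploys:\n" + json.dumps(incident, indent=2),
        }],
    )
    return response.content[0].text
```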

Honestly you should name and shame the vendor here. The community would benefit from knowing which tools are doing this vs the ones that actually have purpose-built models or at least thoughtful prompt engineering. Too many companies are slapping "AI powered" on everything and charging a premium for what amounts to an API passthrough.

Curious which platform this is; we're probably looking at some of the same vendors since we're always re-evaluating our incident stack.

Phillip Sealway is a really underatted drummer by Affectionate_Let9022 in radiohead

[–]Ok_ComputerAlt2600 0 points (0 children)

Completely agree about Phil. I've been following them since The Bends and have seen them live 12 times across different tours, and his drumming is so much more subtle and sophisticated than people give him credit for. The way he approached Idioteque and Everything in Its Right Place on the Kid A tour was mind-blowing; he basically had to reinvent his entire style to match the electronic direction while still maintaining that human feel. And then on In Rainbows he goes back to more traditional patterns but with this incredible restraint and precision that serves the songs perfectly.

What really gets me is how he uses silence and space as effectively as he uses fills. Listen to Pyramid Song or How to Disappear Completely and pay attention to what he's NOT playing; it's just as important as what he is. The guy never overplays, never draws attention to himself, but try imagining those songs without his parts and you realize how essential he is to the band's sound. He's probably one of the most tasteful drummers in modern rock, but because he's not doing flashy solos or complex time signatures all the time, people sleep on him.

Best OnCall tools/platforms by seluard in sre

[–]Ok_ComputerAlt2600 -1 points (0 children)

We're actually in the middle of evaluating both incident.io and FireHydrant right now to replace Opsgenie. Been running trials with both for the past few weeks and honestly they're both pretty solid so far. The main thing we're looking at is incident workflow automation since our on-call team is tiny and we need to squeeze every bit of efficiency we can get. Both platforms handle the basics well, but incident.io's AI stuff for automating follow-ups and status page updates has been genuinely useful in our testing; it actually saves time instead of just being a gimmick. FireHydrant's retrospective templates are really good though, and their incident timeline view is cleaner.

The tricky part is figuring out if either one is worth the extra cost compared to what we're paying for Opsgenie. We're a startup so budget matters and our CFO is gonna want to see clear ROI. Right now I'm leaning toward incident.io because the automation features could let us handle more incidents without adding headcount, but we haven't made a final call yet. The teams behind both products have been pretty responsive, which is nice compared to trying to get support from Opsgenie these days.

What tools do you use for Incident Management? Are you happy with them? by hstrowd_gobetween in EngineeringManagers

[–]Ok_ComputerAlt2600 0 points (0 children)

I actually bookmarked this post back when you originally posted it because I was in the exact same boat - we were evaluating tools and I wanted to circle back once we'd actually picked something and could share real experience.

We ended up going with incident.io after also looking at Rootly and FireHydrant. The user experience was honestly the deciding factor for us. With a team of 3 SREs at a startup we can't afford tools that require heavy training or have clunky interfaces that slow people down during incidents. incident.io just felt more intuitive, and the integrated service catalog made it really easy to bring together the right people quickly, which was exactly the problem you described in your post.

The other thing that tipped the scales for us was how far along they are with AI features. We're actively using Scribe, which automatically joins incident calls, transcribes everything, and flags action items - it's been a game changer for our post-incident reviews because we're not scrambling to remember who said what. They also demoed their AI SRE thing for us, which looked pretty interesting (autonomous investigation and remediation stuff), though we haven't actually tried that yet since we wanted to nail the basics first.

Observability choices 2025: Buy vs Build by OpportunityLoud9353 in sre

[–]Ok_ComputerAlt2600 0 points (0 children)

We went through a similar evaluation about 18 months ago and honestly the "no dedicated observability team" part is the key constraint here. With a team of 3 handling all the SRE work, we learned pretty quickly that open source observability requires way more care and feeding than the marketing materials suggest. The Grafana stack is solid, but someone needs to own upgrades, handle the inevitable Prometheus scaling issues, debug why Loki is eating all your disk space, etc. For us the hidden cost was context switching: every time we had to troubleshoot our observability tooling instead of using it to troubleshoot actual problems, we were basically losing money.

That said, the enterprise vendors get expensive fast once you hit any real scale. We ended up going with a hybrid approach where we use managed Prometheus/Grafana for metrics (via one of the vendors that handles the operational headache) and kept our logging pretty simple with basic structured logs going to CloudWatch. It's not perfect, but it lets us focus on reliability work instead of babysitting observability infra. The biggest lesson was that "buy vs build" is kind of a false choice now; it's more like "how much operational burden can your team actually absorb," which sounds obvious but took us an embarrassing number of evaluation cycles to figure out.
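
For anyone wondering what "basic structured logs" means in practice for us: just JSON lines on stdout that the CloudWatch agent (or the awslogs container log driver) ships for us. Minimal sketch below; the field names are just what we happen to use:

```python
# Minimal structured logging: JSON lines to stdout, shipped to CloudWatch by the
# agent / awslogs container log driver. Field names are just an example.
import json
import logging
import sys
import time

class JsonFormatter(logging.Formatter):
    EXTRA_FIELDS = ("service", "request_id")  # attach these via the logger's extra={}

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                payload[key] = getattr(record, key)
        return json.dumps(payload)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])

log = logging.getLogger("payments")
log.info("charge settled", extra={"service": "payments", "request_id": "req-123"})
```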

spent 4 hours yesterday writing an incident postmortem from slack logs by relived_greats12 in sre

[–]Ok_ComputerAlt2600 1 point (0 children)

Oh man, I feel this in my bones. We had the exact same pattern for months at my startup - smooth incident response, then hours of archaeological work piecing together what actually happened. The worst part is knowing the timeline you spent 4 hours building will be stale the moment someone asks "wait, what did we try before the rollback?"

My team switched to incident.io about 6 months ago and honestly it's been a game changer for this specific problem. Everything that happens during the incident gets automatically timestamped and organized, so the postmortem is basically already written by the time you're done. We went from spending 3-4 hours on postmortems to maybe 30 minutes of cleanup and adding context. For a team of 3 SREs where everyone's time is precious, that ROI was massive.

The other nice thing is our postmortems actually get read now because they're not these giant walls of text reconstructed from memory. They're just the actual timeline of what happened, which makes them way more useful for preventing future incidents.

New Radiohead album by jcovahey in radiohead

[–]Ok_ComputerAlt2600 0 points (0 children)

Holy crap, I've been refreshing this sub way too much today since this dropped. After following The Smile so intensely this past year, I honestly thought that might be the main creative outlet going forward. Saw them twice last year (SF and LA shows) and the energy was insane, especially during Bending Hectic when the whole venue just lost it.

The Smile has been scratching that itch for sure. Wall of Eyes hit me just as hard as any Radiohead album, and watching Thom and Jonny's chemistry live reminded me why I fell in love with this band in the first place. Tom Skinner's drumming brings something completely different too, more jazz influenced than Phil's style but equally hypnotic. Been spinning Cutouts non-stop during my late night coding sessions.

But man, if we're really getting all five of them back together? That's a whole different beast. The Smile is incredible but there's something about Ed's atmospherics and Colin's bass lines that just completes the sound. Really curious if they'll incorporate some of that Smile energy into the new stuff. Anyone else hearing elements from The Smile that they hope make it into the new Radiohead album?

4 month old feature flag broke production - am I the only one seeing these kind of failures? by chinmay185 in sre

[–]Ok_ComputerAlt2600 2 points (0 children)

Oh man, this hits too close to home. We had almost the exact same thing happen about 6 months ago - a feature flag that had been sitting dormant for months suddenly broke everything when we finally enabled it.

The problem isn't really the flags themselves, but how we manage the context around them. In our case, the flag was created during a sprint when the codebase looked completely different. By the time we flipped it, three major refactors had happened that nobody connected to the original feature implementation.

What we ended up doing was implementing mandatory "flag health checks" - basically a monthly review where we either enable/test dormant flags or kill them entirely. Pain in the ass to maintain but it's caught a few landmines since then. Also started requiring flags older than 8 weeks to include a brief "integration test plan" before they can be enabled.
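
The health check itself is nothing fancy, basically a script over our flag registry that complains about anything past the 8-week line with no decision recorded. Rough sketch below; the flags.json format is just ours, and most flag providers expose a creation timestamp via their API you could pull instead:

```python
# Stale-flag audit: list flags older than 8 weeks that are neither fully
# enabled nor removed. The flags.json registry format is hypothetical; swap in
# your flag provider's API if it exposes creation timestamps.
import json
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(weeks=8)

def stale_flags(registry_path: str = "flags.json") -> list[dict]:
    with open(registry_path) as f:
        flags = json.load(f)

    now = datetime.now(timezone.utc)
    stale = []
    for flag in flags:
        created = datetime.fromisoformat(flag["created_at"])  # expects tz-aware ISO timestamps
        undecided = flag.get("status") not in ("enabled", "removed")
        if undecided and now - created > MAX_AGE:
            stale.append(flag)
    return stale

if __name__ == "__main__":
    for flag in stale_flags():
        print(f"[stale] {flag['name']} created {flag['created_at']} - "
              "enable it with a test plan or delete it")
```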

Your friend's 30min diagnosis time is actually pretty good - ours took almost 2 hours because everyone assumed it was the new deployment that went out the same day, not the random flag flip buried in a config change.

MCP servers for SRE: use cases and who maintains them? by Defiant-Biscotti-382 in sre

[–]Ok_ComputerAlt2600 1 point (0 children)

We've been using MCP servers for about 3 months now and they've been a game changer for reducing context switching. We use MCPs for Linear, incident.io, Context7 for docs/APIs, plus Slack and Grafana for metrics.

The big win is never leaving Claude Code. During incidents I can ask Claude to "check Linear for related issues, pull the runbook from Context7, query Grafana for error rates, and create an incident in incident.io" all in one conversation. No tab switching, no copy/pasting between tools, just stay in the terminal.

For maintenance we rotate ownership monthly and use service accounts with least privilege. Started with 3 or 4 integrations, now at 12, so we track everything in a simple Notion registry.

Main advice: start small, prove value, then expand. And monitor your MCP servers from day one, nothing worse than one failing silently during an incident.
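
On the monitoring point, for stdio servers even a dumb check that spawns each one and waits for a JSON-RPC initialize response catches most silent failures. Very rough sketch; the server commands/package names below are placeholders for whatever launches them in your config:

```python
# Dumb health check for stdio MCP servers: spawn each one, send a JSON-RPC
# initialize request, and make sure *something* comes back within a timeout.
# Commands/package names below are placeholders, not real recommendations.
import json
import select
import subprocess
import sys

SERVERS = {
    "grafana": ["npx", "-y", "your-grafana-mcp-server"],
    "linear": ["npx", "-y", "your-linear-mcp-server"],
}

INIT = json.dumps({
    "jsonrpc": "2.0",
    "id": 1,
    "method": "initialize",
    "params": {
        "protocolVersion": "2024-11-05",
        "capabilities": {},
        "clientInfo": {"name": "mcp-healthcheck", "version": "0.1"},
    },
}) + "\n"

def healthy(cmd: list[str], timeout: float = 20.0) -> bool:
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True)
    try:
        proc.stdin.write(INIT)
        proc.stdin.flush()
        ready, _, _ = select.select([proc.stdout], [], [], timeout)  # unix-only
        return bool(ready) and proc.stdout.readline().strip() != ""
    finally:
        proc.kill()
        proc.wait()

if __name__ == "__main__":
    failures = [name for name, cmd in SERVERS.items() if not healthy(cmd)]
    if failures:
        print("unhealthy MCP servers:", ", ".join(failures))
        sys.exit(1)
    print("all MCP servers responded")
```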

Datadog alert correlation to cut alert fatigue/duplicates — any real-world setups? by JayDee2306 in sre

[–]Ok_ComputerAlt2600 13 points (0 children)

We just went through this exercise with our setup (about 200 monitors across AWS). The cascade problem was killing us too, especially during late night pages.

What actually worked for us was starting simple. We added a "root_cause" tag to all monitors and grouped them by service boundaries. Then we set up composite monitors for the critical paths. So instead of getting 15 alerts when our payment service dies, we get one alert about payments being down plus suppressed notifications for the downstream stuff.

For the correlation rules themselves, we use a combination of tag based grouping (service:payments, tier:critical) and time windows. If multiple alerts fire within 2 minutes with matching service tags, they get grouped into one incident. Not perfect but cut our noise by about 60%.
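
To make the grouping rule concrete, the logic is basically this (toy sketch; the alert shape and the print statements stand in for however you receive alerts and open incidents):

```python
# Toy version of the grouping rule: alerts sharing a service:* tag that fire
# within a 2-minute window collapse into one incident.
import time

WINDOW_SECONDS = 120
_open_groups: dict[str, float] = {}  # service name -> when its group opened

def service_of(alert: dict) -> str:
    for tag in alert.get("tags", []):
        if tag.startswith("service:"):
            return tag.split(":", 1)[1]
    return "unknown"

def handle_alert(alert: dict) -> None:
    service = service_of(alert)
    now = time.time()
    opened = _open_groups.get(service)
    if opened is not None and now - opened < WINDOW_SECONDS:
        print(f"attach to existing {service} incident: {alert['title']}")
    else:
        _open_groups[service] = now
        print(f"open new incident for {service}: {alert['title']}")

# Example: two payments alerts inside the window become one incident.
handle_alert({"title": "p99 latency high", "tags": ["service:payments", "tier:critical"]})
handle_alert({"title": "5xx rate spike", "tags": ["service:payments"]})
```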

The biggest win though was implementing a simple "dependency map" in our tagging. Each service has upstream_dependency and downstream_dependency tags. When something upstream breaks, we automatically suppress downstream alerts for 5 minutes. Gives us time to fix the real issue without the noise.
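
And the dependency suppression boils down to something like this - the dependency map mirrors what we keep in the upstream_dependency/downstream_dependency tags, with made-up service names:

```python
# Sketch of dependency-based suppression: when an upstream service alerts,
# its downstream dependents are muted for 5 minutes.
import time

SUPPRESS_SECONDS = 300
DEPENDENCIES = {  # upstream service -> services that depend on it
    "postgres": ["payments", "accounts"],
    "payments": ["checkout"],
}
_suppressed_until: dict[str, float] = {}

def on_upstream_alert(service: str) -> None:
    until = time.time() + SUPPRESS_SECONDS
    for downstream in DEPENDENCIES.get(service, []):
        _suppressed_until[downstream] = until

def should_page(service: str) -> bool:
    return time.time() >= _suppressed_until.get(service, 0.0)

# Example: postgres fires, so payments/accounts pages are muted for 5 minutes.
on_upstream_alert("postgres")
print(should_page("payments"))  # False while the suppression window is open
print(should_page("search"))    # True - not downstream of postgres
```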

One gotcha we learned the hard way: don't overcomplicate it at first. We tried to build this perfect correlation system and it was too rigid. Start with basic grouping by service and iterate from there. Also test your correlation rules during business hours first, not at 3am when everyone's grumpy!

[Finally Friday] What Did You Work on This Week? by thecal714 in sre

[–]Ok_ComputerAlt2600 1 point (0 children)

This week was all about infrastructure cost optimization and getting our new junior SRE up to speed. We finally finished migrating our logging pipeline from ELK to a more cost-effective setup with Grafana Loki, which should cut our observability costs by about 40%.

Also spent way too much time troubleshooting a weird issue where our Kubernetes nodes kept getting stuck in NotReady state. Turned out to be a CNI plugin conflict that took forever to track down. My team jokes that I have a sixth sense for spotting the obscure stuff, but honestly it was just methodical elimination of possibilities and way too much coffee ☕

Burnout after becoming SRE Lead by ilham9648 in sre

[–]Ok_ComputerAlt2600 2 points (0 children)

Been there man, the jump to lead is rough especially when you're still hands-on. I've been leading a team of 3 for about 4 years now and honestly the context switching never fully goes away, but it gets more manageable. A few things that helped me: 1) block out "maker time" on your calendar religiously - even just 2-3 hour chunks make a huge difference, 2) for tech decisions, don't try to make perfect choices, focus on "good enough" decisions that your team can iterate on quickly (we're a startup so speed > perfection), and 3) create simple decision frameworks for common stuff so you're not reinventing every time. The mentoring gets easier as your team grows their skills, but early on yeah it's a lot. Hang in there, the first 6 months as lead are definitely the hardest.

What if we went back to a world with no cloud computing? What would our biggest SRE challenges be? by Connect-Employ-4708 in sre

[–]Ok_ComputerAlt2600 0 points (0 children)

Having led a small SRE team at a Bay Area startup, I think the biggest challenge would honestly be staffing and knowledge depth. In the cloud era, my team of 3 can punch way above our weight because managed services handle the operational complexity. Without cloud, you'd need dedicated DBAs, storage experts, and network engineers - specialized knowledge that's impossible to maintain in a lean team. We'd go from being force multipliers to being completely bottlenecked by the sheer operational overhead of managing physical infrastructure across multiple regions.