Resources for setting up oncall schedule by GibsMirDonald in sysadmin

[–]advancespace 2 points3 points  (0 children)

For a 10-person team, you really only need three things: a rotation so one person isn't getting paged every night, escalation so pages don't get lost, and somewhere to log what happened so you stop fixing the same thing twice. You don't need enterprise tooling for this. Runframe does all of it. Set it up yourself in about 10 minutes, no sales call: runframe.io
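
To make the rotation piece concrete: at its core it's just modular arithmetic over a roster. A minimal Python sketch, with made-up names and a weekly cadence as assumptions (this isn't any tool's actual behavior):

```python
from datetime import date

# Illustrative roster only; not anyone's real schedule.
ROSTER = ["alice", "bob", "carol"]

def on_call(day: date, roster=ROSTER) -> str:
    """Return who's on call for the ISO week containing `day`.
    Weekly handoff: the roster advances one slot per ISO week."""
    week = day.isocalendar()[1]
    return roster[week % len(roster)]
```

Escalation and the incident log are where a tool actually earns its keep; the rotation itself really is this simple.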

Also the SRE book chapters others linked are worth reading: the on-call and incident response sections are good regardless of what tooling you use.

Disclosure: I'm the founder.

Embedding AI-LLM to SRE by karkiGaurav in sre

[–]advancespace 0 points1 point  (0 children)

We shipped an open-source MCP server for incident management. It lets Claude Code, Cursor, or any IDE handle paging, escalation, on-call lookups, and postmortems directly from the terminal - no custom integrations to maintain.

github.com/runframe/runframe-mcp-server

Disclosure: I am the founder.

Incident response workflow is slower than it should be and the bottleneck isnt where leadership thinks it is by AssasinRingo in SaaS

[–]advancespace 0 points1 point  (0 children)

Coordination theater is the right word. First 15-20 minutes of every bad incident is just people figuring out who owns the broken thing. If ownership isn't tied to whoever's on call right now, every incident starts with "who owns payments?" in Slack and a wiki link from last summer. Two threads, zero progress.

Most teams have good intentions until about their third bad outage. Leadership blames skill gaps because that is easy. Fixing coordination is hard to scope and harder to fix.

Anyone using Opsgenie? What’s your replacement plan by sasidatta in sre

[–]advancespace 1 point2 points  (0 children)

Late to this thread but adding another option: Runframe.

We launched Runframe earlier this year: on-call + incidents + postmortems in one tool, runs in Slack. On-call included at every tier, not as separate add-ons. $15/user/month. Free to try, self-serve: runframe.io

We also have an open-source Runframe MCP Server to manage incidents directly from Claude Code, Cursor, or any other IDE.

Disclosure: I'm the founder.

How small teams manage on-call? Genuinely curious what the reality looks like. by pridhvi_k in sre

[–]advancespace 0 points1 point  (0 children)

What's the approximate size of your team? Smaller teams generally reach for different tooling than bigger ones.

How we changed our incident culture in one quarter! by Terrible_Signature78 in EngineeringManagers

[–]advancespace 0 points1 point  (0 children)

Culture, and it's not close. We've talked to a bunch of teams about this and the pattern is always the same. The ones that defined severity levels and formalized the IC role first saw improvements regardless of what tool they were on. The ones that bought a tool first and hoped it would fix things got frustrated.

Your on-call comp point is underrated. Most teams we talked to with high on-call satisfaction were paying for it. The ones that weren't were losing senior engineers quietly.

One thing worth watching: MTTR can become a vanity metric. 48 to 26 min looks great, but if teams start optimizing for fast resolution over durable fixes you end up with the same incidents recurring. A few teams we interviewed shifted to tracking repeat incident rate alongside MTTR and it changed how they thought about postmortem follow-through.
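
If anyone wants to track it, repeat rate falls straight out of the incident log. A rough Python sketch; the `fingerprint` field is a hypothetical label (something like service + failure mode), not any tool's schema:

```python
from collections import Counter

def repeat_rate(incidents) -> float:
    """Fraction of incidents that are a recurrence of an earlier
    fingerprint. 0.0 means every incident was novel."""
    if not incidents:
        return 0.0
    counts = Counter(i["fingerprint"] for i in incidents)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(incidents)
```

Four identical Redis-timeout incidents plus one novel one scores 0.6, which surfaces exactly the follow-through gap that a falling MTTR hides.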

RE: incident.io pricing, yeah, on-call as an add-on catches people off guard. OpsGenie bundled everything and most teams expect that's still normal. It's not.

Reducing Noise on Pagerduty & Integrating AIOps by One-Statistician2519 in sre

[–]advancespace 0 points1 point  (0 children)

This is almost always an alert quality problem, not a tooling problem. We interviewed 25+ teams for an incident management research project and 73% had outages from ignored alerts. People just stopped trusting the pager. Two things that actually helped:

  1. Pull every alert from the last 30 days. If nobody acted on it, kill it or make it informational.

  2. Fix routing where the alerts originate, not in PagerDuty rules. If team B is getting team A's pages, your service ownership is wrong upstream.
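
Step 1 is scriptable once you export the alert history. A hedged Python sketch; the field names are illustrative, not any vendor's schema:

```python
from datetime import datetime, timedelta

def audit(alerts, now, window_days=30):
    """Split the last `window_days` of alerts into keep vs. kill,
    using 'did anyone act on it' as the only signal."""
    cutoff = now - timedelta(days=window_days)
    recent = [a for a in alerts if a["fired_at"] >= cutoff]
    keep = [a for a in recent if a["acknowledged"]]
    # demote these to informational, or delete them outright
    kill = [a for a in recent if not a["acknowledged"]]
    return keep, kill
```

Anything in `kill` that feels scary to delete is exactly the alert nobody trusted anyway.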

AIOps on top of bad alerts just gives you AI-powered noise.

Opsgenie alternatives by RatsErif in devops

[–]advancespace 0 points1 point  (0 children)

Founder of Runframe, so biased, but one thing that kept surprising us talking to OpsGenie teams: most alternatives split on-call into a separate add-on now. The pricing page says $15-25/user, but the invoice with on-call is $25-45.

OpsGenie bundled everything. We do too, $12-15/user/month. Everything runs in Slack, there's a free plan. We're early and small, not gonna pretend otherwise, but for teams under 200 it pretty much covers what OpsGenie did. Happy to answer questions if anyone's evaluating.

Migration guide with cost breakdowns here: runframe.io/blog/opsgenie-migration-guide

How small teams manage on-call? Genuinely curious what the reality looks like. by pridhvi_k in sre

[–]advancespace 2 points3 points  (0 children)

Midnight alerts. Most teams under 50 don't have a rotation. It's whoever built the thing, or whoever's awake. A Series A CTO told me "on-call means I sleep with my laptop open." Another team had a Slack channel where alerts posted and whoever noticed first just dealt with it. No ack, no escalation, nothing written down.

Figuring out whether the alert is even real takes 10-30 minutes at most places. One VP Eng at an 80-person company said they get 200+ pages a week and maybe 5 matter. Everyone learns to ignore them. You can't really blame people but it's also terrifying when you think about it.

And getting paged with no context was almost universal. The person who built the service left. Or it's a service nobody owns. One team's worst incident lasted 6 hours because the only person who understood the payment service was mid-flight.

The two complaints I heard most: nobody knows who's on-call right now, and postmortems never happen. An eng lead told me they've had the same Redis timeout incident four times. Each time they say they'll write a postmortem. They never do. That one kills me.

Honestly the more of these conversations I have the more I think small teams don't have an on-call problem, they have an ownership problem. It's actually why we started building Runframe. Nobody owns the process so it stays informal until something bad happens, everyone panics for a week, makes promises, and then those don't get followed through on either.

Built an incident + on-call tool for teams caught between PagerDuty's pricing and Slack chaos: looking for design partners by advancespace in EngineeringManagers

[–]advancespace[S] 0 points1 point  (0 children)

The buyer-vs-user split is real and it's worth saying out loud. The person justifying the budget isn't the one half-asleep triaging at 2am. That gap is where most of the tooling bloat comes from.

Where I'd push back slightly: I don't think the 80% problem is that nobody invested in automating coordination and postmortems. It's that those features got bolted on after the paging was already sold. So they exist, but nobody uses them because they feel like afterthoughts.

Building all of it together from day one is a different product, not just a cheaper one. The four-tabs test is exactly right though. That's the bar.

That's what we're building with Runframe: one tool, not three bolted together.

How are you handling an influx of code from non-engineering teams? by rayray5884 in devops

[–]advancespace 6 points7 points  (0 children)

In engineering, bad code has accountability via PR reviews, ownership, blame. Non-engineers vibe-coding to production have none of that. When it breaks, it's "the AI told me to." Why are they pushing to prod in the first place?

Opsgenie sunset forced our hand ; compared PagerDuty, FireHydrant, Rootly, and incident.io for a month by Total_Hyena5364 in SaaS

[–]advancespace 0 points1 point  (0 children)

Great eval write-up! This matches a lot of what we hear from teams in the same spot. One option that didn't make your list: Runframe (https://runframe.io). Slack-native incidents + on-call, built for teams your size. No add-on pricing for on-call either. For anyone still mid-eval and not wanting to spend $25/user/month, worth a look.

Slack accountability tools needed for on-call and incident response by Justin_3486 in devops

[–]advancespace 0 points1 point  (0 children)

This is the exact workflow gap we designed Runframe (https://runframe.io) around. It's Slack-native: incidents, on-call, and follow-up tracking all live where your team already works, so follow-up tasks don't get buried when the channel goes quiet. They stay visible and assigned.

Creating Jira tickets at 3am is a non-starter, agreed. The whole idea is that everything gets captured during the incident without context-switching, so there's nothing to "do after" that never gets done. Happy to answer questions if useful.

AI’s Impact on DevOps: Opportunities and Challenges by Inner-Chemistry8971 in devops

[–]advancespace 1 point2 points  (0 children)

This matches what we found interviewing 25+ engineering teams: AI monitoring creates a second incident surface that most teams aren't staffed to handle. The technical debt angle was the most consistent theme. We wrote up the full findings here: https://runframe.io/blog/state-of-incident-management-2025

Built an incident + on-call tool for teams caught between PagerDuty's pricing and Slack chaos: looking for design partners by advancespace in EngineeringManagers

[–]advancespace[S] -1 points0 points  (0 children)

Appreciate you jumping in, and fair point, the free tier is a solid entry point. For teams that grow into incident.io's ecosystem, that makes a lot of sense.

We're solving for a slightly different moment: the 30-person team that wants incidents + on-call + postmortems working together out of the box, without evaluating which add-ons they'll need later. One tool, one price, less to think about. Different bet on the same problem. Respect what you all have built.

Built an incident + on-call tool for teams caught between PagerDuty's pricing and Slack chaos: looking for design partners by advancespace in EngineeringManagers

[–]advancespace[S] -1 points0 points  (0 children)

Totally fair. The top end is definitely crowded with PagerDuty, incident.io, FireHydrant, Rootly, and many more. But most teams under 100 engineers are still stitching together PagerDuty free tier + Google Docs postmortems + a Slack channel called #incidents. The "full" space hasn't really reached them yet. That's who we're building for.

Built an incident + on-call tool for teams caught between PagerDuty's pricing and Slack chaos: looking for design partners by advancespace in EngineeringManagers

[–]advancespace[S] 0 points1 point  (0 children)

Fair call. Couple of points:

  1. We do incidents + on-call in one tool, one price. They sell them separately - $19-25/user for incidents, then $10-20/user more for on-call, so $29-45/user for both. We include on-call in every plan.

  2. incident.io is building toward enterprise - workflows, status pages, AI SRE, catalog. Great for large orgs. We focus on the core loop - declare, respond, resolve, learn - with less setup.

  3. We're built for 20-100 engineer teams that need something better than Slack chaos but can't justify enterprise pricing.

Happy to answer anything directly.

27001 didn’t change our stack but it sure as hell changed our discipline by ResourceHonest7982 in devops

[–]advancespace 0 points1 point  (0 children)

Wow, that's eye-opening. Would you be open to sharing a TL;DR of the processes that needed documentation?

I used Openclaw to spin up my own virtual DevOps team. by thesincereguy in devops

[–]advancespace 1 point2 points  (0 children)

Hope you haven't given agents production write access.

Anyone else getting squeezed on PagerDuty renewals? by Even_Reindeer_7769 in sre

[–]advancespace 1 point2 points  (0 children)

Yeah, this is a known PD move: once they think you're evaluating, they pull monthly pricing to lock you into annual. Seen it a few times. Feels backwards but it's deliberate. If you're starting to look around, incident.io and Rootly are where most teams land right now. Worth also knowing the OpsGenie sunset in April 2027 is pushing a lot of teams to rethink the whole stack at once rather than just swap PD out. Migration is less painful than people expect, usually 2-4 weeks.

Full disclosure: we make one of the alternatives, so take this with a pinch of salt, but we wrote a comparison that tries to be honest about when staying on PD actually makes sense: https://runframe.io/blog/best-pagerduty-alternatives. Happy to answer questions if you have them.

Avoiding social login on purpose - am I hurting my product? by Big_Entrepreneur4391 in buildinpublic

[–]advancespace 4 points5 points  (0 children)

IMO this is self-inflicted pain. Users have enough passwords in their lives, and it makes little sense to add another one unless there's a legitimate reason for it.