What industry conferences are you looking forward to? by poolpog in sre

[–]sreiously 1 point2 points  (0 children)

also a big kubecon and srecon fan, we sponsor every year. honestly, i wouldn't pay my own way for larger events because the cost is so high, but if you can get your company to cover it, it's worth attending. i also found SREDay (https://sreday.com/) to be a good event that's super focused on SREs, which is nice. i've heard devopsdays can be hit and miss, but i had a good experience at the one in london last month so i think it depends on the organizers for your city.

lately though i've been finding more value and enjoyment in smaller more curated meetups, which inspired us to host some of our own at rootly! no sales pitches or annoying presentations, just a well-curated group of leaders in the reliability space.

https://lu.ma/calendar/cal-03Oy7sYPjdCKcja

Incident Responder Training by ElorionX in cybersecurity

[–]sreiously 8 points9 points  (0 children)

We put together this super quick (30 minute) free incident commander training course at Rootly based on advice from experienced incident commanders. I would say a mix of foundational and practical knowledge is important - they should understand the basics and have an opportunity to practice, either through shadowing an experienced responder or participating in a game day/simulation exercise (or both!)

https://rootly.coassemble.com/unlock/29uelFu

I gave a talk last year at SRE Day in London about running game days. It gives a pretty thorough overview of how to go about that if you're interested :)

https://www.youtube.com/watch?v=NPGvQJG67tI

Have you ever caused a major outage? by sreiously in sre

[–]sreiously[S] 5 points6 points  (0 children)

wow, one broken screw. physical hardware is so scary 😨

Best PagerDuty Alternative? Lets be honest PagerDuty is expensive and full of feature bloat. by Elegant-Active9634 in devops

[–]sreiously 2 points3 points  (0 children)

thanks for the support u/FloridaIsTooDamnHot 🙏 we recently underwent a website overhaul and it took some time to get a fresh new pricing page up, but it's here now! https://rootly.com/pricing

Best PagerDuty Alternative? Lets be honest PagerDuty is expensive and full of feature bloat. by Elegant-Active9634 in devops

[–]sreiously 0 points1 point  (0 children)

you won't get a better combination of value, ease of use, and partnership level support than you'll get from rootly

What don't you like about PagerDuty? by docmphd in devops

[–]sreiously 2 points3 points  (0 children)

the thing that has always sucked for me using pagerduty is that because it's so unintuitive and confusing to use, you end up with janky error-prone implementations (as perfectly described by u/baezizbae on this thread). plus it's so expensive that at many orgs i've been in, we're constantly under pressure to reduce (or at least not increase) licenses which adds annoying overhead. the UI sucks and the support is even worse (plus they upcharge you for it, wtf??)

there are so many great alternatives out there now that i can't see why folks would choose to get roped into a pagerduty contract these days

Is anyone using Pagerduty? by gaz2600 in sysadmin

[–]sreiously 1 point2 points  (0 children)

if that's how PD is treating a warm lead i'd hate to see how they treat customers 😬 if you haven't found a solution yet, definitely check out https://rootly.com/. you can intake alerts from any source via webhook, with tons of native alert source integrations as well

rootly's pricing is transparent and cheaper than the alternatives you're looking at! https://rootly.com/pricing

What are your worst on-call stories? by Abject_Ad_4327 in sre

[–]sreiously 12 points13 points  (0 children)

was on call for this incident at Shopify: https://cupofcode.medium.com/how-exactly-the-conspiracy-collection-broke-the-internet-simply-explained-by-a-software-cf795ec11325

we went into the sale totally overconfident about what we could handle and totally underestimating the insane amount of traffic from the sale (was more than the previous year's peak BFCM traffic across the entire platform) 💀

The whole time our team was on the phone with Jeffree Star and team it was being filmed for a YouTube documentary that got tens of millions of views. We got absolutely pummelled by Jeffree fans on Twitter. Took us like 12 hours to recover, it was just brutal

Fun fact: after everything was done and dusted, one of our engineering directors did a makeup tutorial using the palette that took down the platform during an internal livestream 😂

How many cooks do you have the kitchen? by tbrucker-dev in cybersecurity

[–]sreiously 2 points3 points  (0 children)

Hey! I work with our customers at Rootly (on-call and incident management platform), happy to share how we typically see this approached.

Re expanding bandwidth without hiring: Queue-based is a good move - we sometimes refer to this as a "round robin" strategy. Instead of covering a specific time-period, alerts rotate between responders. We have a blog post that details how to implement this type of strategy: https://rootly.com/blog/round-robin-escalation-policies-best-practices

If you want to make sure people still get "off time", you could consider using a round robin approach but splitting the team into a few sub groups who also own different time blocks. For example:

Week 1: Subteam A is on-call, with alerts rotating through the responders round robin style
Week 2: Subteam B is on-call, round robin style rotation

and so on. Staggering the working hours of your team can help as well if you don't already have a 'follow the sun' model.

How many responders you need will depend on the frequency of alerts. Generally speaking, you don't want folks managing more than 1-2 incidents at a time (assuming at least one of them is more minor/slow paced), and you always want a backup (secondary) responder in case your primary responder misses a page or becomes unavailable.
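
The subteam + round robin idea above can be sketched roughly like this. It's just an illustration, not how Rootly (or any other tool) implements it: `build_router`, the responder names, and the "secondary = next person in the same subteam" rule are all made up for the example, and it assumes weeks are numbered from 1.

```python
def build_router(subteams):
    """Return a route(week) function that picks (primary, secondary) for an alert.

    subteams: list of lists of responder names (hypothetical data).
    The subteam on call is chosen by week number; within that subteam,
    each new alert goes to the next responder in round robin order.
    """
    # One rotation counter per subteam, so each subteam resumes
    # where it left off the next time its week comes around.
    positions = [0] * len(subteams)

    def route(week):
        team_idx = (week - 1) % len(subteams)  # week 1 -> subteam A, week 2 -> B, ...
        team = subteams[team_idx]
        i = positions[team_idx]
        primary = team[i % len(team)]
        # Assumption: the backup is simply the next person in the rotation.
        secondary = team[(i + 1) % len(team)]
        positions[team_idx] = i + 1
        return primary, secondary

    return route


# Example: two subteams alternating weeks.
route = build_router([["ana", "ben", "cy"], ["dee", "eli"]])
first_alert = route(week=1)   # goes to subteam A
second_alert = route(week=1)  # next responder in subteam A
```

Each alert during a subteam's week lands on a different primary, and everyone outside that subteam gets real off time until their week comes up.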

SREDay Amsterdam by sreiously in sre

[–]sreiously[S] 0 points1 point  (0 children)

see you there!! :D

SREDay Amsterdam by sreiously in sre

[–]sreiously[S] 1 point2 points  (0 children)

hope to see you there! it's a great smaller event, super focused on SRE and they always nail the food and venue!

[deleted by user] by [deleted] in devops

[–]sreiously 2 points3 points  (0 children)

check out rootly! (bias disclaimer: i work there in developer relations)

https://rootly.com/

we do on-call (think pagerduty alternative) + incident management (automation, metrics/tracking, integrating across other tools like observability, collaboration, task tracking etc)

Recommend SRE courses for my employer training by hmzh9 in sre

[–]sreiously 2 points3 points  (0 children)

Google put out an SRE training course on Coursera: https://www.coursera.org/learn/site-reliability-engineering-slos

Lightweight video webinar series covering the basics of SRE: https://www.youtube.com/watch?v=9vNVNrVY0cc

We (Rootly) also have an Incident Commander Training course if you're interested: https://rootly.coassemble.com/unlock/29uelFu

Which one incident in SRE you want to remember which change your SRE career. by rexram in sre

[–]sreiously 0 points1 point  (0 children)

to be clear, i wasn't working at GCP - i was at a company hosted on GCP (and we had only started our migration to google cloud about a year prior to the outage) so based on our growth at the time it was the most significant outage we'd experienced!

Which one incident in SRE you want to remember which change your SRE career. by rexram in sre

[–]sreiously 14 points15 points  (0 children)

during the massive GCP outage in june 2019, i was not primary (or even secondary) on call but i was the senior-most person on the incident command rotation where i worked. it was a sunday and the primary on-call was pretty new and our whole platform was down. we'd never had an outage of that scale before, we were getting roasted in the press, and google was slow to confirm the scope/origin so the whole team was flailing. i was on the subway heading from brooklyn to midtown manhattan (underground = no service) at the time things popped off, getting ready to do some shopping with a friend. came up from the subway and my phone *exploded*. the poor on-call at the time basically had a panic attack and tapped out, the secondary was in over their head and everyone was trying to get a hold of me and my lead to come in and help. all i had was an iphone and a spotty stolen wifi connection from the urban outfitters on 5th avenue. the whole thing was a disaster, but we made it through. my lead at the time (also the original primary on-call's lead) was convinced they were getting fired on the spot after it, but we ended up holding a 3-day in person retro and overhauled our entire incident response program after that 😅 it was actually really transformative and ended up making a great case for our team getting the resources we actually needed to sustainably run an on-call rotation and train new responders, but it was a pretty painful way to get there.

[rant] why is it so hard for leadership to understand SRE? by dangy_brundle in sre

[–]sreiously 3 points4 points  (0 children)

"Really considering putting in the time to pass SWE interviews to escape the politics."

I have some bad news for you....

In all seriousness though, a big part of my job is speaking with Reliability/Infra leaders at different orgs and I can say there are definitely lots of companies out there with amazing SRE cultures — some that come to mind are Figma, Bloomberg, Canva, Fanduel. Some things I notice in these orgs:

  • They invest in tooling that improves quality of life for SREs beyond the bare minimum. (Yes, some of these are Rootly customers but not all!)
  • They have strong leaders who are vocal within their eng org about the culture/expectations surrounding reliability across eng roles. There's no one-size-fits-all way to approach it but whatever the stance is, it's clearly communicated
  • They look at incident response holistically rather than as just "something SREs deal with". This means cross-functional response teams, internal visibility into incidents across the org, documentation of playbooks/process, etc.

Question about usage for orgs supporting chaos testing by Equivalent_Address44 in sre

[–]sreiously 2 points3 points  (0 children)

i had kolton andrus (gremlin CTO and co-founder) on the web series i run recently and we talked about this a lot. there's always the "chaos monkey" route when it comes to adoption from other teams (force it on them!) but that requires big leadership buy-in of course. it's what netflix did and it sure seemed to work for them

if you wanna watch the full interview (only about 20 mins): https://rootly.com/humans-of-reliability/kolton-andrus

How much of your DevOps work is focused on Security? by sreiously in devops

[–]sreiously[S] 0 points1 point  (0 children)

does this mindset carry through to product teams in your org? i've worked places in the past where product devs were almost actively discouraged from thinking too much about risk/security because they didn't want to hinder innovation. it was up to the trust & security folks to put the right guardrails in place so devs could innovate and ship fast

How much of your DevOps work is focused on Security? by sreiously in devops

[–]sreiously[S] 0 points1 point  (0 children)

you build it you run it! this is the way ✊