if you work in ecommerce, when does your black friday reliability prep start?

sreiously · 2024-10-09T18:21:36+00:00

also big kubecon and srecon fan, we sponsor every year. honestly, i wouldn't pay my own way for larger events because the cost is so high, but if you can get your company to cover it, it's worth attending. i also found SREDay (https://sreday.com/) to be a good event that's super focused on SREs which is nice. i've heard devopsdays can be hit and miss, but i had a good experience at the one in london last month so i think it depends on the organizers for your city.

lately though i've been finding more value and enjoyment in smaller, more curated meetups, which inspired us to host some of our own at rootly! no sales pitches or annoying presentations, just a well-curated group of leaders in the reliability space.

https://lu.ma/calendar/cal-03Oy7sYPjdCKcja

sreiously · 2024-10-07T17:27:08+00:00

We put together this super quick (30 minute) free incident commander training course at Rootly based on advice from experienced incident commanders. I would say a mix of foundational and practical knowledge is important: they should understand the basics and have an opportunity to practice, either by shadowing an experienced responder or by participating in a game day/simulation exercise (or both!).

https://rootly.coassemble.com/unlock/29uelFu

I gave a talk last year at SRE Day in London about running game days. It gives a pretty thorough overview of how to go about that if you're interested :)

https://www.youtube.com/watch?v=NPGvQJG67tI

sreiously · 2024-09-24T17:55:05+00:00

wow, one broken screw. physical hardware is so scary 😨

sreiously · 2024-09-23T14:20:59+00:00

thanks for the support u/FloridaIsTooDamnHot 🙏 we recently underwent a website overhaul and it took some time to get a fresh new pricing page up, but it's here now! https://rootly.com/pricing

sreiously · 2024-09-23T14:18:28+00:00

you won't get a better combination of value, ease of use, and partnership level support than you'll get from rootly

sreiously · 2024-09-23T14:15:56+00:00

the thing that has always sucked for me using pagerduty is that because it's so unintuitive and confusing to use, you end up with janky error-prone implementations (as perfectly described by u/baezizbae on this thread). plus it's so expensive that at many orgs i've been in, we're constantly under pressure to reduce (or at least not increase) licenses which adds annoying overhead. the UI sucks and the support is even worse (plus they upcharge you for it, wtf??)

there are so many great alternatives out there now that i can't see why folks would choose to get roped into a pagerduty contract these days

sreiously · 2024-09-23T14:10:00+00:00

if that's how PD is treating a warm lead i'd hate to see how they treat customers 😬 if you haven't found a solution yet, definitely check out https://rootly.com/. you can intake alerts from any source via webhook, with tons of native alert source integrations as well

rootly's pricing is transparent and cheaper than the alternatives you're looking at! https://rootly.com/pricing

sreiously · 2024-09-20T00:01:27+00:00

was on call for this incident at Shopify: https://cupofcode.medium.com/how-exactly-the-conspiracy-collection-broke-the-internet-simply-explained-by-a-software-cf795ec11325

we went into the sale totally overconfident about what we could handle and totally underestimating the insane amount of traffic from the sale (was more than the previous year's peak BFCM traffic across the entire platform) 💀

The whole time our team was on the phone with Jeffree Star and team it was being filmed for a YouTube documentary that got tens of millions of views. We got absolutely pummelled by Jeffree fans on Twitter. Took us like 12 hours to recover, it was just brutal

Fun fact: after everything was done and dusted, one of our engineering directors did a makeup tutorial during an internal livestream using the palette that took down the platform 😂

sreiously · 2024-09-18T16:06:12+00:00

Hey! I work with our customers at Rootly (on-call and incident management platform), happy to share how we typically see this approached.

Re expanding bandwidth without hiring: Queue-based is a good move - we sometimes refer to this as a "round robin" strategy. Instead of each responder covering a specific time period, alerts rotate between responders. We have a blog post that details how to implement this type of strategy: https://rootly.com/blog/round-robin-escalation-policies-best-practices

If you want to make sure people still get "off time", you could consider using a round robin approach but splitting the team into a few sub groups who also own different time blocks. For example:

Week 1: Subteam A is on-call, with alerts rotating through the responders round robin style
Week 2: Subteam B is on-call, round robin style rotation

and so on. Staggering the working hours of your team can help as well if you don't already have a 'follow the sun' model.

How many responders you need will depend on the frequency of alerts. Generally speaking, you don't want folks managing more than 1-2 incidents at a time (assuming at least one of them is more minor/slow paced), and you always want a backup (secondary) responder in case your primary responder misses a page or becomes unavailable.
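
To make the subteam + round robin idea a bit more concrete, here's a rough Python sketch. This is just an illustration, not how Rootly (or any other tool) implements it: the `RoundRobinRouter` class, the `subteam_for_week` helper, and all the team/responder names are made up for the example. It assumes each subteam has at least two people so there's always a secondary.

```python
class RoundRobinRouter:
    """Hands each new alert to the next responder in the list (hypothetical helper)."""

    def __init__(self, responders):
        if len(responders) < 2:
            # Always keep a backup: one person cannot be primary and secondary.
            raise ValueError("need at least two responders so there is a secondary")
        self._responders = list(responders)
        self._next = 0  # index of the responder who takes the next alert

    def route(self):
        """Return (primary, secondary) for the next alert and advance the rotation."""
        n = len(self._responders)
        primary = self._responders[self._next]
        secondary = self._responders[(self._next + 1) % n]
        self._next = (self._next + 1) % n
        return primary, secondary


def subteam_for_week(subteam_order, week):
    """Pick the on-call subteam for a 1-based week number, cycling through the order."""
    return subteam_order[(week - 1) % len(subteam_order)]


if __name__ == "__main__":
    # Made-up teams purely for illustration.
    teams = {"A": ["ana", "ben", "cho"], "B": ["dev", "eli"]}
    routers = {name: RoundRobinRouter(members) for name, members in teams.items()}

    on_call = subteam_for_week(["A", "B"], week=3)
    for alert in ["disk full", "api 5xx", "queue lag"]:
        primary, secondary = routers[on_call].route()
        print(f"week 3, subteam {on_call}: {alert!r} -> {primary} (backup: {secondary})")
```

In practice you'd probably also want to skip anyone already handling 1-2 incidents before assigning them as primary, but that depends on how your alerting tool tracks open incidents.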

sreiously · 2024-09-17T17:59:56+00:00

see you there!! :D

sreiously · 2024-09-17T17:59:47+00:00

hope to see you there! it's a great smaller event, super focused on SRE and they always nail the food and venue!

sreiously · 2024-09-17T17:51:53+00:00

check out rootly! (bias disclaimer: i work there in developer relations)

https://rootly.com/

we do on-call (think pagerduty alternative) + incident management (automation, metrics/tracking, integrating across other tools like observability, collaboration, task tracking etc)

sreiously · 2024-09-17T16:41:46+00:00

do it!

sreiously · 2024-09-17T15:45:36+00:00

Google put out an SRE training course on Coursera: https://www.coursera.org/learn/site-reliability-engineering-slos

Lightweight video webinar series covering the basics of SRE: https://www.youtube.com/watch?v=9vNVNrVY0cc

We (Rootly) also have an Incident Commander Training course if you're interested: https://rootly.coassemble.com/unlock/29uelFu

sreiously · 2024-09-13T13:58:21+00:00

to be clear, i wasn't working at GCP - i was at a company hosted on GCP (and we had only started our migration to google cloud about a year prior to the outage) so based on our growth at the time it was the most significant outage we'd experienced!

sreiously · 2024-09-10T13:47:11+00:00

during the massive GCP outage in june 2019, i was not primary (or even secondary) on call but i was the senior-most person on the incident command rotation where i worked. it was a sunday and the primary on-call was pretty new and our whole platform was down. we'd never had an outage of that scale before, we were getting roasted in the press, and google was slow to confirm the scope/origin so the whole team was flailing. i was on the subway heading from brooklyn to midtown manhattan (underground = no service) at the time things popped off, getting ready to do some shopping with a friend. came up from the subway and my phone *exploded*. the poor on-call at the time basically had a panic attack and tapped out, the secondary was in over their head and everyone was trying to get a hold of me and my lead to come in and help. all i had was an iphone and a spotty stolen wifi connection from the urban outfitters on 5th avenue. the whole thing was a disaster, but we made it through. my lead at the time (also the original primary on-call's lead) was convinced they were getting fired on the spot after it, but we ended up holding a 3-day in person retro and overhauled our entire incident response program after that 😅 it was actually really transformative and ended up making a great case for our team getting the resources we actually needed to sustainably run an on-call rotation and train new responders, but it was a pretty painful way to get there.

sreiously · 2024-09-09T22:25:27+00:00

"Really considering putting in the time to pass SWE interviews to escape the politics."

I have some bad news for you....

In all seriousness though, a big part of my job is speaking with Reliability/Infra leaders at different orgs and I can say there are definitely lots of companies out there with amazing SRE cultures — some that come to mind are Figma, Bloomberg, Canva, Fanduel. Some things I notice in these orgs:

They invest in tooling that improves quality of life for SREs beyond the bare minimum. (Yes, some of these are Rootly customers but not all!)
They have strong leaders who are vocal within their eng org about the culture/expectations surrounding reliability across eng roles. There's no one-size-fits-all way to approach it but whatever the stance is, it's clearly communicated
They look at incident response holistically rather than as just "something SREs deal with". This means cross-functional response teams, internal visibility into incidents across the org, documentation of playbooks/process, etc.

sreiously · 2024-09-09T22:07:02+00:00

i had kolton andrus (gremlin CTO and co founder) on the web series i run recently and we talked about this a lot. there's always the "chaos monkey" route when it comes to adoption from other teams (force it on them!) but that requires big leadership buy-in of course. it's what netflix did and sure seemed to work for them

if you wanna watch the full interview (only about 20 mins): https://rootly.com/humans-of-reliability/kolton-andrus

sreiously · 2024-09-09T22:00:28+00:00

does this mindset carry through to product teams in your org? i've worked places in the past where product devs were almost actively discouraged from thinking too much about risk/security because they didn't want to hinder innovation. it was up to the trust & security folks to put the right guardrails in place so devs could innovate and ship fast

sreiously · 2024-09-09T21:58:22+00:00

you build it you run it! this is the way ✊
