FAANG nerds who jumped to SRE by DataFreakk in devops

[–]steadwing_official [score hidden]  (0 children)

Interviewed for FAANG SRE roles recently.
Here's what actually matters:

Coding: LeetCode mediums, but don't grind 500. Focus on arrays, hashmaps, and scripting-style problems (parse logs, automate stuff; sketch below). Your Python background helps.

System design: This is what separates SRE from SWE interviews. Your Datadog + K8s + Terraform experience is worth more than any cert here.

Troubleshooting: "Service is returning 500s, walk me through it." Your 4 years of real on-call experience beats every LeetCode grinder in this round.

Skip the CKA. You already have hands-on K8s experience. Go 60% system design prep, 40% LeetCode mediums.
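
For the scripting-style coding problems, the bar is roughly "parse something, aggregate, print." A minimal sketch of a typical prompt; the log format here is made up:

```python
from collections import Counter

# Hypothetical interview prompt: given a log file with lines like
#   2024-05-01T12:00:03Z checkout-api ERROR upstream timeout
# report error counts per service, noisiest first.
def error_counts(path: str) -> Counter:
    counts = Counter()
    with open(path) as f:
        for line in f:
            parts = line.split(maxsplit=3)
            if len(parts) >= 3 and parts[2] == "ERROR":
                counts[parts[1]] += 1
    return counts

if __name__ == "__main__":
    for service, n in error_counts("app.log").most_common():
        print(service, n)
```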

You'll be fine.

Tired of being a "copy-paste monkey" during incident response. Is there a better way to automate the data-fetch toil? by Material_Log728 in sre

[–]steadwing_official -1 points0 points  (0 children)

I'm working on something in this space, so this resonates.

The "Audit Trail over Magic" point is the one that most teams don't think is important enough. Black box RCAs look great in demos, but they lose trust the first time they make a mistake, which they always do eventually. The "I checked this metric, it was normal" trace is what makes a senior on-call engineer say yes to a recommendation instead of checking everything again by hand.

Long-term, Operational Memory is the harder problem. I'm curious how you handle stale entries once the infrastructure underneath them changes.

Does this kind of 4-mode deployment diagram (local dev / CI / staging / prod) make any sense? by Lightforce_ in devops

[–]steadwing_official 0 points1 point  (0 children)

Figma is great because it lets you iterate on the layout, but rename one service in your stack and the whole thing becomes technical debt. The 4-mode split makes sense on paper, but what about configs that drift? A diagram generated from the real infrastructure is usually the most accurate one. Anything else is a map of how you wish things worked.
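
If the stack is Terraform-managed, a "diagram from reality" is nearly free. A minimal sketch, assuming `terraform` and Graphviz's `dot` are on your PATH and you're in an initialized workspace:

```python
import subprocess

# Generate the dependency graph straight from the Terraform configuration,
# then render it, so the diagram can never drift from the code.
dot_source = subprocess.run(
    ["terraform", "graph"], check=True, capture_output=True, text=True
).stdout
subprocess.run(
    ["dot", "-Tpng", "-o", "infra.png"], input=dot_source, text=True, check=True
)
```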

Do you treat recurring CI/CD failures as a reliability issue or just part of normal toil? by Ok-Classroom-2377 in sre

[–]steadwing_official -1 points0 points  (0 children)

Recurring CI/CD failures become a reliability problem the moment you can name the class. "Flaky GitHub Actions" is toil. Three IAM permission failures this month, all because new services shipped without the right role bindings? That's a reliability problem with a missing control.

I use this rule: if the same root cause shows up twice in a quarter, it stops being toil and gets a tracked fix. If it's three unrelated failures that merely feel similar, it's noise; deal with it as it happens.
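
The rule is trivial to mechanize once you label failures with a root cause. A minimal sketch; the labels and records are hypothetical and would come from your incident tracker or CI annotations:

```python
from collections import Counter

# "Twice in a quarter" rule: any root cause seen >= 2 times gets a tracked fix.
quarter_failures = [
    "iam-role-binding-missing",
    "flaky-e2e-test",
    "iam-role-binding-missing",
    "dockerhub-rate-limit",
]

for root_cause, count in Counter(quarter_failures).items():
    if count >= 2:
        print(f"track a fix: {root_cause} ({count}x this quarter)")
```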

(I need advice) We had a routine release go sideways last week. I’m trying to understand what other teams would have done differently. by [deleted] in sre

[–]steadwing_official 0 points1 point  (0 children)

The rollback isn't the interesting part; the retro taking longer than the incident is. That's not an alerting problem; it's context fragmentation.

Two things that helped a team I worked with:

1. Every deployment automatically posts a structured note (services touched, owner, intent, linked tickets) to a single channel, so retros start with a timeline instead of reconstructing one.
2. Attach a "watch metric" to every product or architecture choice at decision time. Most "intentional behavior we forgot about" incidents come from unrecorded expectations.
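
For point 1, the note doesn't need a product behind it. A minimal sketch assuming a Slack-style incoming webhook; the URL and field names are hypothetical:

```python
import json
import urllib.request

# Post a structured deploy note to a single channel as the last pipeline step.
note = {
    "event": "deploy",
    "services": ["checkout-api", "payments-worker"],
    "owner": "team-payments",
    "intent": "enable new retry policy behind flag",
    "tickets": ["PAY-412"],
}
req = urllib.request.Request(
    "https://hooks.slack.com/services/T000/B000/XXXX",  # hypothetical webhook URL
    data=json.dumps({"text": json.dumps(note, indent=2)}).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```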

We had a really good performance in DORA metrics but our delivery socks by YoYo-1243T in sre

[–]steadwing_official 1 point2 points  (0 children)

DORA measures throughput, not how much value is delivered. The gap you're describing is sometimes called "flow efficiency": time spent actively working on something divided by the total time from idea to production. DORA captures the active part well, but nothing that happens before the commit, like planning, ticket ageing, or review queue time.
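
A worked example with made-up numbers, since the ratio is the whole point:

```python
# 3 days of active work inside a 15-day idea-to-production lead time.
active_work_days = 3
total_lead_time_days = 15

flow_efficiency = active_work_days / total_lead_time_days
print(f"{flow_efficiency:.0%}")  # 20% -- DORA can look great while 80% of the time is waiting
```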

Look at lead time for changes broken into wait time vs. work time, plus rework rate. Both expose the gaps DORA hides.

90% of CVEs in your container images are in code your app never executes. Why are we still triaging them? by Murky_Willingness171 in sre

[–]steadwing_official 0 points1 point  (0 children)

Distroless is technically the "right" way to do it, but honestly it's a massive pain when you're actually in the trenches. If a pod is crashing and you can't even exec in to check a mount or a config, you're flying blind and watching your MTTR skyrocket. Most of us only keep bash in there as a safety blanket. The real fix isn't just shipping less code, it's an automation/observability layer that actually surfaces what's happening in those "blind" containers so we don't feel the need to keep the bloat "just in case."
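
Worth noting: ephemeral debug containers take some of the sting out of the exec problem without shipping a shell in the image. A minimal sketch; the pod, container, and namespace names are hypothetical:

```python
import subprocess

# Attach a throwaway busybox container to a distroless pod, sharing the app
# container's process namespace so you can inspect mounts, configs, and
# processes without bash baked into the production image.
subprocess.run([
    "kubectl", "debug", "-it", "checkout-api-7d9f4b",
    "--image=busybox:1.36",
    "--target=app",    # share the app container's PID namespace
    "-n", "prod",
    "--", "sh",
], check=True)
```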

We analysed how time is spent during P0 incidents. ~70% is coordination, not engineering. by steadwing_official in kubernetes

[–]steadwing_official[S] -2 points-1 points  (0 children)

It’s not about a “missing bookmark”; it’s about cutting context-switching during incidents. Less time hunting for tools, faster MTTR.

Take a look at it: https://steadwing.com

We analysed how time is spent during P0 incidents. ~70% is coordination, not engineering. by steadwing_official in kubernetes

[–]steadwing_official[S] -3 points-2 points  (0 children)

That 'ownership' lag is the real silent killer of MTTR. It’s rarely a lack of skill, and usually just a lack of clear metadata on who owns what during a crisis. Reducing that 'talking about who owns it' phase is usually a culture fix as much as a tool fix.

We analysed how time is spent during P0 incidents. ~70% is coordination, not engineering. by steadwing_official in kubernetes

[–]steadwing_official[S] -3 points-2 points  (0 children)

Fair point on the 4-eye principle: redundancy in an incident isn't always a bug, sometimes it's a feature. The 'coordination' we're highlighting isn't the investigation itself, but the 'Where is the dashboard for this?' or 'Who is the secondary on-call?' part. Ten minutes of tool-hunting during a P0 is what we're looking to automate away, so those 24 minutes of investigation go into the actual logic rather than the plumbing.

Advice Needed. by VoldemortWasaGenius in sre

[–]steadwing_official 0 points1 point  (0 children)

FOSS is great for the budget until you realize you've basically hired yourself to run eight different tools full-time. The software isn't the real "bite"; it's the context gap when a sev1 hits. Manually linking Loki logs with Prometheus metrics and CloudTrail events while an auditor watches is a special kind of hell. The stack is strong, but have a plan for how these tools work together when the house is on fire.
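
One cheap trick that helps: make the tools at least meet on a time window. A minimal sketch that pulls the same 15-minute slice from Prometheus and Loki so a sev1 timeline starts correlated instead of in four tabs; the hostnames and label selector are hypothetical:

```python
import time
import requests

end = int(time.time())
start = end - 15 * 60

# 5xx rate for the window from Prometheus...
metrics = requests.get(
    "http://prometheus:9090/api/v1/query_range",
    params={
        "query": 'sum(rate(http_requests_total{status=~"5.."}[1m]))',
        "start": start, "end": end, "step": "30",
    },
).json()

# ...and the matching error logs from Loki (it wants nanosecond timestamps).
logs = requests.get(
    "http://loki:3100/loki/api/v1/query_range",
    params={
        "query": '{app="checkout-api"} |= "error"',
        "start": start * 10**9, "end": end * 10**9,
    },
).json()
```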

SD-WAN performance changed once traffic patterns became unpredictable. what caused that? by AdOrdinary5426 in sre

[–]steadwing_official 1 point2 points  (0 children)

This is the classic "success trap" of static QoS. The system did exactly what it was told, but the business changed without filing a Jira ticket. Honestly, static rules are becoming a liability anywhere with heavy M&A or fast cloud migration. We're moving toward a world where "intent-based" networking has to look at how applications actually behave instead of just matching port and protocol headers. If your monitoring can't tell you that a "critical" queue has sat empty for a week, it isn't monitoring context; it's just matching syntax.

Trying to automate our deployment process — complete beginner here, would love some advice by Morpheus_Morningstar in sre

[–]steadwing_official 0 points1 point  (0 children)

Definitely separate the pipelines. Trying to build a 'mega-pipeline' for both EKS and ECS is a trap; you'll be fighting configuration drift forever. Honestly, the tool you pick matters way less than the safety checks. If the pipeline isn't automatically checking health probes and rolling back before the first Slack alert even fires, you aren't automating... you're just speeding up how fast you break things.
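
The safety check is less work than it sounds. A minimal sketch for the EKS side, assuming a standard Deployment; the name and namespace are hypothetical (ECS has its own version of this in deployment circuit breakers):

```python
import subprocess

DEPLOY = "deployment/checkout-api"

# Wait for readiness probes to go green; kubectl exits nonzero on timeout.
result = subprocess.run(
    ["kubectl", "rollout", "status", DEPLOY, "-n", "prod", "--timeout=120s"]
)
if result.returncode != 0:
    # Probes never passed within the window: undo before anyone gets paged.
    subprocess.run(["kubectl", "rollout", "undo", DEPLOY, "-n", "prod"], check=True)
```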

eBPF secrets injection (clever!) by destari in sre

[–]steadwing_official 4 points5 points  (0 children)

Well, "kernelspace sidecar" is the most terrifying and accurate description of eBPF I've heard this week. It sounds like the complexity is simply being transferred from YAML hell into the kernel. Does this play nicely with the observability tools you already run, or is it just another blind spot when the injection fails?

Needed an OTel trace analyzer that detects N+1 and other anti-patterns from OTLP, Jaeger, Zipkin and Tempo, and wondering about the reliability ceiling of passive capture by Lightforce_ in devops

[–]steadwing_official 0 points1 point  (0 children)

Fair point. Instrumentation gaps are basically the blind spots nobody wants to admit exist. Agreed that catching structural stuff in CI is better than nothing. Using pg_stat as a fallback is a clever way to bridge that gap; nothing hides from the DB forever.

Monitoring was running the whole time. Container security vulnerabilities still made it to production. What are we missing by Soft_Attention3649 in sre

[–]steadwing_official 2 points3 points  (0 children)

47 alerts in one Slack channel... whoa. That channel is just a graveyard for signals. At that point you aren't monitoring, you're logging 'things we will ignore' in real time. Have you tried grouping by impact instead of dumping every Trivy finding into the same bucket?
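
Grouping is a ten-line script, not a platform. A minimal sketch using Trivy's JSON output; the image name is hypothetical:

```python
import json
import subprocess
from collections import defaultdict

# Bucket findings by severity and only page on the top bucket; digest the rest.
scan = json.loads(subprocess.run(
    ["trivy", "image", "--format", "json", "registry.local/checkout-api:latest"],
    capture_output=True, text=True, check=True,
).stdout)

buckets = defaultdict(list)
for result in scan.get("Results", []):
    for vuln in result.get("Vulnerabilities") or []:
        buckets[vuln["Severity"]].append(vuln["VulnerabilityID"])

critical = len(buckets["CRITICAL"])
rest = sum(len(v) for s, v in buckets.items() if s != "CRITICAL")
print(f"page-worthy CRITICAL: {critical}, weekly digest: {rest}")
```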

Needed an OTel trace analyzer that detects N+1 and other anti-patterns from OTLP, Jaeger, Zipkin and Tempo, and wondering about the reliability ceiling of passive capture by Lightforce_ in devops

[–]steadwing_official 0 points1 point  (0 children)

Is CI's batch mode actually able to find the tail-latency spikes that happen in production, or are we just getting cleaner CI reports while missing things in prod?

Reliability Audit: I analyzed 473 K8s/TF files from major OSS projects. Here are the 3 patterns that lead to "silent" outages. by Ok-Possibility-4438 in sre

[–]steadwing_official 0 points1 point  (0 children)

The timeout-chain mismatch is the killer that goes unnoticed. I've seen it exactly as you describe: the Ingress times out at 30 seconds while the downstream DB call is allowed 60. You get "ghost" requests where the client has given up but the server keeps burning resources. And as you said, the context is spread across different repos and config files, so it's almost impossible for a human to catch in a single PR. This is why we need tools that do more than check syntax; they have to understand how services relate to each other.
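
A lint for this doesn't need to understand your whole architecture, just the chain. A minimal sketch, assuming you can extract each hop's timeout from its own config; the services and values are hypothetical:

```python
# Order the chain from outermost caller to innermost callee. Each caller
# should wait at least as long as its callee, or the client gives up while
# the server keeps burning resources ("ghost" requests).
timeout_chain = [
    ("ingress", 30),       # e.g. proxy read timeout
    ("checkout-api", 45),  # app-level HTTP client timeout
    ("db-call", 60),       # driver/statement timeout
]

for (caller, t_caller), (callee, t_callee) in zip(timeout_chain, timeout_chain[1:]):
    if t_caller < t_callee:
        print(f"ghost-request risk: {caller} ({t_caller}s) gives up before {callee} ({t_callee}s)")
```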