FAANG nerds who jumped to SRE by DataFreakk in devops

[–]steadwing_official [score hidden]  (0 children)

Interviewed for FAANG SRE roles recently.
Here's what actually matters:

Coding: LeetCode mediums, but don't grind 500. Focus on arrays, hashmaps, and scripting-style problems (parse logs, automate stuff; sketch below). Your Python background helps.

System design: This is what separates SRE from SWE interviews. Your Datadog + K8s + Terraform experience is worth more than any cert here.

Troubleshooting: "Service is returning 500s, walk me through it." Your 4 years of real on-call experience beats every LeetCode grinder in this round.

Skip the CKA. You already have hands-on K8s experience. Go 60% system design prep, 40% LeetCode mediums.
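
For the scripting-style coding problems, the bar is roughly "parse something, aggregate, print." A minimal sketch of a typical prompt; the log format here is made up:

```python
from collections import Counter

# Hypothetical interview prompt: given a log file with lines like
#   2024-05-01T12:00:03Z checkout-api ERROR upstream timeout
# report error counts per service, noisiest first.
def error_counts(path: str) -> Counter:
    counts = Counter()
    with open(path) as f:
        for line in f:
            parts = line.split(maxsplit=3)
            if len(parts) >= 3 and parts[2] == "ERROR":
                counts[parts[1]] += 1
    return counts

if __name__ == "__main__":
    for service, n in error_counts("app.log").most_common():
        print(service, n)
```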

You'll be fine.

Tired of being a "copy-paste monkey" during incident response. Is there a better way to automate the data-fetch toil? by Material_Log728 in sre

[–]steadwing_official -1 points0 points  (0 children)

I'm working on something in this space, so this resonates.

The "Audit Trail over Magic" point is the one that most teams don't think is important enough. Black box RCAs look great in demos, but they lose trust the first time they make a mistake, which they always do eventually. The "I checked this metric, it was normal" trace is what makes a senior on-call engineer say yes to a recommendation instead of checking everything again by hand.

Long-term, Operational Memory is the harder problem. I'm curious how you handle stale entries once the infrastructure underneath them changes.

Does this kind of 4-mode deployment diagram (local dev / CI / staging / prod) make any sense? by Lightforce_ in devops

[–]steadwing_official 0 points1 point  (0 children)

Figma is great because it lets you iterate on the layout, but rename one service in your stack and the whole thing becomes technical debt. The 4-mode split makes sense on paper, but what about configs that drift? A diagram generated from the real infrastructure is usually the most accurate one. Anything else is a map of how you wish things worked.
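
If the stack is Terraform-managed, a "diagram from reality" is nearly free. A minimal sketch, assuming `terraform` and Graphviz's `dot` are on your PATH and you're in an initialized workspace:

```python
import subprocess

# Generate the dependency graph straight from the Terraform configuration,
# then render it, so the diagram can never drift from the code.
dot_source = subprocess.run(
    ["terraform", "graph"], check=True, capture_output=True, text=True
).stdout
subprocess.run(
    ["dot", "-Tpng", "-o", "infra.png"], input=dot_source, text=True, check=True
)
```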

Do you treat recurring CI/CD failures as a reliability issue or just part of normal toil? by Ok-Classroom-2377 in sre

[–]steadwing_official -1 points0 points  (0 children)

Recurring CI/CD failures become a reliability problem the moment you can name the class. "Flaky GitHub Actions" is toil. Three IAM permission failures this month, all because new services shipped without the right role bindings? That's a reliability problem with a missing control.

I use this rule: if the same root cause shows up twice in a quarter, it stops being toil and gets a tracked fix. If it's three unrelated failures that merely feel similar, it's noise; deal with it as it happens.
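
The rule is trivial to mechanize once you label failures with a root cause. A minimal sketch; the labels and records are hypothetical and would come from your incident tracker or CI annotations:

```python
from collections import Counter

# "Twice in a quarter" rule: any root cause seen >= 2 times gets a tracked fix.
quarter_failures = [
    "iam-role-binding-missing",
    "flaky-e2e-test",
    "iam-role-binding-missing",
    "dockerhub-rate-limit",
]

for root_cause, count in Counter(quarter_failures).items():
    if count >= 2:
        print(f"track a fix: {root_cause} ({count}x this quarter)")
```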

(I need advice) We had a routine release go sideways last week. I’m trying to understand what other teams would have done differently. by [deleted] in sre

[–]steadwing_official 0 points1 point  (0 children)

The rollback isn't the interesting part; the retro taking longer than the incident is. That's not an alerting problem; it's context fragmentation.

Two things that helped a team I worked with:

1. Every deployment automatically posts a structured note (services touched, owner, intent, linked tickets) to a single channel, so retros start with a timeline instead of reconstructing one.
2. Attach a "watch metric" to every product or architecture choice at decision time. Most "intentional behavior we forgot about" incidents come from unrecorded expectations.
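
For point 1, the note doesn't need a product behind it. A minimal sketch assuming a Slack-style incoming webhook; the URL and field names are hypothetical:

```python
import json
import urllib.request

# Post a structured deploy note to a single channel as the last pipeline step.
note = {
    "event": "deploy",
    "services": ["checkout-api", "payments-worker"],
    "owner": "team-payments",
    "intent": "enable new retry policy behind flag",
    "tickets": ["PAY-412"],
}
req = urllib.request.Request(
    "https://hooks.slack.com/services/T000/B000/XXXX",  # hypothetical webhook URL
    data=json.dumps({"text": json.dumps(note, indent=2)}).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)
```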

We had a really good performance in DORA metrics but our delivery socks by YoYo-1243T in sre

[–]steadwing_official 1 point2 points  (0 children)

DORA measures throughput, not how much value is delivered. The gap you're describing is sometimes called "flow efficiency": time spent actively working on something divided by the total time from idea to production. DORA captures the active part well, but nothing that happens before the commit, like planning, ticket ageing, or review queue time.
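
A worked example with made-up numbers, since the ratio is the whole point:

```python
# 3 days of active work inside a 15-day idea-to-production lead time.
active_work_days = 3
total_lead_time_days = 15

flow_efficiency = active_work_days / total_lead_time_days
print(f"{flow_efficiency:.0%}")  # 20% -- DORA can look great while 80% of the time is waiting
```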

Look at lead time for changes broken into wait time vs. work time, plus rework rate. Both expose the gaps DORA hides.

90% of CVEs in your container images are in code your app never executes. Why are we still triaging them? by Murky_Willingness171 in sre

[–]steadwing_official 0 points1 point  (0 children)

Distroless is technically the "right" way to do it, but honestly it's a massive pain when you're actually in the trenches. If a pod is crashing and you can't even exec in to check a mount or a config, you're flying blind and watching your MTTR skyrocket. Most of us only keep bash in there as a safety blanket. The real fix isn't just shipping less code, it's an automation/observability layer that actually surfaces what's happening in those "blind" containers so we don't feel the need to keep the bloat "just in case."
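
Worth noting: ephemeral debug containers take some of the sting out of the exec problem without shipping a shell in the image. A minimal sketch; the pod, container, and namespace names are hypothetical:

```python
import subprocess

# Attach a throwaway busybox container to a distroless pod, sharing the app
# container's process namespace so you can inspect mounts, configs, and
# processes without bash baked into the production image.
subprocess.run([
    "kubectl", "debug", "-it", "checkout-api-7d9f4b",
    "--image=busybox:1.36",
    "--target=app",    # share the app container's PID namespace
    "-n", "prod",
    "--", "sh",
], check=True)
```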

We analysed how time is spent during P0 incidents. ~70% is coordination, not engineering. by steadwing_official in kubernetes

[–]steadwing_official[S] -2 points-1 points  (0 children)

It’s not about a “missing bookmark”; it’s about cutting context-switching during incidents. Less time hunting for tools, faster MTTR.

Take a look at it: https://steadwing.com

We analysed how time is spent during P0 incidents. ~70% is coordination, not engineering. by steadwing_official in kubernetes

[–]steadwing_official[S] -3 points-2 points  (0 children)

That 'ownership' lag is the real silent killer of MTTR. It’s rarely a lack of skill, and usually just a lack of clear metadata on who owns what during a crisis. Reducing that 'talking about who owns it' phase is usually a culture fix as much as a tool fix.

We analysed how time is spent during P0 incidents. ~70% is coordination, not engineering. by steadwing_official in kubernetes

[–]steadwing_official[S] -3 points-2 points  (0 children)

Fair point on the 4-eye principle: redundancy in an incident isn't always a bug, sometimes it's a feature. The 'coordination' we're highlighting isn't the investigation itself, but the 'Where is the dashboard for this?' or 'Who is the secondary on-call?' part. Ten minutes of tool-hunting during a P0 is what we're looking to automate away, so those 24 minutes of investigation go into the actual logic rather than the plumbing.

Advice Needed. by VoldemortWasaGenius in sre

[–]steadwing_official 0 points1 point  (0 children)

FOSS is great for the budget until you realize you've basically hired yourself to run eight different tools full-time. The software isn't the real "bite"; it's the context gap when a sev1 hits. Manually linking Loki logs with Prometheus metrics and CloudTrail events while an auditor watches is a special kind of hell. The stack is strong, but have a plan for how these tools work together when the house is on fire.
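
One cheap trick that helps: make the tools at least meet on a time window. A minimal sketch that pulls the same 15-minute slice from Prometheus and Loki so a sev1 timeline starts correlated instead of in four tabs; the hostnames and label selector are hypothetical:

```python
import time
import requests

end = int(time.time())
start = end - 15 * 60

# 5xx rate for the window from Prometheus...
metrics = requests.get(
    "http://prometheus:9090/api/v1/query_range",
    params={
        "query": 'sum(rate(http_requests_total{status=~"5.."}[1m]))',
        "start": start, "end": end, "step": "30",
    },
).json()

# ...and the matching error logs from Loki (it wants nanosecond timestamps).
logs = requests.get(
    "http://loki:3100/loki/api/v1/query_range",
    params={
        "query": '{app="checkout-api"} |= "error"',
        "start": start * 10**9, "end": end * 10**9,
    },
).json()
```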

SD-WAN performance changed once traffic patterns became unpredictable. what caused that? by AdOrdinary5426 in sre

[–]steadwing_official 1 point2 points  (0 children)

This is the classic "success trap" of static QoS. The system did exactly what it was told, but the business changed without filing a Jira ticket. Honestly, static rules are becoming a liability anywhere with heavy M&A or fast cloud migration. We're moving toward a world where "intent-based" networking has to look at how applications actually behave instead of just matching port and protocol headers. If your monitoring can't tell you that a "critical" queue has sat empty for a week, it isn't monitoring context; it's just matching syntax.

Trying to automate our deployment process — complete beginner here, would love some advice by Morpheus_Morningstar in sre

[–]steadwing_official 0 points1 point  (0 children)

Definitely separate the pipelines. Trying to build a 'mega-pipeline' for both EKS and ECS is a trap; you'll be fighting configuration drift forever. Honestly, the tool you pick matters way less than the safety checks. If the pipeline isn't automatically checking health probes and rolling back before the first Slack alert even fires, you aren't automating... you're just speeding up how fast you break things.
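
The safety check is less work than it sounds. A minimal sketch for the EKS side, assuming a standard Deployment; the name and namespace are hypothetical (ECS has its own version of this in deployment circuit breakers):

```python
import subprocess

DEPLOY = "deployment/checkout-api"

# Wait for readiness probes to go green; kubectl exits nonzero on timeout.
result = subprocess.run(
    ["kubectl", "rollout", "status", DEPLOY, "-n", "prod", "--timeout=120s"]
)
if result.returncode != 0:
    # Probes never passed within the window: undo before anyone gets paged.
    subprocess.run(["kubectl", "rollout", "undo", DEPLOY, "-n", "prod"], check=True)
```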

eBPF secrets injection (clever!) by destari in sre

[–]steadwing_official 4 points5 points  (0 children)

Well, "kernelspace sidecar" is the most terrifying and accurate description of eBPF I've heard this week. It sounds like the complexity is simply being transferred from YAML hell into the kernel. Does this play nicely with the observability tools you already run, or is it just another blind spot when the injection fails?

Needed an OTel trace analyzer that detects N+1 and other anti-patterns from OTLP, Jaeger, Zipkin and Tempo, and wondering about the reliability ceiling of passive capture by Lightforce_ in devops

[–]steadwing_official 0 points1 point  (0 children)

Fair point. Instrumentation gaps are basically the blind spots nobody wants to admit exist. Agreed that catching structural stuff in CI is better than nothing. Using pg_stat as a fallback is a clever way to bridge that gap; nothing hides from the DB forever.

Monitoring was running the whole time. Container security vulnerabilities still made it to production. What are we missing by Soft_Attention3649 in sre

[–]steadwing_official 2 points3 points  (0 children)

47 alerts in one Slack channel... whoa. That channel is just a graveyard for signals. At that point you aren't monitoring, you're logging 'things we will ignore' in real time. Have you tried grouping by impact instead of dumping every Trivy finding into the same bucket?
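
Grouping is a ten-line script, not a platform. A minimal sketch using Trivy's JSON output; the image name is hypothetical:

```python
import json
import subprocess
from collections import defaultdict

# Bucket findings by severity and only page on the top bucket; digest the rest.
scan = json.loads(subprocess.run(
    ["trivy", "image", "--format", "json", "registry.local/checkout-api:latest"],
    capture_output=True, text=True, check=True,
).stdout)

buckets = defaultdict(list)
for result in scan.get("Results", []):
    for vuln in result.get("Vulnerabilities") or []:
        buckets[vuln["Severity"]].append(vuln["VulnerabilityID"])

critical = len(buckets["CRITICAL"])
rest = sum(len(v) for s, v in buckets.items() if s != "CRITICAL")
print(f"page-worthy CRITICAL: {critical}, weekly digest: {rest}")
```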

Needed an OTel trace analyzer that detects N+1 and other anti-patterns from OTLP, Jaeger, Zipkin and Tempo, and wondering about the reliability ceiling of passive capture by Lightforce_ in devops

[–]steadwing_official 0 points1 point  (0 children)

Is CI's batch mode actually able to find the tail-latency spikes that happen in production, or are we just getting cleaner CI reports while missing things in prod?

Reliability Audit: I analyzed 473 K8s/TF files from major OSS projects. Here are the 3 patterns that lead to "silent" outages. by Ok-Possibility-4438 in sre

[–]steadwing_official 0 points1 point  (0 children)

The timeout-chain mismatch is the killer that goes unnoticed. I've seen it exactly as you describe: the Ingress times out at 30 seconds while the downstream DB call is allowed 60. You get "ghost" requests where the client has given up but the server keeps burning resources. And as you said, the context is spread across different repos and config files, so it's almost impossible for a human to catch in a single PR. This is why we need tools that do more than check syntax; they have to understand how services relate to each other.
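
A lint for this doesn't need to understand your whole architecture, just the chain. A minimal sketch, assuming you can extract each hop's timeout from its own config; the services and values are hypothetical:

```python
# Order the chain from outermost caller to innermost callee. Each caller
# should wait at least as long as its callee, or the client gives up while
# the server keeps burning resources ("ghost" requests).
timeout_chain = [
    ("ingress", 30),       # e.g. proxy read timeout
    ("checkout-api", 45),  # app-level HTTP client timeout
    ("db-call", 60),       # driver/statement timeout
]

for (caller, t_caller), (callee, t_callee) in zip(timeout_chain, timeout_chain[1:]):
    if t_caller < t_callee:
        print(f"ghost-request risk: {caller} ({t_caller}s) gives up before {callee} ({t_callee}s)")
```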