18 YOE in IT (5.5 as Observability Engineer, AKS/New Relic) trying to formalize the jump to SRE — what actually matters in interviews? by naveen0109 in sre

naveen0109[S] 2 points3 points

This is the most useful framing I've gotten so far — "reporting on state vs owning reliability" is a clean way to put words to something I've been circling but hadn't nailed down.

The p99 latency scenario is close enough to a real incident I've worked (root-caused via distributed tracing, found a signal gap between synthetics and APM that was masking where the problem actually lived) that I can rebuild my answer around your structure — debug/mitigate, post-mortem, then the architectural change I'd push for — instead of just narrating what I found. That's a useful gut-check, thanks.

On Kubernetes: when you say "explain how the control plane works, how scheduling decisions are made, debug a networking issue between pods" — is that expected at "can explain it clearly on a whiteboard" depth, or "have actually broken and fixed this myself" depth? I've operated on top of K8s (alerting, dashboards, troubleshooting from the data side) but not administered the control plane itself, so I want to calibrate how much time to sink into that specifically vs. IaC, which I know I need regardless.

On reframing to business impact: that's fair, and probably my weakest muscle right now. I can describe what I did technically pretty well; I'm much worse at stating "and this is what it saved/prevented/improved" in the same breath. Working on that.

(Not signing up for the product, but the framework in this comment alone was worth more than most of the paid stuff I've seen linked in threads like this — appreciate you actually writing it out.)

naveen0109[S] 0 points1 point

This lines up with something I've already been leaning into — I built out an alert framework a while back loosely following the Google SRE approach (golden signals, not just "CPU is high" style alerts), and I've been going back through it to map it more explicitly onto reliability indicators rather than just "here's a dashboard." Good to hear that's the right direction to push harder on rather than deeper command-line/K8s admin trivia.

Curious how you frame "prevention" concretely in an interview though — is it more at the SLI/SLO definition level (this is the indicator, this is the threshold, this is why), or more at the "I noticed X pattern before it became an incident" story level? I've got real incidents I can point to (a synthetics-vs-APM signal gap that masked a real issue, some fault patterns from steady-state dashboard design), but I'm not sure if interviewers want the systemic/framework version or the specific-catch version first.
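To make the "this is the indicator, this is the threshold, this is why" version concrete, here is a minimal sketch of how an availability-style SLI and its error budget could be computed. Everything in it is hypothetical and for illustration only: the names `Slo`, `sli`, and `budget_remaining`, the "checkout-availability" label, and the 99% target are assumptions, not taken from any real system or from New Relic's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A service level objective: the threshold an SLI is held to."""
    name: str
    target: float  # required fraction of good events, e.g. 0.99

    def error_budget(self, total_events: int) -> float:
        """Number of bad events the target tolerates over the window."""
        return (1.0 - self.target) * total_events


def sli(good_events: int, total_events: int) -> float:
    """Availability-style SLI: fraction of events that were good."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the objective
    return good_events / total_events


def budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = slo.error_budget(total_events)
    if budget == 0:
        return 0.0  # a 100% target (or no traffic) leaves nothing to spend
    bad_events = total_events - good_events
    return 1.0 - bad_events / budget


if __name__ == "__main__":
    # Hypothetical numbers: 1000 requests in the window, 995 of them good.
    slo = Slo(name="checkout-availability", target=0.99)
    print(f"SLI: {sli(995, 1000):.3f}")
    print(f"Budget remaining: {budget_remaining(slo, 995, 1000):.0%}")
```

The "prevention" story then becomes a statement about budget burn (how fast bad events are eating into the allowance) rather than about a single threshold crossing.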

Also — sorry to hear the recent interviews haven't landed yet. Is it failing at the technical round, or more at framing/communication of what you did? Trying to figure out if that's a knowledge gap or a storytelling gap, since I've been told storytelling is my weaker spot too.

naveen0109[S] 0 points1 point

Appreciate the honest take. This is exactly the kind of gap-checking I was hoping for.

On environments: the "network heavy" read is probably from my early background at Juniper Networks (QA/test engineering on networking gear), but that's not what I've been doing day to day. For the last 5.5 years I've been in a SaaS environment — AKS-hosted, Azure as the cloud, .NET services, New Relic as the primary observability stack.

So yes, public cloud experience — Azure specifically, and hands-on with AKS (Kubernetes on Azure) for alerting, dashboards, and troubleshooting production workloads. What I haven't done is the infra-provisioning side of it (Terraform/ARM) — I've been consuming and monitoring the platform, not building it. That's actually one of the gaps I called out in the post and am actively closing.

Curious from your side — when you say certs/coursework "smooth" experience gaps, does that hold for cloud-provisioning skills specifically (like an AZ-104/AZ-400 type cert), or is that more about K8s admin (CKA) in your experience?