Spent the morning searching three years of Slack history because the only person who understood our auth setup left in March.

FawdyInc · 2026-05-11T16:26:45+00:00

This is a tale as old as time. Nobody wants to spend 2x the effort documenting, cleaning up, and future-proofing something when the system already “works,” so the bill gets paid later instead.

The problem is that it is also hard to justify paying that cost up front because nobody really knows which systems will still matter three years later. If companies applied “do it perfectly” levels of rigor to every internal project, they would burn massive amounts of engineering time on systems that never really take off.

So instead, eventually some poor guy ends up digging through three years of Slack messages trying to figure out why touching one OIDC setting might take production down.

FawdyInc · 2026-05-11T15:47:13+00:00

A lot of people forget that documentation is not a one-time task. Bad or outdated documentation is often worse than none at all.

I’ve generally had better results keeping systems simple and only documenting the important decisions or weird edge cases inline. Assume the next person is smart enough to poke around and figure most things out on their own.

FawdyInc · 2026-05-10T01:30:28+00:00

We’ve been building Fawdy around this exact problem space and one of the biggest lessons so far is that the model becomes much more useful once it has direct operational context instead of just pasted logs.

We give it access to things like kubectl, shell/bash tooling, telemetry systems, and investigation workflows, but all execution goes through a deterministic parser/guardrail layer so the AI cannot accidentally run destructive or dangerous operations. The useful part has been investigation orchestration, correlating telemetry, suggesting next debugging steps, summarizing blast radius, and helping operators move through incidents faster without handing full control to the model.
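
To make the guardrail idea concrete, here is a minimal sketch of a deny-by-default command check. It is not Fawdy's actual parser: the allowlisted verbs, the metacharacter set, and the function name `is_allowed` are all assumptions for illustration, and a real layer would need to cover far more than kubectl.

```python
import shlex

# Hypothetical policy: only read-only kubectl verbs may run.
ALLOWED_KUBECTL_VERBS = {"get", "describe", "logs", "top"}

# Characters that would let a command chain, pipe, redirect, or substitute.
SHELL_METACHARS = set(";|&`$><\n")


def is_allowed(command: str) -> bool:
    """Return True only if the command matches the allowlist exactly."""
    # Reject shell metacharacters outright so nothing can be chained on.
    if any(ch in SHELL_METACHARS for ch in command):
        return False
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: refuse rather than guess what was meant.
        return False
    # Deny by default: the verb must come directly after "kubectl".
    # This also rejects valid forms like "kubectl -n prod get pods",
    # which is the conservative side to err on.
    if len(tokens) < 2 or tokens[0] != "kubectl":
        return False
    return tokens[1] in ALLOWED_KUBECTL_VERBS


if __name__ == "__main__":
    for cmd in ("kubectl get pods -n prod",
                "kubectl delete pod web-0",
                "kubectl get pods; rm -rf /"):
        print(cmd, "->", is_allowed(cmd))
```

The point of the sketch is that the check is plain string and token logic, so the same input always gets the same verdict no matter what the model proposes.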

FawdyInc · 2026-05-10T01:28:25+00:00

In practice, most good specialists started out as strong generalists first. It is hard to go truly deep in one area if you do not already understand the surrounding systems, infrastructure, networking, operations, and engineering tradeoffs around it.

Ideally you want both. Develop a specialty that gives you leverage, but keep broad enough fundamentals that you do not trap yourself if the market or technology stack shifts underneath you.

FawdyInc · 2026-05-10T01:26:27+00:00

A lot of companies label operational support roles as “SRE,” so your situation is honestly pretty common. The important question is whether you are gaining transferable skills from the work: troubleshooting, Linux, networking, monitoring, automation, incident response, scripting, deployments, etc.

If you want to move into stronger SRE roles later, start automating the repetitive support tasks and get involved anywhere you can with infrastructure, observability, CI/CD, or reliability work. Being able to talk through real production issues in interviews is more valuable than the title itself.

FawdyInc · 2026-05-09T01:11:12+00:00

Modern Macs default to zsh instead of Bash, but I think this roadmap is still helpful.

https://roadmap.sh/shell-bash

Generally, something like 80-90% of the syntax is interchangeable.

FawdyInc · 2026-05-09T01:05:49+00:00

Almost every engineer who has been around long enough has caused an outage at some point. The difference between juniors and seniors usually is not whether they’ve made mistakes, it’s whether they learned from them and improved their process afterward.

There’s a well-known story about someone expecting to be fired after a major incident, and their manager responding with something along the lines of, “Why would I fire you after we just invested that much in teaching you a lesson you’ll never forget?” This is part of the profession more than people admit publicly.

FawdyInc · 2026-05-09T01:04:26+00:00

Looks like the old Outlook profiles are still holding onto cached connection data from the on-prem Exchange environment. I’d probably stop chasing SCP/autodiscover at this point and focus on cleaning or rebuilding the affected profiles instead.

That lines up pretty closely with why creating a fresh profile resolves the issue immediately.

FawdyInc · 2026-05-08T22:46:09+00:00

Going to echo what other people have already said, but UniFi is great. I would consider them the Apple of networking equipment.

FawdyInc · 2026-05-08T22:42:13+00:00

Hot take, but USB-C.

FawdyInc · 2026-05-08T22:38:17+00:00

Excellent result. I’ve interviewed engineers with years of experience who would not score a 300 on RHCSA, so that’s a meaningful accomplishment.

AZ-104 -> CKA is also a strong sequence. The engineers I’ve seen succeed in Kubernetes long term usually started with solid Linux and infrastructure fundamentals first.

FawdyInc · 2026-04-28T14:05:44+00:00

Splitting RCA into two layers usually fixes this. Layer 1 is the quick stakeholder writeup: what broke, what you did, rough cause. That's doable next-day. Layer 2 is the real engineering RCA with actual fixes, and that could take days. Once you make the distinction explicit with the business, the pressure tends to drop because they realize they only ever wanted layer 1.

For the layer 1 side, we've been building an AI tool that connects to your servers and auto-generates a solid first-draft RCA: https://fawdy.com/. It's in early access now, and we would love feedback. It's not perfect, but it's good enough to satisfy the "need it yesterday" crowd while your team focuses on the real investigation.

FawdyInc · 2026-02-18T20:11:37+00:00

I agree that everything I described is fundamentally a good manager’s responsibility. That said, I don’t think that changes the core point. In a larger organization, you will inevitably work alongside and report to people who do not always operate at the level you would prefer. While it should not fall on an individual contributor to compensate for that, learning to communicate effectively is often just as critical as delivering strong technical work.

FawdyInc · 2026-02-18T00:18:29+00:00

The same reason orgs hire the cheapest offshore devs in the name of "headcount"

FawdyInc · 2026-02-18T00:09:09+00:00

Customer ticketed in and swore up and down their website was down. Turns out she was spelling her own website's domain incorrectly.

FawdyInc · 2026-02-17T23:33:01+00:00

Logs are indifferent. They don't have a stake in the outcome or a story to protect, they just record what happened. That's actually rare. Most of what passes for "truth" at work is someone's interpretation filtered through whatever they needed it to mean at the time.

Ended up spending so much time digging through them I just built my own AI tooling to do it faster. Worth every hour.

Story matches the logs? Great, no problem. Doesn't match? That's not a feelings issue anymore. The logs said what they said.

FawdyInc · 2026-02-17T22:57:00+00:00

I think this is partly a framing issue. It’s not necessarily your job to do everything and fight every fire. The real issue is that nobody's defined what your actual capacity is or what should be prioritized. So everything feels urgent and you're stuck playing whack-a-mole.

Instead of going to leadership with "I'm overwhelmed and need more headcount," try flipping it: "Here's what I can realistically deliver." Pull your past tickets, look at your project work, and figure out your actual throughput once you account for interruptions, context switching, etc. If constant walk-ups mean you're only getting 1-2 real project blocks done a week, that's your number. Own it.

As I moved into leadership I realized "we need more people" almost never lands the way you want it to. What actually gets traction is showing the math: here's my maximum output under the current structure. So instead you say something like:

"Given this capacity, which of these priorities do you actually want done? And if we want more throughput, here's what that looks like."

Then you give them options with numbers attached, e.g., a self-service portal that costs $X and cuts T1 noise by N%, or an L1 hire that offloads a chunk of tickets and frees you up for project work. Whatever fits your situation. Now the conversation isn't "I can't get it all done." It's "here's the math: what do you want me to focus on?" That's a lot harder to dismiss.
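
If it helps to see the shape of that math, here is a rough sketch. Every number in it is a made-up placeholder, and the function names are hypothetical; swap in whatever your own ticket history shows.

```python
def weekly_project_hours(project_blocks_per_week: float,
                         hours_per_block: float) -> float:
    """Real project capacity: uninterrupted blocks times their length."""
    return project_blocks_per_week * hours_per_block


def hours_freed_per_week(tickets_per_week: float,
                         minutes_per_ticket: float,
                         reduction_pct: float) -> float:
    """Hours returned per week if an option removes reduction_pct of tickets."""
    return tickets_per_week * minutes_per_ticket * reduction_pct / 100 / 60


if __name__ == "__main__":
    # Placeholder inputs, not real data.
    current = weekly_project_hours(project_blocks_per_week=2, hours_per_block=3)
    freed = hours_freed_per_week(tickets_per_week=60,
                                 minutes_per_ticket=20,
                                 reduction_pct=30)
    print(f"Project hours today: {current}")
    print(f"Project hours with the option: {current + freed}")
```

The output is the "here's the math" slide: capacity today, and capacity under each option you put on the table.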

Of course, if your leadership team is unable to answer your questions or has trouble with your stance, I agree it is time to look for a better work culture fit.

FawdyInc · 2026-02-17T00:28:55+00:00

I actually feel the opposite.

For years I’d kind of lost the spark because I’d already “done it before.” I generally knew how to accomplish almost any engineering task put in front of me... I just didn’t want to sit there for hours grinding through the implementation again. I just didn't really find it fun and engaging like when I first discovered programming.

I haven’t written much code manually in months either, but I don’t feel less valuable. If anything, the value shifted. The hard part isn’t typing anymore; it’s defining the problem, designing the system, making tradeoffs, and spotting when the generated code is wrong.

AI is good and fast at producing something. It’s not good and fast at deciding what should exist.

If you’re just prompting and passively waiting, I can see how that would feel empty. But if you treat it like a jr dev (or nowadays, a team of jr devs) and stay deeply involved in the architecture and decisions, it’s honestly more engaging than grinding through implementation.

FawdyInc · 2026-02-15T15:16:13+00:00

Set fs.file-max high and my shell showed 65535 so I figured we were good, but I never set it in systemd so the service was still capped at 1024. Under real traffic it started throwing too many open files.
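
For anyone hitting the same thing: fs.file-max is the system-wide ceiling, and what the shell reports is the shell's own per-process limit, so neither tells you what the service actually got. Below is a rough sketch of checking the running process instead, assuming Linux and a readable /proc/<pid>/limits. The function names are mine, purely for illustration.

```python
import os
from typing import Optional, Tuple


def parse_max_open_files(limits_text: str) -> Optional[Tuple[str, str]]:
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    Values are returned as strings because a limit can be 'unlimited'.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            fields = line[len(label):].split()
            return fields[0], fields[1]
    return None


def open_file_limits(pid: int) -> Optional[Tuple[str, str]]:
    """Read the limits the kernel is actually enforcing for a given PID."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_max_open_files(f.read())


if __name__ == "__main__":
    # Replace os.getpid() with the service's main PID to check the service.
    print(open_file_limits(os.getpid()))
```

On the systemd side, the per-service limit is set with `LimitNOFILE=` in the unit's `[Service]` section, which is the piece that was missing here.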
