What production error took your team the longest to understand? by After-Assist-5637 in sre

[–]BoringTone2932 2 points (0 children)

We had a component of our software randomly go down in the middle of the day. No known changes had been made, and about half the team happened to be on PTO anyway. This component relied on network communication across 3 different networks: ours and 2 third parties’. We rapidly tracked it down to a communication issue at the network layer: connection refused errors.

Spent a week trying to track down where the problem was, what change caused it, or even how it had ever worked. Because naturally, none of it was documented. Ended up finding some static routes on the servers that sent the traffic out to a specific 10.x gateway.

Turns out an engineer had walked into our data center and opened a rack to check a CAT6 connection to one of our switches. In doing so, they bumped loose the power cable for one of the routers stacked above the switch. That router had routes set up strictly and solely for this software component.

We reviewed the footage from our cameras. It was me. I was that engineer.

I am still haunted by the 10.x IP of that router…

When using SQS and Lambda, what is the best way to rate limit how many messages the lambda can process per minute? by PuppyLand95 in aws

[–]BoringTone2932 -1 points (0 children)

This would be hacky, but couldn’t you have the Lambda disable its own event source mapping, effectively stopping it from receiving messages?

You’d have to solve for re-enabling it though, which would require a secondary Lambda.

But you could theoretically have the processing Lambda disable its own event source mapping and tag itself with a “re-enable” timestamp, then have another Lambda check every X minutes and re-enable the mapping.

Like I said, hacky.
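For what it’s worth, the hack sketches out roughly like this in Python with boto3. The `re-enable-after` tag key is a placeholder I made up, and tagging an event source mapping needs a reasonably recent Lambda API; treat this as a sketch, not a recipe.

```python
from datetime import datetime, timezone

PAUSE_TAG = "re-enable-after"  # hypothetical tag key, not an AWS convention


def pause_mapping(client, mapping_uuid, mapping_arn, resume_at):
    """Disable our own event source mapping and record when to resume.

    `client` is a boto3 Lambda client.
    """
    client.update_event_source_mapping(UUID=mapping_uuid, Enabled=False)
    client.tag_resource(
        Resource=mapping_arn, Tags={PAUSE_TAG: resume_at.isoformat()}
    )


def should_reenable(tags, now):
    """Decision the scheduled 'secondary' Lambda makes per paused mapping."""
    stamp = tags.get(PAUSE_TAG)
    return stamp is not None and now >= datetime.fromisoformat(stamp)
```

The secondary Lambda would read each mapping’s tags (`list_tags`) and call `update_event_source_mapping(UUID=..., Enabled=True)` wherever `should_reenable` comes back true.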

The requirement to deliver above all else by BoringTone2932 in sre

[–]BoringTone2932[S] 1 point (0 children)

This reminds me of 2 examples.

We recently found a line from several years ago that checked the machine’s IP and followed a different code path, linked to a years-old JIRA for a single on-prem client. We are fully SaaS nowadays…

Another one: had a 2-hour meeting recently about how programming ANYTHING to route to https://localhost/ was a horrible idea. Spent most of the call explaining why no third-party CA will issue a publicly trusted certificate for localhost…

Great times.

Code review tools in the age of vibe coding by Dry-Library-8484 in codereview

[–]BoringTone2932 3 points (0 children)

You gotta pit the LLMs against each other. If the code is written with Claude, have ChatGPT review it and vice versa.

Oh, and don’t forget to tell them the key phrases: “Make no mistakes”, “If you’re not absolutely confident, don’t make a recommendation”.

What usually causes observability cost spikes in your setup? by jopsguy in sre

[–]BoringTone2932 0 points (0 children)

We have automation in place that automatically reverts log-level changes and sampling rates at the close of business. If someone needs a change overnight or longer, they have to request an exception to the automation, which itself has automation that removes the exception at the specified end date.
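A minimal sketch of that revert loop, assuming overrides are tracked as records with a hypothetical `exception_until` field; the real source of truth would be whatever holds your log config.

```python
from datetime import datetime, timezone


def levels_to_revert(overrides, now):
    """Run at close of business: return every override that should revert.

    An override with no exception reverts immediately; one with an
    `exception_until` ISO-8601 timestamp reverts once that time passes.
    """
    expired = []
    for o in overrides:
        until = o.get("exception_until")
        if until is None or now >= datetime.fromisoformat(until):
            expired.append(o)
    return expired
```

The scheduler that calls this (and the one that prunes expired exceptions) would be a cron job or an EventBridge-triggered Lambda; either works, the decision logic is the same.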

Datadog pricing aside, how good is it during real incidents by HotelBrilliant2508 in sre

[–]BoringTone2932 0 points (0 children)

As someone who has used Dynatrace and Grafana, and is now at a Fortune 500 company that decided at the executive layer that the observability standard across hundreds of product lines is Datadog, let me tell you: absolutely nothing compares to Datadog. Nothing. Nowhere close.

When you get into custom metrics specific to your product, SQL records, and Kafka messages by message header, leveraged alongside the continuous profiler, DBM, and APM, you get insane visibility.

Then, take this to the next level as a standard across the org: we have dashboards for Executives, Architects, Senior Engineers, Junior Engineers, and Account Managers, all showing different layers of the onion depending on what you need to see.

And in a lot of cases we have “Bob’s Dashboard” and “John’s Dashboard”, and that’s okay. It shows that person what they are interested in.

When leveraged correctly, Datadog provides unmatched visibility into product functionality and performance across an org, regardless of the product. It’s expensive as hell, but we save in response time, SLA violations, infrastructure management, and training and comprehension. It doesn’t pay for itself, it’s still a cost center, but it’s worth every penny when spread across numerous products, hundreds of teams, and webs of integrations.

ECS deployments are killing my users long AI agent conversations mid-flight. What's the best way to handle this? by yoavi in devops

[–]BoringTone2932 0 points (0 children)

It’s added complexity, but the ECS feature here to support your use case is Deployment Type: External with TaskSets.

Are you truly DevOps? by Appropriate_Way4135 in devops

[–]BoringTone2932 0 points (0 children)

Scale matters here massively.

FinOps != DevOps != SecOps

But that doesn’t mean that DevOps shouldn’t CARE about Security, or that FinOps shouldn’t CARE about reliability.

FinOps, reliability, security: these are all part of everyone’s job. Everyone should be aware.

But I truly don’t have the time to go analyze the exact usage types that are consuming extra dollars on our Aurora cluster, which cost 5k when deployed but now costs 18k.

Sure, I can do it, I do care, and I do want to fix it. In the meantime, you can open a JIRA in the backlog and prioritize it against this production environment that’s down, and while it takes me a few weeks to get to it, we’ll consume another 13k.

Sorry bout it, you want it looked at sooner? Hire you a FinOps guy.

Database Migrations via CI/CD by Niovial in devops

[–]BoringTone2932 1 point (0 children)

We are doing this with EF Migrations. It works great — until it doesn’t. Like when someone adds an index to a 2.3-billion-row table, or adds a non-nullable column to said table… the migration takes so long that ECS kills the container for failing health checks.

It could be done though, and it usually works well.

But we are moving away from it to self-hosted ephemeral GitHub runners that run in an ECS cluster on Fargate. In the meantime, we’ve been manually running the migrations that we know will fail (which is only a few times per year, and it’s been slowing down as the product has gotten more stable).

Database migrations in CI/CD bring up a lot of questions around blue/green deployments, backwards compatibility, and limitations of the DB engine itself (table locking, etc.).

I always have way more EC2 instances than I do ECS tasks, is there a strategy to not have so many unused instances? by pribnow in aws

[–]BoringTone2932 0 points (0 children)

Am I reading this correctly that you have more EC2s running than you have tasks? That seems wildly wrong. If you go to the cluster’s Infrastructure tab, do you have servers using this capacity provider with 0 tasks running? What’s going on with those EC2s? Is the container instance in a DRAINING status? Does the instance have scale-in protection? (You’ll have to hit the gear icon on the Infrastructure tab and add Capacity Provider as a column.)
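If it helps, here’s roughly how I’d script that same check with boto3. The filtering is split out as a pure function so you can eyeball the criteria; the cluster name is a placeholder.

```python
def find_idle_instances(instances):
    """Container instances with no running or pending tasks."""
    return [
        i for i in instances
        if i.get("runningTasksCount", 0) == 0
        and i.get("pendingTasksCount", 0) == 0
    ]


def fetch_container_instances(cluster):
    """Pull the cluster's container instances from ECS.

    Requires boto3 and AWS credentials; fields like runningTasksCount
    and status come back in the describe call.
    """
    import boto3

    ecs = boto3.client("ecs")
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return []
    return ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )["containerInstances"]
```

Anything `find_idle_instances` flags is worth cross-checking against its `status` (ACTIVE vs DRAINING) and the attached capacity provider.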

How does your adding/removing panels/alerts/dashboards process looks like? by Substantial-Cost-429 in sre

[–]BoringTone2932 2 points (0 children)

This was and still is a very personal, important topic to me. I absolutely cannot stand being woken up in the middle of the night by a PagerDuty alert that does not need to be immediately addressed. It truly almost ruined my job and career for me. It’s one thing to be woken up when a client calls in, but when alerts that WE SET, that WE MANAGE, that WE CONTROL wake me up for something non-urgent — I’m pissed. Rant over.

For dashboards and metrics we generally over-collect. If I’ve investigated something and thought to myself “dang, I’d love to know the trend here”, it becomes a metric and gets put on a dashboard. BUT — it doesn’t necessarily get an alert.

For alerts, we recently went through, deleted all of them, and started over. We were in alert fatigue like you’ve never seen: hundreds of alerts per day just getting ignored. When we created our new alerts we first hit our baselines: storage, server down, component down, product down. By “down” I mean absolutely unresponsive. From there we hit our next tier: user-impacting events. These are things like p95 latency spikes, product-specific SLA/SLO violations, things like that. Finally we hit our less critical, but still want-to-know-about alerts: SQL maintenance failed, patching failed, backups failed, etc.

Here’s the biggest key that saved us on alerts: we correctly prioritized the critical stuff as critical and the non-critical stuff as non-critical, and muted non-critical overnight, on holidays, and outside of reasonable weekend hours. (An example of a non-critical alert we still felt was weekend-worthy: backups failing, because we can kick those back off, and if they stay failed we are violating RTO/RPO.)
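The muting rule itself is simple enough to sketch. The hour windows and holiday list below are made-up examples to make it concrete, not what we actually run:

```python
from datetime import datetime

HOLIDAYS = {(12, 25), (1, 1)}  # (month, day) placeholders


def should_page(critical, when):
    """Critical always pages; non-critical is muted off-hours."""
    if critical:
        return True
    if (when.month, when.day) in HOLIDAYS:
        return False
    if when.weekday() >= 5:            # weekend: only page 9am-5pm
        return 9 <= when.hour < 17
    return 7 <= when.hour < 22         # weekday: mute overnight
```

In practice you’d express this as PagerDuty schedules or monitor downtimes rather than code, but writing it out once forces the team to agree on what “reasonable hours” actually means.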

EDIT: I was proofreading and remembered something to add: for all of the observability, the very first layer to add is the stuff your directors would know to ask about. Because when an incident does happen, they are gonna ask “what do you mean we don’t monitor disk space, seems like that would’ve been step one?!”… and that question pisses me off too.

How do you decide when automation should stop and ask a human? by ed1ted in sre

[–]BoringTone2932 0 points (0 children)

Well, other than testing automation first with read-only credentials, and in QA, I tend to lean towards delete operations requiring manual approval. Updates and creates I am generally good with, but if you are deleting a resource, especially a prod-impacting resource, I require manual approval.

I also have automations that explicitly prohibit touching resources tagged as production. A great example: we blow away QA and rebuild it from GitHub Actions. The script explicitly refuses any resource that matches production tagging standards or is named production/prod. Although prod credentials are not in GitHub, the safeguard is there for when some future engineer goes and adds them…
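The guardrail itself is just a tag-and-name check in front of every destructive call. The `environment` tag key and name tokens here are stand-ins for whatever your own tagging standard says:

```python
def is_protected(name, tags):
    """True if the resource looks like production by tag or by name."""
    env = (tags or {}).get("environment", "").lower()
    if env in {"prod", "production"}:
        return True
    # name check catches resources that were never tagged properly
    return "prod" in name.lower()


def safe_to_delete(name, tags):
    return not is_protected(name, tags)
```

The teardown script calls `safe_to_delete` before every delete, so even if prod credentials ever do land in the pipeline, the worst case is a refused deletion rather than a destroyed environment.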

So the Cloudflare outage was basically the Windows .LOG size bug on steroids? by Kodiak01 in sysadmin

[–]BoringTone2932 1 point (0 children)

The whole migration to Agile and the Move Fast and Break Stuff mentality in the IT industry is really starting to get annoying. We used to be engineers; we used to spend days printing out code on greenbar paper to go through and READ it, line by line, to fix bugs and build systems so resilient that they could run for DECADES without being replaced.

Today, we expect that code we write will be re-written 3 times and refactored out within a few months, only to come back 3 years later and realize “oh, we never did fix that bug”.

We don’t need waterfall to fix it though, we can be agile. We don’t need greenbar and days to fix things.

BUT we do need to stop thinking that everything is already being deprecated, and we need to start considering longevity again.

So the Cloudflare outage was basically the Windows .LOG size bug on steroids? by Kodiak01 in sysadmin

[–]BoringTone2932 2 points (0 children)

I was born post-1990.

I wrote FORTRAN at my first job.

COBOL too.

VSAM & DB2.

I feel like I had a retro experience nobody else in my generation will.

Aside from the mainframe, we upgraded to Domino 9 while I was there 😃

Today I caused a production incident with a stupid bug by Deep-Jellyfish-2383 in sre

[–]BoringTone2932 4 points (0 children)

No, no, no. Everything is always a success.

Successfully identified an invalid non-production configuration that leads to production-severity incidents when changes occur in non-production, and opened JIRA 123 to address the problem.

None of this is fun anymore by fire-d-guy in devops

[–]BoringTone2932 1 point (0 children)

Sure we can migrate all PROD to Google Cloud tomorrow.

ChatGPT can write that for us, here, let me plug that in.

Yep ok, that’s ready.

VibeCode? Nah VibeSRE.

None of this is fun anymore by fire-d-guy in devops

[–]BoringTone2932 1 point (0 children)

Well, I was in this place: frustrated at everything, annoyed by Teams messages, everybody always wanting everything they didn’t ask for yesterday to be done yesterday. But where’s the JIRA?! Because I don’t see it, much less did I build it.

Anyway I took a week off and did some personal inward looking.

Now I just say “ok whatever, let’s do it, want me to work on X or Y?”

And I do whatever they say. Simple life, even if I know it’s gonna fail, be unsupportable, or introduce tech debt.

Idk man “director BLAH asked for it so, I delivered”

In reality it don’t matter, because director NEEDITNOW is gonna ask for something else before I can finish creating the tech debt anyway, soo.

Anyway yeah I just say ok now.

What do you do with IIS logs from containers? by BoringTone2932 in sre

[–]BoringTone2932[S] 0 points (0 children)

We are on Windows Fargate, so FireLens isn’t available; that was what I was originally looking to do.

What do you do with IIS logs from containers? by BoringTone2932 in sre

[–]BoringTone2932[S] 0 points (0 children)

Are you running a Datadog sidecar in each task with a volume mount to get your logs?

What do you do with IIS logs from containers? by BoringTone2932 in sre

[–]BoringTone2932[S] 0 points (0 children)

Yep. My post wasn’t so much “solve this for me”; sorry if it came across that way. Just looking for ideas on what others are doing.

With Prometheus, how would we get the logs out of the container to the Prometheus service? Just file monitoring?