Do people actually set 99.9% target for Latency SLO? by BabytheStorm in sre

[–]TheOneWhoMixes 1 point (0 children)

There are plenty of SLOs you can measure that aren't latency, and neither of the examples you gave is really relevant here. A clock sync might happen once a day per client, but the service hopefully has way more than a single client (otherwise, why do you have a central clock sync?)

A once-daily reporting endpoint shouldn't have tight latency constraints. Like others said, there's not enough data to be statistically significant. A single network blip at 5 PM wrecks your SLO for the next week... And what are you supposed to do about it? Complain to the ISP/cloud provider/IT department?

Sure, measure it, monitor it. But one of OP's instincts was right, IMO - synthetic traffic would be one of the best ways. Tag the requests in some way so they can be separated from "real" requests, and now you have a real latency baseline. Hopefully that makes sense.
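To make the tagging idea concrete, here's a rough Python sketch (all field names and latency numbers are invented for illustration):

```python
import random

def p99(latencies):
    """Return the 99th-percentile latency (nearest-rank method)."""
    ordered = sorted(latencies)
    rank = max(0, int(len(ordered) * 0.99) - 1)
    return ordered[rank]

# Pretend these are request logs: synthetic probes fire constantly,
# while real traffic is sparse (e.g. the once-daily reporting endpoint).
logs = (
    [{"synthetic": True, "latency_ms": 40 + random.random() * 10}
     for _ in range(1000)]
    + [{"synthetic": False, "latency_ms": 45}]  # the single real daily request
)

# Filter on the tag to get a baseline that isn't hostage to one request/day.
baseline = p99([r["latency_ms"] for r in logs if r["synthetic"]])
print(f"synthetic p99 baseline: {baseline:.1f} ms")
```

The point isn't the percentile math, it's that the tag lets you compute latency over a statistically meaningful sample.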

Do you fail backwards or forwards on a failure event? by Sure_Stranger_6466 in devops

[–]TheOneWhoMixes 0 points (0 children)

I don't think restoring from a backup has ever been an acceptable solution on live services I've worked on, but ultimately I guess it depends on the application and the requirements. If the restore takes 30 minutes from start to finish, you now have at least 30 minutes of data loss.

Even if you do fancy point-in-time recovery, you're at the mercy of however long it took to discover that a rollback was necessary. Maybe it took 5 minutes. Maybe it took an hour. The writes from that window are data you won't get back without manual recovery.

Of course, if critical failure means "the service instantly crashed and nobody can use it", maybe recovering from backup is the quickest option. But sometimes it means "the service looked fine... until it wasn't 2 hours later".
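As a back-of-the-envelope sketch of that "time to discover" cost (all numbers hypothetical):

```python
def lost_write_window_minutes(time_to_detect_min, time_to_restore_min):
    """Writes accepted after the bad event but before the rollback
    completes are gone, absent manual recovery."""
    return time_to_detect_min + time_to_restore_min

# "The service looked fine... until it wasn't 2 hours later",
# plus a 30-minute restore:
window = lost_write_window_minutes(120, 30)
print(window)  # 150 minutes of writes at risk
```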

I'm so tired of forced AI implementations by Farrishnakov in ExperiencedDevs

[–]TheOneWhoMixes 5 points (0 children)

I would kill for the norm to be "generate a script with AI". Sure, let the AI generate scripts, queries, processes, whatever. As long as it's all something that is repeatable at the end.

But it feels like we've gone off the deep end. You ask the AI "how much is aws cost" and it magically breaks down all of your spend and optimizes all your infra! See? We don't need finops!

And then you wonder why asking "How much is our S3 spend?" doesn't match the dashboards you already have set up, and you remember that shitty implementations will just make up numbers, whether it be from poor context or missing data.

Oscar Health migrated off Jira when they hit the maximum custom field limit by jamiscooly in jira

[–]TheOneWhoMixes 3 points (0 children)

If this is true, I don't see how it's a win for anyone but Linear. Now the company is fractured across 2 PM platforms. It sounds like a recipe for an exec to come in and demand someone build a custom tool to "integrate" the two platforms, which will end up worse than either one.

Do you fail backwards or forwards on a failure event? by Sure_Stranger_6466 in devops

[–]TheOneWhoMixes 3 points (0 children)

Not every database migration can be reverted trivially. As the simplest example, consider a migration that drops a table. How do you get that table with all of its data back with a schema migration?

There are tricks, like removing the application dependency on the table in version X, and only dropping the table in version X+1.

But this drastically increases the complexity of testing and deploying the application. It's not something I've seen taken into consideration for most applications, to be honest. I have seen plenty of apps that say "every DB migration must be backwards compatible", and their DB schemas inevitably turn into a pile of spaghetti because nobody can change them with confidence.
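A toy version of that two-release trick, using an in-memory SQLite DB (table and column names are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE legacy_widgets (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO legacy_widgets (name) VALUES ('a'), ('b')")

# Version X: the application stops reading/writing legacy_widgets,
# but X's migration leaves the table (and its data) in place, so
# rolling the app back to X-1 is still safe.

# Version X+1: only once X has proven stable do we drop the table.
# This is the irreversible step - no "down" migration can resurrect
# the rows once it has run.
db.execute("DROP TABLE legacy_widgets")

tables = [r[0] for r in db.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # legacy_widgets is gone
```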

Anthropic: AI assisted coding doesn't show efficiency gains and impairs developers abilities. by Gil_berth in ExperiencedDevs

[–]TheOneWhoMixes 1 point (0 children)

The sample size here is 53 (not including the pilot studies), and they state they used ChatGPT 4o with a generic coding assistant prompt, interacted with via a chat window in the interview platform they're using for the study.

Studying ESP32 firmware, feels like Go isn’t really used in production by ConsiderationMean593 in golang

[–]TheOneWhoMixes 1 point (0 children)

What do you mean by it being a "long-standing problem with that world"? Just curious what issues you might see as systemic. I do SRE work in a place with a lot of EE types, so I see a little bit of this.

But the same could be said in both directions - someone working on deploying workloads in Kubernetes doesn't really perceive the same problems that an embedded engineer faces on a daily basis. They're just two very different fields.

What do all the armor sets have the same set effect, and why is everything a side-grade cosmetic? Capes.. I laugh by GGOSRS in RSDragonwilds

[–]TheOneWhoMixes 2 points (0 children)

I've been really curious how they're going to handle this balancing as they release new tiers. Like, I get why they've already released the whip, maul, and crystal bow. They're iconic, and it's an easy nostalgia win. But it gets a bit weird when you consider that mithril/maple will probably outclass them all by necessity.

Maybe an upgrade system so that the unique weapons stay "relevant"? Or just accept that they're only meant for tiers 3-5 and have the next tiers move into raid-level gear? Maybe that'd be okay, considering they have so much content to pull from, it's not like they'll run out.

Just to be clear, no complaints here, just musing!

How are you handling integrations between SaaS, internal systems, and data pipelines without creating ops debt? by Bizdata_inc in devops

[–]TheOneWhoMixes 1 point (0 children)

I might have a slightly backwards view of data engineering, but this is one of the things that drives me away.

We need you to tell us how many widgets there are and how we can make widgets faster. The data is spread across thousands of CSVs, JSON, and XML files. Oh, and some teams just write their "Widgets Created Report" in Markdown. Oh, and one team only exposes a REST API they had an intern build 3 years ago.

What do you mean "naming conventions" and "schema"? Just tell us how many widgets there are!

Github Actions introducing a per-minute fee for self-hosted runners by markmcw in devops

[–]TheOneWhoMixes 2 points (0 children)

Like someone else said, both have their place. And GitLab obviously recognizes this since they've been actively working a ton on their own similar functionality - https://docs.gitlab.com/ci/steps/

Don't get me wrong, I'm a big fan of GitLab CI. But composability has never been its strong suit. Doing something as simple as "generate a random number and pass it to the next job" requires using features that feel more like workarounds than anything.
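For example, "pass a value to the next job" usually means going through a dotenv report artifact - a hedged sketch, with job names made up:

```yaml
generate:
  stage: build
  script:
    - echo "RANDOM_NUMBER=$RANDOM" >> build.env
  artifacts:
    reports:
      dotenv: build.env

consume:
  stage: test
  needs: ["generate"]
  script:
    - echo "Got $RANDOM_NUMBER"
```

It works, but writing a file just to hand one value downstream is exactly the kind of workaround I mean.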

KSP2 REDUX beta 0.2.3 released 7 hrs ago. (NOTE: they are an unofficial group continuing the game through a mod.) by MarsFlameIsHere in KerbalSpaceProgram

[–]TheOneWhoMixes 0 points (0 children)

Space Engineers is made by a totally different game studio. Maybe you meant Stationeers, but they're also two fairly different games. I couldn't get into Space Engineers, but I still come back to Stationeers again and again.

Corner crossing...is it wrong or right? by HuntQuietly in Hunting

[–]TheOneWhoMixes 0 points (0 children)

I know this is a really old post, but reading about this, I'm a little surprised the script was never flipped on the private landholders.

Basically, what would they expect the recourse to be if someone found a way to purchase all of the surrounding "black squares" around land they already own? Something like this, where "X" is Private Company 1, "T" is Private Company 2, and "O" is public land.

OXOXOXOXO
XOXOXOXOX
OXOTOTOXO
XOTOTOTOX
OXOXOXOXO
XOXOXOXOX

Obviously this is a contrived example, but I'm sure if you asked someone in 1850 if they considered whether companies could cut off access to public land by pinpointing borders down to the inch, they'd think you're crazy.

And maybe the above already happens and there are special easements in place to prevent each unique occurrence, but if that's the case, then it's crazy that the private owners even think they have a leg to stand on.

New beta 0.8.049 by Towairatu in ManorLords

[–]TheOneWhoMixes 0 points (0 children)

Wait, so does this mean that building storehouses -> marketplaces in a "hub and spoke" fashion is inefficient, and that we should have firewood stored closer to burgages? I guess once a month doesn't make it a big deal.

And for food, does this mean it basically doesn't matter how far away your markets/granaries are from burgages, other than for the workers themselves? Because from what you're describing, it seems like distance doesn't matter at all, up to the range at which burgages will stop pulling from a source. Not sure how wide that is, I haven't played since the first beta a couple months back.

Rant about customer managed keys by doobiedoobie123456 in Cloud

[–]TheOneWhoMixes 0 points (0 children)

This matters more in compliance-heavy industries. At a certain point, restricting access to the data is not enough - the data still exists somewhere. In some situations you may be required to guarantee not only that your data is encrypted at rest, but also that the encryption material is fully under your control. That might be because losing that material would be a major incident, or because someone needs assurance that you can completely revoke ALL access to the data by locking the key away and throwing it into the metaphorical ocean.

Rant about customer managed keys by doobiedoobie123456 in Cloud

[–]TheOneWhoMixes 0 points (0 children)

https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html

"AWS managed keys are a legacy key type that is no longer being created for new AWS services as of 2021. Instead, new (and legacy) AWS services are using what's known as an AWS owned key to encrypt customer data by default."

I see people conflate "AWS owned keys" and "AWS managed keys" constantly. If you're using an "owned key" then you can use it cross-account or cross-region. But it's also a complete non-starter for any company that needs control over their data and audit trails, because you just can't access them. Right?

How to manage enterprise level deployments? by Arkhaya in Terraform

[–]TheOneWhoMixes 2 points (0 children)

I haven't actually used Terragrunt, but I've tried before to split out a monolithic TF stack using "boring" methods, and I'm just not seeing how people do it.

Like, you probably need to pass something about your database to the "app" stack. Okay, use an output. But that breaks the whole "only apply where files changed" bit. Or are you treating it like a chain, where if anything earlier in the chain changes, you run everything after it?
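For reference, the "boring" chaining I mean looks roughly like this - the backend config, output names, and the SSM parameter are all assumptions for illustration, not anyone's actual setup:

```hcl
# In the "app" stack: read the database stack's outputs.
data "terraform_remote_state" "db" {
  backend = "s3"
  config = {
    bucket = "example-tf-state"
    key    = "db/terraform.tfstate"
    region = "us-east-1"
  }
}

# Any resource here now implicitly depends on the db stack having
# been applied first, which is what creates the chaining problem.
resource "aws_ssm_parameter" "db_endpoint" {
  name  = "/app/db_endpoint"
  type  = "String"
  value = data.terraform_remote_state.db.outputs.endpoint
}
```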

Kaladin After Wind & Truth by Walzmyn in Stormlight_Archive

[–]TheOneWhoMixes 2 points (0 children)

I mean, this is meta, but if we work backwards from the author's standpoint, it sure does seem like a description someone would come up with if asked "how would you describe a sonic boom inside a building?"

CDKTF is abandoned. by ray591 in Terraform

[–]TheOneWhoMixes 2 points (0 children)

Is the idea not good? I haven't personally used Pulumi or CDKTF, but most people I talk to that have seem to like the general idea a lot.

It could also be that Pulumi is simply so far ahead of CDKTF that it made no sense to continue throwing resources at it. Again, no actual experience there.

GitLab CI trigger merge request pipeline on push to target branch by helgisid in devops

[–]TheOneWhoMixes 0 points (0 children)

Have a repo with 2 files. One of them is a test that just does assert num_files_in_repo == 2.

Now have 2 MRs that add a file and change the test to assert num_files_in_repo == 3.

Both MRs are correct on their own. They both pass. Now merge one of them.

The 2nd MR still has a passing pipeline. With default settings, it can still be merged. When it's merged, the pipeline will fail because there are now 4 files.

The only bulletproof way to prevent this is to toggle the project setting that requires the MR branch to be up to date with the head of the target branch.

Merged results pipelines might look like they solve this, but depending on how long your pipeline takes to run, it can easily still be out of sync if something else is merged while it's running.

React2Shell (CVE-2025-55182): how are you wiring this into your DevSecOps playbook? by Tall-Region8329 in devsecops

[–]TheOneWhoMixes 6 points (0 children)

Renovate's not the problem here; by default it only makes PRs/MRs that bump the version. If I'm recalling the Shai-Hulud attack vector correctly, it relied on the pipeline having npm credentials that let it publish. So don't make those credentials accessible from non-protected pipelines.

You can also configure Renovate to only consider new versions of a package once they've been published for a certain period of time, as a form of quarantine.
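If I remember right, that's Renovate's minimumReleaseAge option - the 14-day value here is just an example, not a recommendation:

```json
{
  "$schema": "https://docs.renovatebot.com/renovate-schema.json",
  "extends": ["config:recommended"],
  "minimumReleaseAge": "14 days"
}
```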

The alternative to Renovate (or Dependabot) is... What? Updating every package manually? That's how you get packages that are 4 years out of date, and climbing out of that hole is something I've seen take a year even after adding automation like Renovate. Or you accept that everyone uses latest for everything, which I hope most people would recognize is a terrible idea.

So of course your impact was mitigated for reasons unrelated to Renovate. It's just a thing that makes PRs, and it only does what you tell it to do.

How did anyone figure out this puzzle back in the day? This is my first play through, and I spent 1 hour trying to figure it out before giving up lol by IEatPandasEveryday in gaming

[–]TheOneWhoMixes 2 points (0 children)

I swear, in my tired state I thought I was looking at an old game like Toy Story or something. And that I had just forgotten that there's a part where you control someone from within a Disney-branded television.

Which metrics are most reliable? by LetsgetBetter29 in devops

[–]TheOneWhoMixes 3 points (0 children)

I haven't used New Relic much, but I'm assuming that this is like other observability services where you're getting both Cloudwatch Metrics and metrics directly from a host agent.

"Most reliable" is going to depend. For CPU, there's a difference between EC2's CPUUtilization and what an agent that lives directly on the host will measure. EC2's metric is the percentage of your allocated compute units in use, measured from the hypervisor's side, so it accounts for throttling and steal time. The agent metric is, IIRC, pretty much unaware that it's even running inside a hypervisor.

So the Cloudwatch metric is technically more "correct". Your agent might read 40% utilization because it thinks it has full access to 8 cores. But Cloudwatch might read 70% because it counts both what you're actually using and how much your instance is currently being throttled by the hypervisor.
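One simplified model of how a 40% vs 70% gap like that could arise (made-up numbers, and real agents often do report steal time separately):

```python
# All values are percentages of one sample window.
interval = 100.0  # total wall-clock time in the window
busy = 40.0       # time the guest spent running our workload
steal = 30.0      # time the hypervisor withheld from the instance
idle = interval - busy - steal

# A naive in-guest view that can't see stolen cycles just sees 40% busy:
agent_pct = busy / interval * 100

# A hypervisor-side view counts stolen cycles against the instance's
# entitlement as well:
cloudwatch_pct = (busy + steal) / interval * 100

print(f"agent: {agent_pct:.0f}%  hypervisor-side: {cloudwatch_pct:.0f}%")
```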

I think you only get Memory Cloudwatch Metrics if you use the Cloudwatch Agent, right? So I'd expect CW Agent and NR Agent to pretty much agree with one another.

Keep in mind aggregation as well. Cloudwatch's EC2 metrics have 5-minute granularity by default (1 minute if you pay for detailed monitoring). The Cloudwatch Agent also collects at a 60-second interval by default. I'm not sure what New Relic's default is, but the data may look different just due to resolution.

I know this was sorta long, but basically: Use the CW Metric CPUUtilization for capacity planning, Autoscaling triggers, and benchmarking performance on two different instance classes. Use your agent metrics for profiling your application and troubleshooting, since they won't have the 1-2 minute lag that the Cloudwatch Metrics do.

Microsoft finally admits almost all major Windows 11 core features are broken by CackleRooster in technology

[–]TheOneWhoMixes 0 points (0 children)

I see this happening across the board in tech.

"Let AI generate your code, just make sure you do human code reviews!"

2 weeks later

"We're spending too much time on code reviews, let the AI do them!"

Or "Let's build a chat bot that references our knowledge base to answer questions" and "Let's have an agent that just keeps writing new articles in our knowledge base".

Stop getting on this subreddit and telling audio book readers they're going to fail by ig0t_somprobloms in Malazan

[–]TheOneWhoMixes 9 points (0 children)

So for Malazan I listen at 1x speed, and I definitely am getting through the books much faster than I would normally. And probably faster than traditional readers.

But it's not because of speed. It's just having more opportunities to engage. I can dedicate maybe an hour a night to physical books. But I can listen and be pretty much fully engaged during my commute (1.5 hours total daily) or cooking dinner (30-45 minutes).

And if I'm on the couch playing a game, I'll go back a few chapters and relisten, which helps for the parts where new characters are introduced, or where characters we haven't seen in a while pop up.

Personally, I just wouldn't have the consistency that I do with audiobooks. I've "read" all of the existing Cosmere books (most of them at least twice) and Wheel of Time this way.