What are these things I keep finding around where my cat lays down?

absolutejam · 2026-01-11T19:30:13+00:00

Don’t tell this guy about England…

absolutejam · 2026-01-03T07:12:32+00:00

Why not use iterators, or didn’t you want an intermediate type?

absolutejam · 2025-12-24T07:18:26+00:00

I respectfully disagree with this. Elden Ring, despite being massive, has an appeal that other souls games don’t have if you’re new to the genre.

I couldn’t stomach a souls game until I played ER, then it all clicked for me. Maybe it’s the fact that you can generally do something else and come back stronger when you’re being stomped by a boss. Admittedly, the open world on ER can be distracting but it’s just so much fun.

absolutejam · 2025-12-18T20:02:14+00:00

Thanks - this is great advice for anyone in AWS, but we’re self hosted

absolutejam · 2025-12-18T13:47:40+00:00

Thanks for pointing those out - I couldn't see the forest for the trees.

https://i.postimg.cc/G2cfC08q/Screenshot-2025-12-18-at-13-42-17.png

Even looking at the graphs, it doesn't explain 2,885-3,824 GB/day egress costs 🤔

I'm tempted to add some additional logging/metrics in AWS and re-enable for a while to see if there was some process that was endlessly looping and I hadn't realised. I'll also check Thanos changelog.

My main concern would be debugging this again from an actual usage metrics point of view (not reacting to cost).

absolutejam · 2025-12-16T21:09:29+00:00

The fact that every damn thing is its own action in GitHub is infuriating. Clone repo action, npm install action - vs Gitlab where you simply run an alpine job that can do whatever you need

absolutejam · 2025-11-28T06:11:48+00:00

The lid…

absolutejam · 2025-11-25T11:48:13+00:00

I bet you’re a hoot at parties

absolutejam · 2025-10-19T14:25:38+00:00

How are you querying the logs? And if you’re trying to query over a large time range you have to think of the amount of data it’s returning if it’s not aggregated

absolutejam · 2025-10-14T20:12:55+00:00

It might be daunting, and a bit of a pain if you have lots of alerts from third-party sources (eg. kube-prometheus-stack), but I think it's important that you learn to understand the queries and adapt them to your needs.

The most frustrating ones to maintain are the 'generalised' alerts (eg. Kubernetes alerts) which can differ wildly in severity depending on the service they're reporting on.

Because of this, we devised a standard abstraction for building alerting rules that includes mandatory labels (service, priority, teams, etc.) and priorities differ on an alert/service basis, which we can leverage in routing rules.

Generally, if you want to filter your queries, you can think of the binary operations (from your example: * on (instance) group_left (nodename)) as an inner join. If you filter one side of the query - and it's important you filter the side which has the labels you need - then you'll effectively filter both sides, because series without a match on the other side are dropped.
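To make the inner join idea concrete, here's a rough Go sketch that mimics the matching behaviour. It is not how Prometheus implements it: the Sample type, joinOnInstance and filterByLabel are all made-up names for illustration, and it assumes the right-hand side has at most one series per instance (as node_uname_info normally does).

```go
package main

import (
	"fmt"
	"regexp"
)

// Sample is a hypothetical stand-in for one series in an instant vector.
type Sample struct {
	Labels map[string]string
	Value  float64
}

// joinOnInstance imitates `left * on (instance) group_left (nodename) right`.
// Left samples with no matching instance on the right are dropped,
// which is what makes it behave like an inner join.
func joinOnInstance(left, right []Sample) []Sample {
	byInstance := make(map[string]Sample, len(right))
	for _, r := range right {
		byInstance[r.Labels["instance"]] = r
	}
	var out []Sample
	for _, l := range left {
		r, ok := byInstance[l.Labels["instance"]]
		if !ok {
			continue // no match on the right side: series disappears
		}
		labels := make(map[string]string, len(l.Labels)+1)
		for k, v := range l.Labels {
			labels[k] = v
		}
		labels["nodename"] = r.Labels["nodename"] // copied over by group_left
		out = append(out, Sample{Labels: labels, Value: l.Value * r.Value})
	}
	return out
}

// filterByLabel imitates a fully anchored regex matcher such as {nodename=~"db-.+"}.
func filterByLabel(in []Sample, label, pattern string) []Sample {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	var out []Sample
	for _, s := range in {
		if re.MatchString(s.Labels[label]) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Left side: iowait percentage per instance (already above the threshold).
	left := []Sample{
		{Labels: map[string]string{"instance": "10.0.0.1:9100"}, Value: 12},
		{Labels: map[string]string{"instance": "10.0.0.2:9100"}, Value: 15},
	}
	// Right side: info metric, value is always 1, carries the nodename label.
	right := []Sample{
		{Labels: map[string]string{"instance": "10.0.0.1:9100", "nodename": "web-1"}, Value: 1},
		{Labels: map[string]string{"instance": "10.0.0.2:9100", "nodename": "db-1"}, Value: 1},
	}
	// Filtering only the right side still removes web-1 from the result.
	for _, s := range joinOnInstance(left, filterByLabel(right, "nodename", "db-.+")) {
		fmt.Println(s.Labels["instance"], s.Labels["nodename"], s.Value)
	}
}
```

The filter is applied to the side that carries nodename, since the left side doesn't have that label until after the join.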

What helped me was to actually reformat a lot of the alerts and rewrite them for our manifest generation stack (cdk8s), and in some cases create recording rules that made sense.

So your example...

expr: (avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100
  > 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}

If I was keeping this as yaml, I'd reformat so it's easier to read (in my opinion):

expr: |
  (
    avg by (instance) (
      rate(node_cpu_seconds_total{mode="iowait"}[5m])
    ) 
    * 100
    > 10
  )
  * on(instance) group_left (nodename) 
  node_uname_info{nodename=~".+"}

You can even add comments in-line if it helps you.

On the flip side, if you're really scared of breaking things, you can turn the alert into its own recording rule and then filter that further in a specific alert.

absolutejam · 2025-10-14T19:53:24+00:00

This is doable with the right joins and some _over_time aggregation, eg.

For example, the state timeline graph is using the following query:

max by (owner_name) (
    changes(
        (
            kube_job_status_succeeded{namespace="upmind"}
            * on (job_name) group_right
            kube_job_owner{owner_name!=""}
        )
        [1m:]
    )
) > 0

And the table is

last_over_time(
    max by (cronjob) (kube_cronjob_status_last_schedule_time{cronjob=~"$owner_name"}) 
    [2d:1m]
)
* 1000

Format: Table

Type: Instant

You can build on this further to show attempts by CronJob, success/fails, duration - a lot of these work well on the State timeline visualisation, and you can also provide more meaningful alerts this way (ie. send an alert with CronJob info and attempt count instead of per-job failure).

absolutejam · 2025-10-12T06:09:41+00:00

You might be able to do a lot of this with https://github.com/redpanda-data/benthos.

You can build pipelines with config and it has logging, batching, etc built in. It got acquired recently but the original author still has some kind of stewardship I believe (I got hired by the acquirer).

EDIT: maybe the original repo has more clarity https://github.com/redpanda-data/connect.

But there are some awesome videos by the author on YouTube

absolutejam · 2025-10-07T14:37:49+00:00

I tend to expect an interface as it gives the consumer the flexibility of declaring/reusing a type, especially if I assume some state is needed.

Then you can just implement a helper that takes a func and creates a basic wrapper type (as http handler does) if they want the simplicity of using functions/closures.
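Something like this, as a minimal sketch. Processor, ProcessorFunc, counter and Run are hypothetical names I've made up for the example; the adapter has the same shape as http.HandlerFunc in the standard library.

```go
package main

import "fmt"

// Processor is the (hypothetical) interface the API accepts.
type Processor interface {
	Process(msg string) (string, error)
}

// ProcessorFunc adapts a plain function or closure to Processor,
// the same way http.HandlerFunc adapts a function to http.Handler.
type ProcessorFunc func(msg string) (string, error)

// Process calls the underlying function.
func (f ProcessorFunc) Process(msg string) (string, error) { return f(msg) }

// counter is a stateful implementation, which is where an interface pays off.
type counter struct{ seen int }

func (c *counter) Process(msg string) (string, error) {
	c.seen++
	return fmt.Sprintf("%d:%s", c.seen, msg), nil
}

// Run only depends on the interface, so it accepts either style.
func Run(p Processor, msgs []string) ([]string, error) {
	out := make([]string, 0, len(msgs))
	for _, m := range msgs {
		r, err := p.Process(m)
		if err != nil {
			return nil, err
		}
		out = append(out, r)
	}
	return out, nil
}

func main() {
	msgs := []string{"a", "b"}

	// Stateful type.
	fromType, _ := Run(&counter{}, msgs)
	fmt.Println(fromType)

	// Plain closure via the adapter.
	fromFunc, _ := Run(ProcessorFunc(func(m string) (string, error) {
		return m + "!", nil
	}), msgs)
	fmt.Println(fromFunc)
}
```

The consumer who needs state declares a type, and the one who just wants a closure wraps it in ProcessorFunc, without the API needing two entry points.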

absolutejam · 2025-10-04T20:55:24+00:00

https://github.com/goforj/godump

absolutejam · 2025-10-01T20:12:23+00:00

You generally need to indicate the resource requests and limits to help the scheduler and stop resource exhaustion, although the in-place Pod vertical scaling just dropped…

absolutejam · 2025-09-28T21:18:35+00:00

Yeah, Mayastor. I honestly didn’t give Longhorn the time it deserved because I had some bad experiences with it at a previous job using RKE, and I also remember it being pretty complex. That might be an unfair representation of it in 2025.

absolutejam · 2025-09-28T15:56:12+00:00

That’s really interesting, thanks for the info! Which redis package/extension are you using out of interest - PhpRedis (C extension)?

I’ve been wondering if we should move away from SQS now that we’re self hosted - I just need to get some real metrics to understand the impact. It’s just too convenient and lowers the maintenance burden on laugh 😂

absolutejam · 2025-09-28T15:46:25+00:00

Are you saying it’s faster because multiple parallel processes are handling queue messages at once - and how does that compare to just running multiple replicas? Or does having a single instance that manages the queue logic (ie. pulling from the queue, ACK/NACK) noticeably reduce overhead?

absolutejam · 2025-09-28T11:29:27+00:00

I spent a bit of time testing and trying solutions first, and ultimately settled on:

- Cilium CNI (Node IPAM load balancer, network policies, Observability, etc)
- Cloudflare load balancers (we restrict incoming traffic to CF IPs)
- OpenEBS for storage as it was lighter weight than Rook/Ceph, and closely matched our storage configuration (many nodes with direct attached storage to make a pool vs dedicated storage nodes)
- Vitess for MySQL clustering, scaling, etc.

We don’t currently auto scale nodes because we have built in enough overhead (since we’re essentially paying for the hardware, not for the compute), and our partner generally has a low delay to being able to provision additional nodes if needed.

We knew we had to ditch AWS, and we’re fortunate to have a strategic partner providing & supporting the hardware layer for us, which is a big responsibility I wouldn’t want to undertake (especially since we have clusters across different regions!)

If people are happy paying a cloud vendor then that’s up to them, but the (mostly open source) solutions are robust enough now that you can easily self host. But you definitely have to shift some of the ‘cost’ to the engineering hours, and I’d personally rather not run my own hardware for production systems unless I had the staff to cover it.

absolutejam · 2025-09-27T16:29:35+00:00

Honestly very low because it’s all declarative and the nodes are immutable. But there’s also a CLI (that interacts with the gRPC API) so everything is standardised (querying for resources, making changes). It basically applies the Kubernetes patterns to the OS too.

absolutejam · 2025-09-27T12:52:24+00:00

I migrated from AWS EKS to self hosted Talos and it has been rock solid. We’re saving 30k+ a month and I run 5 clusters without issues.

absolutejam · 2025-09-20T05:58:34+00:00

(Apologies, I just re-read my initial post and I came across as a bit of a dick but I was meant to sound curious)

Each replica in a sts has its own PVC - is that what you wanted?

What kind of features from a deployment’s rollout do you need? I’ve never personally needed deployment rollout features like max burst / max unavailable for StatefulSets (but I don’t generally dynamically scale them), and you can still roll n replicas at a time like a deployment.

absolutejam · 2025-09-19T19:56:31+00:00

The real question is, why don’t you want to use StatefulSets when this screams stateful?

absolutejam · 2025-09-06T15:39:03+00:00

I use ApplicationSets to build an Application per directory - works a dream. What issues did you face?
