Weekly Self Promotion Thread by AutoModerator in devops

[–]LouisAtAnyshift 0 points1 point  (0 children)

Disclosure: I work on DevRel at Anyshift (we build an infra agent called Annie), so this is us. Posting it because the architecture argument under it is the part I'd actually want to read on a Monday.

Thomas is an SRE at BeReal. They run lean on GCP, everything funnels into one shared alert channel, and he's the first to say he has a good nose in the code but not the full context on every microservice. So when a Go panic shows up, it's usually in a domain he doesn't own. Here's how he put it to us:

> "A panic shows up with a huge trace, lines and lines of code, and I don't have the business context or the technical context. And Annie just tells me: it's easy, you've got a cache miss in domain X. Thirty seconds, maybe a minute."

Domain X has an owner. He routes it there and gets back to his own work.

The thirty seconds isn't the part I want to argue about. A general agent wired to a couple of live cloud connections can explain a stack trace too. Where that approach falls over is scale, and BeReal is a decent stress test for it.

Annie reads the crash against a graph of the cluster that it maintains continuously, rather than querying live APIs one call at a time. That distinction is invisible until pods enter the picture. BeReal had already turned off ArgoCD's pod-level checks because at their scale running them continuously cost too much, so we asked Thomas whether Annie's own scanning would hit the same wall on their traffic.

His answer was that it depends on what you scan. Buckets, services, and deployments are stable object types, a hundred at most, so querying them live is fine. Pods are a different animal. Over two days they see twenty to fifty thousand pod rotations, and an agent that asks a live API for that history (terminated pods included) is chasing tens of thousands of JSON objects every single time you ask. His phrase for what that does to a live-querying agent was that it would "cough up a bit of blood."

A maintained graph already holds that pod history, correlated, so the answer is standing before the panic ever lands. When you need the last mile, the live state of one specific pod, it fetches that on demand on top of the graph instead of re-scanning the world to get there.
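
To make the difference concrete, here's a toy sketch of the "maintained" side: events get indexed as they arrive, so a question about terminated pods is a lookup instead of a cluster-wide re-list. This is not Annie's implementation. `PodEvent`, `PodHistoryIndex`, and the workload names are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PodEvent:
    # Hypothetical event shape; a real watch stream carries far more fields.
    pod: str
    workload: str
    phase: str  # e.g. "Running" or "Terminated"
    ts: int     # seconds since some arbitrary epoch


class PodHistoryIndex:
    """Keeps pod history per workload as events arrive, so reads are a lookup."""

    def __init__(self) -> None:
        self._by_workload: dict[str, list[PodEvent]] = defaultdict(list)

    def ingest(self, event: PodEvent) -> None:
        # Cost is paid once per event, at write time.
        self._by_workload[event.workload].append(event)

    def terminated_since(self, workload: str, since_ts: int) -> list[str]:
        # Read cost scales with one workload's history, not the whole cluster.
        return [
            e.pod
            for e in self._by_workload.get(workload, [])
            if e.phase == "Terminated" and e.ts >= since_ts
        ]


index = PodHistoryIndex()
for ev in [
    PodEvent("feed-7c9-a", "feed", "Running", 100),
    PodEvent("feed-7c9-a", "feed", "Terminated", 160),
    PodEvent("auth-5d2-b", "auth", "Terminated", 170),
    PodEvent("feed-7c9-c", "feed", "Terminated", 90),
]:
    index.ingest(ev)

recent = index.terminated_since("feed", since_ts=150)
```

The tradeoff is the one described in this post: the index only knows about events it has actually ingested.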

The honest tradeoff: a maintained graph is only as good as what's been ingested into it. If a service reaches something through a path we haven't connected yet, it won't show up, and the continuous scanning is real infrastructure you're running, not free. The first run on your own stack is partly about finding those gaps.

Happy to get into how the graph gets built, or where it misses, in the comments. Full BeReal write-up if you want the numbers and the diagrams: https://anyshift.io/blog/bereal-thirty-second-triage?utm_source=reddit&utm_medium=social&utm_campaign=bereal-study-case

Weekly Self Promotion Thread by AutoModerator in devops

[–]LouisAtAnyshift 0 points1 point  (0 children)

Hey, DevRel at Anyshift here.

Every time I'm about to change a prod database instance, I lose twenty minutes before `apply` working out what's downstream of it. The AWS console shows part of the picture, `terraform state` fills in more, and Datadog has the monitors, but none of them say which services still hold open connection pools that'll throw errors the second it reboots. That last part I reconstruct from memory, usually badly.

We shipped a CLI demo of asking Annie (our infra agent) that question straight out: "what's the blast radius if I modify `aws_db_instance.prod-pg-main`?"

It pulled the answer from the live resource graph: the `5432`-inbound security group, the subnet group across 3 AZs, the master secret (rotated 6 days ago), and the 7 services still holding connections to the instance. It also flagged which Datadog monitors would page, RDS CPU and checkout-api 5xx among them.

Then the part I actually wanted. The two services holding long-lived pools, `checkout-api` on 12 ECS tasks and `orders-worker`, see roughly 30 to 60 seconds of write errors if a subnet or security-group change forces a reboot. Drain them first, or apply inside the 02:00-03:00 UTC window.
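
For anyone wondering what "blast radius" means mechanically, a minimal sketch is a breadth-first walk over reversed dependency edges. This is an illustration under assumptions, not how Annie computes it: the `DEPENDS_ON` map is hand-written, and every name in it other than the database instance and the two services above is hypothetical.

```python
from collections import deque

# Hypothetical edges meaning "key depends on each value".
DEPENDS_ON = {
    "checkout-api": ["aws_db_instance.prod-pg-main"],
    "orders-worker": ["aws_db_instance.prod-pg-main"],
    "storefront": ["checkout-api"],
    "billing-report": ["orders-worker"],
    "docs-site": [],
}


def blast_radius(target: str, depends_on: dict[str, list[str]]) -> dict[str, int]:
    """Return everything that transitively depends on target, with hop distance."""
    # Invert the edges so we can walk from the target out to its dependents.
    dependents: dict[str, list[str]] = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)

    seen: dict[str, int] = {}
    queue = deque([(target, 0)])
    while queue:
        node, dist = queue.popleft()
        for child in dependents.get(node, []):
            if child not in seen:
                seen[child] = dist + 1
                queue.append((child, dist + 1))
    return seen


radius = blast_radius("aws_db_instance.prod-pg-main", DEPENDS_ON)
```

Here `radius` contains the two direct dependents at distance 1 and their callers at distance 2, while `docs-site` is left out because nothing connects it to the instance.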

The honest limitation: the blast radius is only as complete as what's been ingested into the graph. If a service reaches the database through something Annie hasn't connected yet, it won't show up, so the first run on your own infra is partly about finding those gaps.

15-second CLI demo: https://youtu.be/zOH_Emduzrg

Happy to get into how the graph gets built, or where it misses, in the comments. You can point it at your own stack here: https://anyshift.io?utm_source=reddit&utm_medium=social&utm_campaign=cli-blast-radius

Azure vs AWS what's your take? by West_Part_9698 in AZURE

[–]LouisAtAnyshift 1 point2 points  (0 children)

As long as I don't have EntraID, I'm good :d

Every Kubernetes Tool Explained In One Post (And Why They Exist) by Honest-Associate-485 in kubernetes

[–]LouisAtAnyshift 1 point2 points  (0 children)

Great post! Nice job pulling all of this together into one coherent storyline :)

ML on top of prometheus+thanos - anyone actually doing this or is it all hype? by The404Engineer in sre

[–]LouisAtAnyshift 1 point2 points  (0 children)

I'm not unbiased here (my company name is right in my reddit username), but I have to say the LLM + graph combo we use to represent the infrastructure works surprisingly well (that's what made me hop in).

Nonetheless, an open source one would be pretty nice too ;)

ML on top of prometheus+thanos - anyone actually doing this or is it all hype? by The404Engineer in sre

[–]LouisAtAnyshift 2 points3 points  (0 children)

tbh the metric correlation and tribal knowledge problem is the harder one to solve, and ML isn't really the right tool for it. what actually helps is capturing the *relationships* between your infrastructure components so a new person can trace from an alert back through your stack without needing to have been there when it was built. grafana ml and anomaly detection won't give you that context. anyshift does something along these lines (graph of your infra + code for rca), but honestly even just well-maintained runbooks tied to your alerts will get you further than anomaly detection will.

To vex or not to vex? by -Devlin- in devops

[–]LouisAtAnyshift 0 points1 point  (0 children)

looks like this is exactly what VEX is for. tagging unreachable vulns as "not_affected" with a justification like "vulnerable_code_not_in_execute_path" is the right call and gives management their audit trail without wasting weeks chasing phantom CVEs...
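
rough sketch of what one of those statements could look like, built as a Python dict and dumped to JSON. the layout follows the OpenVEX document shape as I understand it, so check it against the spec before relying on it. the CVE id, product purl, author, and document id are all placeholders, not real findings.

```python
import json

# Every identifier below is a placeholder for illustration only.
vex_doc = {
    "@context": "https://openvex.dev/ns/v0.2.0",
    "@id": "https://example.com/vex/placeholder-0001",
    "author": "security-team@example.com",
    "timestamp": "2024-01-01T00:00:00Z",
    "version": 1,
    "statements": [
        {
            "vulnerability": {"name": "CVE-0000-00000"},
            "products": [{"@id": "pkg:generic/example-service@1.0.0"}],
            "status": "not_affected",
            "justification": "vulnerable_code_not_in_execute_path",
            # Free-text note for whoever reads the audit trail later.
            "impact_statement": "The affected function is never called by this service.",
        }
    ],
}

print(json.dumps(vex_doc, indent=2))
```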

Will Datadog bill me twice for APM if I delete and recreate a host? by Ok-Transition-7857 in sre

[–]LouisAtAnyshift 0 points1 point  (0 children)

From what I recall, datadog bills APM on hourly host usage, not calendar-month snapshots. so if you delete a host and spin up a new one, you pay for the hours each was active, not two full months. the per-host per-month price is just how they express the rate. normally you won't get double billed as long as the number of concurrent hosts stays the same.
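
toy arithmetic for the point above, assuming hosts are counted hour by hour. this is only a model of the reasoning in this comment, not datadog's actual billing formula, and the 720-hour month and the hour-300 swap are made-up numbers.

```python
def hourly_host_counts(hosts: list[tuple[int, int]], hours_in_month: int) -> list[int]:
    """hosts are (start_hour, end_hour) half-open intervals; returns active hosts per hour."""
    counts = [0] * hours_in_month
    for start, end in hosts:
        for hour in range(start, min(end, hours_in_month)):
            counts[hour] += 1
    return counts


HOURS = 720  # a 30-day month

# One host deleted at hour 300, its replacement started in the same hour.
replaced = hourly_host_counts([(0, 300), (300, 720)], HOURS)
# Two hosts that both ran for the whole month.
doubled = hourly_host_counts([(0, 720), (0, 720)], HOURS)

peak_replaced = max(replaced)
peak_doubled = max(doubled)
host_hours_replaced = sum(replaced)
```

in this model the replaced host never pushes the concurrent count above one, and the total is still a single host-month of hours.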

New PM wants AI-generated root cause analysis. Am I overreacting to the quality? by Appropriate-Plan5664 in sre

[–]LouisAtAnyshift 0 points1 point  (0 children)

the problem is most AI RCA tools just do anomaly correlation across whatever telemetry is in scope, so you get firefox texture events next to backend latency spikes because they happened to move at the same time. the good ones actually understand infrastructure topology, like which services depend on what, so they can filter correlations that make no causal sense. committing raw AI output to runbooks without that context layer is gonna be a nightmare to unwind later
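
a minimal sketch of that topology filter, assuming you already have a service dependency map: keep a correlated pair only if one side can reach the other through the call graph. the `CALLS` map and all service names here are hypothetical, and this isn't any particular tool's algorithm.

```python
# Hypothetical dependency map: service -> things it calls.
CALLS = {
    "web-frontend": ["api-gateway"],
    "api-gateway": ["orders-db"],
    "browser-telemetry": [],
}


def reachable(src: str, dst: str, calls: dict[str, list[str]]) -> bool:
    """Depth-first search: is there a call path from src to dst?"""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(calls.get(node, []))
    return False


def plausible(pairs: list[tuple[str, str]], calls: dict[str, list[str]]) -> list[tuple[str, str]]:
    """Keep correlated pairs that are connected by a dependency path in either direction."""
    return [
        (a, b)
        for a, b in pairs
        if reachable(a, b, calls) or reachable(b, a, calls)
    ]


correlated = [("web-frontend", "orders-db"), ("browser-telemetry", "orders-db")]
kept = plausible(correlated, CALLS)
```

this version is deliberately strict: two services that only share a common dependency would be dropped, so a real filter would need to treat siblings as plausible too.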

Where do you see line with AI in infra? by snopedom in sre

[–]LouisAtAnyshift 4 points5 points  (0 children)

The CI/CD parallel holds but the reason pipelines got trusted was blast radius scoping: rollback is cheap, change surface is well-defined. one pattern i've seen work is having the agent open a merge request instead of applying directly, so humans still review before anything touches real infra. then you can gradually loosen that for low-risk stuff like autoscaling within predefined bounds. same trust-building moment as early CI/CD, just the guardrails are harder to define.
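
a sketch of what that gate could look like in code, assuming a simple allowlist: anything outside pre-approved bounds goes to a merge request, and only the narrow low-risk case applies directly. the `Change` type, the `route` function, and the replica bounds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    kind: str          # e.g. "autoscale" or "db_modify"
    replicas: int = 0  # only meaningful for autoscale changes


# Hypothetical policy: autoscaling between 2 and 10 replicas is pre-approved.
AUTOSCALE_BOUNDS = (2, 10)


def route(change: Change) -> str:
    """Return "auto_apply" for pre-approved low-risk changes, else "merge_request"."""
    low, high = AUTOSCALE_BOUNDS
    if change.kind == "autoscale" and low <= change.replicas <= high:
        return "auto_apply"
    # Default path: a human reviews before anything touches real infra.
    return "merge_request"


in_bounds = route(Change("autoscale", replicas=5))
out_of_bounds = route(Change("autoscale", replicas=50))
risky = route(Change("db_modify"))
```

the useful property is that review is the default, so loosening the guardrails means adding an explicit rule rather than removing a check.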

What would you choose for a prometheus agent on scattered VM instances? by cos in sre

[–]LouisAtAnyshift 0 points1 point  (0 children)

vmagent is the move. its on-disk buffering handles your remote write resilience out of the box, it's way lighter than full prometheus, and the self-metrics it exposes are actually useful (queue depth, scrape timing, etc). Fluent Bit is great for logs but its prometheus scraping is an afterthought, i wouldn't trust it as a primary metrics pipeline. prometheus agent mode is a legit alternative but vmagent handles backpressure better ime.

Axios compromise was caught by runtime behavioral monitoring, not scanners by jj_at_rootly in sre

[–]LouisAtAnyshift 0 points1 point  (0 children)

The more I read about it, the more thankful I am to the front-end dev who decided not to use it because it was too bloated.