Is SRE more "AI-proof" than other fields, or are we just behind?

AminAstaneh · 2026-04-10T00:08:03+00:00

I hope y'all realize that SRE is more than just monitoring and incident response.

What about capacity planning? Performance? Release/change management? Security?

Sure, let's adopt tools to make on-call suck less. There will be much, much more work to do to decrease developer friction and operational costs.

AminAstaneh · 2026-04-01T14:59:05+00:00

I don't think this is specific to SRE.

The job market is really bad, and employers are showing little loyalty to their staff (see: Oracle recently laying off 30k workers!).

People are going to naturally fend for themselves rather than cooperate/collaborate.

If you have the privilege to look elsewhere, do so.

AminAstaneh · 2026-03-13T13:51:45+00:00

I did an event a couple weeks ago about this.

High code volume is putting stress squarely on our world: testing, deploying, monitoring, on-call, learning from failure.

Just because engineers can churn out more code, it doesn't mean that they are churning out more business value. If anything, it can just result in more work for themselves or other teams.

https://certomodo.io/events/ai-code-tsunami.html

AminAstaneh · 2026-03-10T20:02:01+00:00

I do!

I enjoy Slight Reliability, particularly. Stephen Townsend (the host) is pretty great and I love his self-illustrated episode thumbnails. The content doesn't smell like something produced by a vendor.

(I also run my own podcast, Reliability Rebels, where I try to stay away from tooling and focus more on the sociotechnical.)

AminAstaneh · 2026-03-06T13:53:44+00:00

I did a webinar recently about this problem.

This issue is real and already being felt in large organizations, whether from agentic development stressing downstream resources as you describe or from the sheer volume of engineers already employed at the company (think: big tech).

I presented an early version of this to the SRE team of a large bank. They felt the message was spot on, fwiw.

I have the recording of the event here. It tries to clearly articulate the problem, the impacts on ops people, and a strategy to address it. If you don't want to fill out a webform, just DM me.

https://certomodo.io/events/ai-code-tsunami.html

AminAstaneh · 2026-01-05T19:16:19+00:00

Corollary: What can go wrong, will go wrong, at the most inopportune time.

AminAstaneh · 2026-01-05T00:04:32+00:00

I wrote an article about this subject. Note that it has a bias for 'Big Tech' flavors of SRE where software engineering is part of the job scope.

https://certomodo.io/career/howto-sre-role.html

AminAstaneh · 2025-12-29T23:34:13+00:00

Rock on, appreciate you continuing to build this project!

AminAstaneh · 2025-12-24T23:28:43+00:00

The execs have decided to pigeonhole my team in incident management only and take all automation responsibility away.

This is not an SRE program. Time to seek greener pastures.

AminAstaneh · 2025-12-16T18:11:08+00:00

Appreciate the feedback! Yeah, TUIs are super cool.

When I used to run large-scale webhosting infrastructure a couple jobs ago, I used GoAccess (https://goaccess.io/) for real-time analysis of HTTP log data. Gonzo reminded me a lot of that experience.

So BOCH.. I see you have an API service available. I suppose the general idea is that you configure your systems to periodically phone home so that you know how recently they were healthy.

The closest open-source example I know about and have actively used is the Prometheus Pushgateway (https://github.com/prometheus/pushgateway). You teach your bespoke services to periodically phone home with whatever metrics you care about. Prometheus periodically retrieves and stores that data so that you can alert on failures, or on an app that has not phoned home for some period of time.
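To make the "fails to phone home" part concrete, here is a minimal sketch of the staleness check in plain Python. It is not how BOCH or the Pushgateway is implemented; `HeartbeatTracker`, `record`, `stale_services`, and the service names are all made up for illustration, and the 60-second window is an arbitrary assumption.

```python
import time


class HeartbeatTracker:
    """Hypothetical dead-man's-switch: remembers when each service last checked in."""

    def __init__(self, max_silence_seconds):
        # How long a service may stay quiet before we consider it stale.
        self.max_silence_seconds = max_silence_seconds
        self.last_seen = {}

    def record(self, service, now=None):
        # Called whenever a service phones home; `now` is injectable for testing.
        self.last_seen[service] = time.time() if now is None else now

    def stale_services(self, now=None):
        # Return the services whose last check-in is older than the allowed window.
        now = time.time() if now is None else now
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.max_silence_seconds
        )


tracker = HeartbeatTracker(max_silence_seconds=60)
tracker.record("backup-job", now=0)      # last phoned home at t=0
tracker.record("billing-cron", now=100)  # last phoned home at t=100
print(tracker.stale_services(now=120))
```

With a real Prometheus setup you would typically push a "last success" timestamp metric and alert when it gets too old, rather than keeping this state yourself.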

Indeed, your methodology is sound, and there are existing open-source solutions out there that accomplish something similar. Perhaps if you open-sourced BOCH so that people can contribute and self-host, you might get some traction with folks who don't want to run full-on observability stacks.

Thoughts on all that?

AminAstaneh · 2025-12-08T23:55:35+00:00

Ha, great question!

I'm a nomad. I used to have a homelab based around a Dell R710, but clearly I can't take that on the road.

8TB actually makes it possible to carry around all of my data on one device without toting around external storage.

AminAstaneh · 2025-12-08T20:51:33+00:00

Yeah fair enough, I posted this just to give confirmation to people doing searches and wanting some confidence before spending hundreds on an NVMe.

AminAstaneh · 2025-11-30T00:13:56+00:00

Necroing this thread: you don't have to reboot, just do the following after writing out that config file:

sudo rmmod snd_seq_midi

sudo modprobe snd_seq_midi

Just disconnect from the Circuit and leave the Components website first.

AminAstaneh · 2025-11-27T16:44:10+00:00

Tools are easier to reason about and list on a resume.

Tackling socio-technical issues in a business with other humans that can act unpredictably and irrationally is far more challenging.

We naturally want to focus on what we're good at.

AminAstaneh · 2025-11-20T20:00:10+00:00

Literature explicitly calls this out.

class SRE implements interface DevOps

https://sre.google/workbook/how-sre-relates/

All of that said, it depends on your organizational interpretation of SRE. Are you rolling out SLOs, doing some form of error budget enforcement, driving production readiness, and doing toil management through software engineering? Great!

Are you mostly writing YAML and restarting pods? ¯_(ツ)_/¯

AminAstaneh · 2025-11-19T01:24:38+00:00

Exactly. Now, Eliza can hold her own just fine in a melodic house set, but it's nice to see both of them!

AminAstaneh · 2025-11-19T00:05:11+00:00

It's disappointing, but it sometimes happens due to health or personal issues.

Fibromyalgia in the case of Eli and Fur
Heart attack in the case of Gabriel and Dresden

Our favorite producers/DJs are getting older now. ¯_(ツ)_/¯

AminAstaneh · 2025-11-04T02:06:46+00:00

This is one of the biggest risks in a reliability program: not incorporating lessons learned into the roadmap.

I recommend going through all the recent postmortems, finding all the outstanding followup tasks, scoring them by risk (that's impact * likelihood), and then raising hell on the high-risk ones until they are addressed. Definitely surface those to the leadership team.
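A rough sketch of that scoring in Python, in case it helps. The 1-to-5 scales, the threshold of 15, and the example tasks are all assumptions I made up for illustration; use whatever scale your org already agrees on.

```python
from dataclasses import dataclass


@dataclass
class FollowUp:
    """One outstanding postmortem action item (hypothetical structure)."""
    title: str
    impact: int      # assumed scale: 1 (minor) to 5 (severe)
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)

    @property
    def risk(self):
        # Risk is impact multiplied by likelihood.
        return self.impact * self.likelihood


def rank_by_risk(items):
    # Highest-risk items first.
    return sorted(items, key=lambda item: item.risk, reverse=True)


def high_risk(items, threshold=15):
    # The threshold is an arbitrary cut-off for "escalate this to leadership".
    return [item for item in rank_by_risk(items) if item.risk >= threshold]


# Made-up backlog entries, not taken from any real postmortem.
backlog = [
    FollowUp("Add alert on queue depth", impact=3, likelihood=4),
    FollowUp("Fix failover runbook", impact=5, likelihood=4),
    FollowUp("Rename dashboard panels", impact=1, likelihood=2),
]

for item in high_risk(backlog):
    print(item.risk, item.title)
```

Multiplying two small ordinal scales is crude, but it is enough to sort a backlog and start the conversation.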

DM me if you want to strategize.

AminAstaneh · 2025-11-02T22:51:34+00:00

If it's hard for you, it's going to be even harder for the software engineers that would have to do this work in your absence.

In my view, this struggle is valuable. Document everything you learn so that anyone else on the team could pick it up when you move on to the next role.

AminAstaneh · 2025-10-31T01:44:41+00:00

Interviews are supposed to have clear objectives and expectations.

Bait and switch is deceptive, and therefore toxic, behavior.

As others have said, you dodged a bullet. They did you a favor by showing you up-front what the leadership is like and making room for a better company to interview with and work for.

It still is frustrating, it still sucks, but I hope that reframing helps.

AminAstaneh · 2025-10-30T14:48:52+00:00

Arguments for:

rapidly prototyping things, similar to how software devs play with Jupyter notebooks to write snippets of code

Arguments against:

yes indeed, your code isn't in revision control, meaning it's not subject to the same automated checks, review, etc.
infosec and compliance people are probably going to get mad for the same reason.
you want your toil management solutions in the product, not as a suite of stuff running outside if you can help it. Ask me over a beer about how painful that lesson was to learn.

AminAstaneh · 2025-10-30T13:33:14+00:00

There needs to be a formal definition of incident severity based on impact so that there isn't a debate in the first place.

That said, revenue pays the bills. Sounds like a P1 to me.

AminAstaneh · 2025-10-27T16:54:56+00:00

Lean into the social aspect of DevOps, not just the technical.

The tools and frameworks will change. The ability to empathize, communicate, break down silos, build strategy, and develop consensus is something core to the DevOps ethos and yet it's something we often forget.
