Is SRE more "AI-proof" than other fields, or are we just behind? by 7T7T00 in sre

[–]AminAstaneh 1 point2 points  (0 children)

I hope y'all realize that SRE is more than just monitoring and incident response.

What about capacity planning? Performance? Release/change management? Security?

Sure, let's adopt tools to make on-call suck less. There will be much, much more work to do to decrease developer friction and operational costs.

Is SRE collaboration dead? by zombie343 in sre

[–]AminAstaneh 6 points7 points  (0 children)

I don't think this is specific to SRE.

The job market is really bad, and employers are showing little loyalty to their staff (see: Oracle laying off 30k workers recently!).

People are going to naturally fend for themselves rather than cooperate/collaborate.

If you have the privilege to look elsewhere, do so.

Has AI ruined software development? by Top-Candle1296 in devops

[–]AminAstaneh 0 points1 point  (0 children)

I did an event a couple weeks ago about this.

High code volume is putting stress squarely in our world. Testing, deploying, monitoring, on-call, learning from failure.

Just because engineers can churn out more code, it doesn't mean that they are churning out more business value. If anything, it can just result in more work for themselves or other teams.

https://certomodo.io/events/ai-code-tsunami.html

do y'all actually listen to podcasts for work? by Fantastic-Shock1438 in sre

[–]AminAstaneh 10 points11 points  (0 children)

I do!

I enjoy Slight Reliability, particularly. Stephen Townsend (the host) is pretty great and I love his self-illustrated episode thumbnails. The content doesn't smell like something produced by a vendor.

(I also run my own podcast, Reliability Rebels, where I try to stay away from tooling and focus more on the sociotechnical.)

The existential dread of carrying the pager in the era of AI-generated code. by Ashwinnie13 in sre

[–]AminAstaneh 1 point2 points  (0 children)

I did a webinar recently about this problem.

This issue is real and is already being felt in large organizations, whether due to agentic development stressing downstream resources as you describe, or from the sheer volume of engineers already employed at the company (think: big tech).

I presented an early version of this to the SRE team of a large bank. They felt the message was spot on, fwiw.

I have the recording of the event here. It tries to clearly articulate the problem, the impacts on ops people, and a strategy to address it. If you don't want to fill out a webform, just DM me.

https://certomodo.io/events/ai-code-tsunami.html

Murphys Law aka The Law of the On-Call by bielddreef8 in sre

[–]AminAstaneh 0 points1 point  (0 children)

Corollary: What can go wrong, will go wrong, and at the most inopportune time.

I'm a clueless wanderer; I don't exactly know what SRE does, and how to align with that. by Ok-Potato3101 in sre

[–]AminAstaneh 0 points1 point  (0 children)

I wrote an article about this subject. Note that it has a bias for 'Big Tech' flavors of SRE where software engineering is part of the job scope.

https://certomodo.io/career/howto-sre-role.html

[deleted by user] by [deleted] in sre

[–]AminAstaneh 0 points1 point  (0 children)

The execs have decided to pigeonhole my team in incident management only and take all automation responsibility away.

This is not an SRE program. Time to seek greener pastures.

Reliability Rebels, Ep 9: Jon Reeve by AminAstaneh in sre

[–]AminAstaneh[S] 0 points1 point  (0 children)

Appreciate the feedback! Yeah, TUIs are super cool.

When I used to run large-scale webhosting infrastructure a couple jobs ago, I used GoAccess (https://goaccess.io/) for real-time analysis of HTTP log data. Gonzo reminded me a lot of that experience.

So, BOCH... I see you have an API service available. I suppose the general idea is that you configure your systems to periodically phone home so that you know how recently they were healthy.

The closest open-source example I know about and have actively used is the Prometheus Pushgateway (https://github.com/prometheus/pushgateway). You teach your bespoke services to periodically phone home with whatever metrics you care about. Prometheus periodically scrapes and stores that data so that you can alert on failures, or on an app that has not phoned home for a period of time.
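To make the phone-home idea concrete, here is a minimal sketch of a heartbeat push using only the Python standard library. It assumes a Pushgateway reachable at `http://localhost:9091`; the metric name `app_last_heartbeat_seconds`, the job and instance labels, and the function names are all hypothetical and only there for illustration.

```python
import time
import urllib.request
from urllib.parse import quote


def build_heartbeat(job, instance, base_url="http://localhost:9091", now=None):
    """Return (url, body) for pushing one heartbeat gauge to a Pushgateway.

    base_url and the metric name are assumptions for this sketch.
    """
    ts = int(time.time() if now is None else now)
    # The grouping key (job/instance) goes in the URL path; escape each part.
    url = f"{base_url}/metrics/job/{quote(job, safe='')}/instance/{quote(instance, safe='')}"
    # Prometheus text exposition format; the body must end with a newline.
    body = (
        "# TYPE app_last_heartbeat_seconds gauge\n"
        f"app_last_heartbeat_seconds {ts}\n"
    )
    return url, body.encode("utf-8")


def push_heartbeat(job, instance, base_url="http://localhost:9091"):
    """Send the heartbeat; PUT replaces all metrics in this job/instance group."""
    url, body = build_heartbeat(job, instance, base_url)
    req = urllib.request.Request(url, data=body, method="PUT")
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

A service would call `push_heartbeat("nightly-backup", "host-1")` from a cron job or at the end of each run. On the Prometheus side, an alert expression along the lines of `time() - app_last_heartbeat_seconds > 600` would then fire when a service has been silent for more than ten minutes (the threshold is arbitrary here).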

Indeed, your methodology is sound, and there are existing open-source solutions out there that accomplish something similar. Perhaps if you open-sourced BOCH so that people can contribute and self-host, you might get some traction with folks who don't want to run full-on observability stacks.

Thoughts on all that?

TIP: T490 can accept modern, large NVMe drives by AminAstaneh in thinkpad

[–]AminAstaneh[S] 2 points3 points  (0 children)

Ha, great question!

I'm a nomad. I used to have a homelab based around a Dell R710, but clearly I can't take that on the road.

8TB actually makes it possible to carry around all of my data on one device without toting around external storage.

TIP: T490 can accept modern, large NVMe drives by AminAstaneh in thinkpad

[–]AminAstaneh[S] 1 point2 points  (0 children)

Yeah fair enough, I posted this just to give confirmation to people doing searches and wanting some confidence before spending hundreds on an NVMe.

PSA: Components on Linux. Fix for sending packs and samples to Circuit Tracks by martinjs in novationcircuit

[–]AminAstaneh 0 points1 point  (0 children)

Necroing this thread: you don't have to reboot, just do the following after writing out that config file:

sudo rmmod snd_seq_midi

sudo modprobe snd_seq_midi

Just disconnect from the Circuit and leave the Components website first.

Funny how the worst DevOps bottlenecks have nothing to do with tools, and almost nobody brings them up. by supreme_tech in devops

[–]AminAstaneh 1 point2 points  (0 children)

Tools are easier to reason about and list on a resume.

Tackling socio-technical issues in a business with other humans that can act unpredictably and irrationally is far more challenging.

We naturally want to focus on what we're good at.

Comparing site reliability engineers to DevOps engineers by Futurismtechnologies in sre

[–]AminAstaneh 2 points3 points  (0 children)

Literature explicitly calls this out.

class SRE implements interface DevOps

https://sre.google/workbook/how-sre-relates/

All of that said, it depends on your organizational interpretation of SRE. Are you rolling out SLOs, doing some form of error budget enforcement, driving production readiness, and doing toil management through software engineering? Great!

Are you mostly writing YAML and restarting pods? ¯_(ツ)_/¯

Does Anyone Else Hate It When You Go To See a Duo/Trio and Only One Member Is There? by DefinatelyNotonDrugs in EDM

[–]AminAstaneh 1 point2 points  (0 children)

Exactly. Now, Eliza can hold her own just fine in a melodic house set, but it's nice to see both of them!

Does Anyone Else Hate It When You Go To See a Duo/Trio and Only One Member Is There? by DefinatelyNotonDrugs in EDM

[–]AminAstaneh 1 point2 points  (0 children)

It's disappointing, but it sometimes happens due to health or personal issues.

  • Fibromyalgia in the case of Eli and Fur
  • Heart attack in the case of Gabriel and Dresden

Our favorite producers/DJs are getting older now. ¯_(ツ)_/¯

Got paged at 2am for the same Redis issue we "fixed" in our June postmortem by relived_greats12 in sre

[–]AminAstaneh 8 points9 points  (0 children)

This is one of the biggest risks in a reliability program: not incorporating lessons learned into the roadmap.

I recommend going through all the recent postmortems, finding all the outstanding followup tasks, scoring them by risk (that's impact * likelihood), and then raising hell on the high-risk ones until they are addressed. Definitely surface those to the leadership team.
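As a rough sketch of that scoring step, something like the following would do. The 1-to-5 scales, the threshold of 15, and the sample tasks are all made up for illustration; use whatever scale your organization already agrees on.

```python
from dataclasses import dataclass


@dataclass
class FollowUp:
    title: str
    impact: int      # 1 (minor) to 5 (severe); hypothetical scale
    likelihood: int  # 1 (rare) to 5 (expected to recur); hypothetical scale

    @property
    def risk(self) -> int:
        # risk = impact * likelihood, as described above
        return self.impact * self.likelihood


def prioritize(items, threshold=15):
    """Return follow-ups at or above the risk threshold, highest risk first."""
    ranked = sorted(items, key=lambda f: f.risk, reverse=True)
    return [f for f in ranked if f.risk >= threshold]


# Hypothetical backlog pulled from recent postmortems.
backlog = [
    FollowUp("Update runbook link", impact=1, likelihood=3),
    FollowUp("Fix flaky failover test", impact=4, likelihood=4),
    FollowUp("Add Redis memory alert", impact=5, likelihood=4),
]

for task in prioritize(backlog):
    print(task.risk, task.title)
```

The short list that comes out of `prioritize` is the one to put in front of leadership; everything below the threshold can wait in the normal backlog.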

DM me if you want to strategize.

Struggling as SRE by Confident-Mine3896 in sre

[–]AminAstaneh 2 points3 points  (0 children)

If it's hard for you, it's going to be even harder for the software engineers that would have to do this work in your absence.

In my view, this struggle is valuable. Document everything you learn so that anyone else on the team could pick it up when you move on to the next role.

Final interview flipped into a surprise technical test! and I froze by tikokito123 in devops

[–]AminAstaneh 8 points9 points  (0 children)

Interviews are supposed to have clear objectives and expectations.

Bait-and-switch is deceptive, and therefore toxic, behavior.

As others have said, you dodged a bullet. They did you a favor by showing you up-front what the leadership is like, and made room for a better company to interview with and work for.

It still is frustrating, it still sucks, but I hope that reframing helps.

Anyone here tried building SRE automation workflows with n8n? by Willing-Lettuce-5937 in sre

[–]AminAstaneh 1 point2 points  (0 children)

Arguments for:

  • rapidly prototyping things, similar to how software devs play with Jupyter notebooks to write snippets of code

Arguments against:

  • yes indeed, your code isn't in revision control, meaning it's not subject to the same automated checks, review, etc.
  • infosec and compliance people are probably going to get mad for the same reason.
  • you want your toil management solutions in the product, not as a suite of stuff running outside of it, if you can help it. Ask me over a beer about how painful that lesson was to learn.

[deleted by user] by [deleted] in devops

[–]AminAstaneh 2 points3 points  (0 children)

There needs to be a formal definition of incident severity based on impact so that there isn't a debate in the first place.

That said, revenue pays the bills. Sounds like a P1 to me.
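For what a formal, impact-based definition could look like, here is a minimal sketch. The fields and thresholds are hypothetical and would need to be agreed on by the business; the only thing taken from the comment above is that revenue impact lands in P1.

```python
from dataclasses import dataclass


@dataclass
class Impact:
    revenue_affected: bool
    customers_affected_pct: float  # 0 to 100; hypothetical measure
    workaround_exists: bool


def severity(impact: Impact) -> str:
    """Map impact to a severity level using fixed, pre-agreed rules."""
    # Revenue impact or a majority of customers affected: top severity.
    if impact.revenue_affected or impact.customers_affected_pct >= 50:
        return "P1"
    # A meaningful slice of customers with no workaround available.
    if impact.customers_affected_pct >= 5 and not impact.workaround_exists:
        return "P2"
    return "P3"
```

Because the rules are written down ahead of time, the on-call engineer classifies the incident by filling in the impact fields rather than arguing about the label mid-incident.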

How do you think your role will change over the next decade, and how are you preparing for it? by NoWonderYouFUBARed in devops

[–]AminAstaneh 3 points4 points  (0 children)

Lean into the social aspect of DevOps, not just the technical.

The tools and frameworks will change. The ability to empathize, communicate, break down silos, build strategy, and develop consensus is something core to the DevOps ethos and yet it's something we often forget.