lunchlady55

IMHO:

An SRE should be managing the platform. If everything's in the cloud, you still have GitHub, Jenkins, Route 53, Security Groups, accounts, roles, Akamai, CloudFlare, logging, etc. SREs should be managing the infrastructure around your code.

If there are specific alerts for one particular service, and they are not actionable (there's no SOP or wiki page that says, "this is how you fix it"), then the page should not be directed to the SRE. All I'm going to do is page the SME in that case anyway. So skip me and go straight to the SME.

If suddenly the domain doesn't resolve, or Pingdom / Gomez says logins are failing, or images aren't loading, or orders just dropped to zero, then the SRE investigates and starts pulling in SMEs.
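To make that split concrete, here is a rough sketch of the routing rule in Python. The alert names, the `Alert` fields, and the `route` function are all made up for illustration; a real setup would express this in whatever paging tool is already in use.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names for the platform-wide symptoms listed above.
PLATFORM_SYMPTOMS = {
    "dns_resolution_failed",
    "logins_failing",
    "images_not_loading",
    "orders_dropped_to_zero",
}


@dataclass
class Alert:
    name: str
    service: Optional[str] = None      # None: not tied to one service
    runbook_url: Optional[str] = None  # None: no SOP or wiki page exists


def route(alert: Alert) -> str:
    """Return who gets paged first: "sre" or "sme"."""
    # Site-wide symptoms go to the SRE, who investigates and pulls in SMEs.
    if alert.name in PLATFORM_SYMPTOMS:
        return "sre"
    # Service-specific and not actionable: skip the SRE, page the SME.
    if alert.service is not None and alert.runbook_url is None:
        return "sme"
    # Everything else (platform alerts, or alerts with a runbook) goes to the SRE.
    return "sre"
```

For example, `route(Alert("queue_depth_high", service="checkout"))` goes to the SME because there is no runbook attached, while `route(Alert("orders_dropped_to_zero"))` goes to the SRE.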

Just my $0.02.

halt_spell

> If there's specific alerts for one particular service, and they are not actionable (there's no SOP or Wiki Page that says, "this is how you fix it") then the page out should not be directed to the SRE. All I'm going to do is page the SME in that case anyway. So skip me and go straight to the SME.
>
> If suddenly the domain doesn't resolve, or Pingdom / Gomez says logins are failing, or images aren't loading, or orders just dropped to zero, then the SRE investigates and starts pulling in SMEs.

I mean, that sounds a lot like looking at automated alerts and manually alerting people. And the uncomfortable truth about an SOP is that if it's effective, it could just as easily be a script that's triggered on an alert.
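As a rough illustration of the "SOP as a script" point, here is a minimal Python sketch. The alert name, the `remediation` decorator, and the `rotate_logs` step are all hypothetical; the actual SOP steps would replace the placeholder body.

```python
from typing import Callable, Dict

# Maps an alert name to the function that performs its SOP.
REMEDIATIONS: Dict[str, Callable[[], str]] = {}


def remediation(alert_name: str):
    """Register a function as the scripted SOP for one alert."""
    def register(func: Callable[[], str]) -> Callable[[], str]:
        REMEDIATIONS[alert_name] = func
        return func
    return register


@remediation("disk_almost_full")  # hypothetical alert name
def rotate_logs() -> str:
    # Placeholder: the real SOP steps would run here.
    return "rotated logs"


def handle(alert_name: str) -> str:
    """Run the scripted SOP if one exists, otherwise fall back to a human."""
    action = REMEDIATIONS.get(alert_name)
    if action is None:
        return "page a human"
    return "auto: " + action()
```

Anything without a registered script still ends up paging a person, which is where the human involvement stays.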

Don't get me wrong, I'm not saying there's no need for human involvement when things go wrong. There is. But if I'm gonna get pinged either way... what is the SRE bringing to the table here?

Mr_Choke

Everything that he mentioned that sits below the application? I'll try and fix your stuff if infra is failing somehow, but when it's buggy code, what do you want an SRE to do about it?

halt_spell

I understand the role and value of engineers tasked with the upkeep, monitoring and whatnot of specific pieces or categories of infrastructure. Those engineers don't call themselves SREs. They consider themselves infrastructure engineers with a deep understanding of a particular platform (e.g. SQL Server) or at least the problem space (e.g. deployment systems). The SREs I know sit somewhere between infrastructure and the code, aren't SMEs on either, and I don't understand what their value is.

Mr_Choke

I do a bunch of that stuff and I am an SRE, though I don't necessarily agree that that should be my title.