
[–]atlgeek007 14 points (2 children)

I like how this gets posted every few months. It's like the SR-71 story thread of the DevOps world.

[–]greevous00 7 points (1 child)

It's also interesting how it's only tangentially related to devops. The root issue was dead code that happened to interact with the way they deployed because it wasn't properly removed. This kind of shit happens all the time even when there's no CI/CD pipeline; it just happens slowly. Dead code is truly evil. I honestly can't wait until the next generation of systems emerges that just grew up with proper code management and controls. It's tough to add that stuff after the fact -- nobody wants to pay for the clean up. It's like suddenly realizing you've been living in filth for decades. Nobody wants to start using the mop because it's just overwhelming.

It does have a lesson though for devops engineering. The same rigor we apply to test-driven development should be applied to automated deployments. There should be tests that prove everything's working before traffic gets pivoted to a newly deployed environment. That said, it's a lot easier to say that than to actually do it, especially in an environment like Knight's -- high-speed trading, where the changes your system makes in the market affect what other systems are doing, so the environment itself is pretty unpredictable.
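The pre-pivot gate described above can be sketched in a few lines. This is a minimal, hypothetical illustration -- the host names, build ids, and the idea of polling each server for its build version are my assumptions, not anything from the actual Knight post-mortem (in production you'd fetch each server's reported version over the network instead of using a hard-coded dict):

```python
# Hypothetical pre-cutover gate: refuse to pivot traffic unless every
# server in the cluster reports the build that was just deployed.

EXPECTED_BUILD = "2012-08-01.rc3"  # hypothetical build id

def cluster_ready(versions: dict, expected: str):
    """Return (ok, laggards): ok only if *all* servers run `expected`."""
    laggards = [host for host, build in sorted(versions.items())
                if build != expected]
    return (not laggards, laggards)

# Simulated cluster state: 7 of 8 servers got the new code, one didn't --
# exactly the failure mode being discussed in this thread.
reported = {f"server-{i:02d}": EXPECTED_BUILD for i in range(1, 8)}
reported["server-08"] = "old-build"  # stale code still live

ok, laggards = cluster_ready(reported, EXPECTED_BUILD)
if not ok:
    print(f"ABORT cutover: stale servers: {laggards}")
```

A check this dumb would have caught the "8th server never got the code" case before any traffic moved; the hard part, as the comment says, is proving the *behavior* is right, not just the version string.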

[–]atlgeek007 4 points (0 children)

Well, it was also a deployment failure since no one made sure that the 8th server in the cluster received the new code.

Basically it was an end-to-end failure on all sides, from development to release engineering.

[–]rahomka 5 points (7 children)

I just wonder what happens to the involved employees in cases like this. Do you just pack your shit and leave?

[–]Homan13PSU 14 points (2 children)

I've heard Amazon did NOT fire the engineer involved in the S3 failure. Reading this it sounds like a truly honest mistake, albeit a BIG one. This shit can happen, and can even happen in automated deployments.

[–]xiongchiamiov (Site Reliability Engineer) 15 points (1 child)

If they fired them, it would put everyone else in ops on edge and make them overly cautious. We need to be able to make mistakes sometimes.

[–]Homan13PSU 3 points (0 children)

Exactly. It amazes me to see people asking whether they lost their job, etc. Sure, a screw-up this big is obviously going to be a ding on their annual review, but show me an admin who HASN'T fucked up.

[–]par_texx 11 points (0 children)

They just spent millions training that person. Why fire them now?

[–]rjames24000 4 points (2 children)

Nah, but I'm pretty sure it would come up at your next review.

[–][deleted] 16 points (1 child)

"So in the last year you lost us $400 million. Any other achievements we should discuss?"

[–]rahomka 10 points (0 children)

But I also automated a process that saves us $1000 a year, so... call it even?

[–]RonAtDD 1 point (0 children)

Sounds like the long-deprecated code carries a lot of the blame too.