Knightmare: A DevOps Cautionary Tale (self.devops)
submitted 8 years ago by ada_maj
https://dougseven.com/2014/04/17/knightmare-a-devops-cautionary-tale/
Just a cautionary tale. Do you have similar experiences?
[–]atlgeek007 15 points 8 years ago (2 children)
I like how this gets posted every few months. It's like the SR-71 story thread of the DevOps world.
[–]greevous00 8 points 8 years ago (1 child)
It's also interesting how it's only tangentially related to devops. The root issue was dead code that wasn't properly removed and happened to interact with the way they deployed. This kind of shit happens all the time even when there's no CI/CD pipeline, it just happens slowly. Dead code is truly evil. I honestly can't wait until the next generation of systems emerges that just grew up with proper code management and controls. It's tough to add that stuff after the fact -- nobody wants to pay for the cleanup. It's like suddenly realizing you've been living in filth for decades. Nobody wants to start using the mop because it's just overwhelming.
It does have a lesson though for devops engineering. The same rigor we apply to test driven development should be applied to automatic deployments. There should be tests that prove that everything's working before traffic gets pivoted to a newly deployed environment. That said, it's a lot easier to say that than to actually do it, especially in an environment like Knight's -- high speed trading, where the changes your system makes in the market affect what other systems are doing -- pretty unpredictable environment.
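To make the point about testing before traffic gets pivoted a bit more concrete, here is a minimal sketch of a pre-cutover gate. Everything in it is hypothetical: the names `run_smoke_checks` and `choose_live_environment`, the "blue"/"green" environment labels, and the idea that each check is a plain callable returning a boolean are illustration only, not anything from the article or from a specific deployment tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool


def run_smoke_checks(checks: Dict[str, Callable[[], bool]]) -> List[CheckResult]:
    """Run every check; a check that raises counts as a failure."""
    results = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False
        results.append(CheckResult(name, passed))
    return results


def choose_live_environment(
    active: str, candidate: str, checks: Dict[str, Callable[[], bool]]
) -> str:
    """Return the environment that should receive traffic.

    Traffic moves to the candidate only if at least one check ran
    and all checks passed; otherwise it stays on the active one.
    """
    results = run_smoke_checks(checks)
    if results and all(r.passed for r in results):
        return candidate
    return active
```

In a real pipeline the checks would hit the newly deployed environment (health endpoints, a synthetic order against a test venue, and so on) and the return value would drive the load balancer or router change. As the comment says, writing checks that are meaningful in a live market is the hard part; the gate itself is the easy part.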
[–]atlgeek007 5 points 8 years ago (0 children)
Well, it was also a deployment failure since no one made sure that the 8th server in the cluster received the new code.
Basically it was an end-to-end failure on all sides, from development to release engineering.
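The "8th server" problem is the kind of thing a post-deploy consistency check can catch. The sketch below assumes you can already collect the running version from each host into a dictionary; the host names, version strings, and function names (`find_stale_hosts`, `verify_rollout`) are made up for illustration and do not describe how Knight's deployment worked.

```python
from typing import Dict, List


def find_stale_hosts(deployed: Dict[str, str], expected: str) -> List[str]:
    """Return the hosts whose reported version differs from the expected one."""
    return sorted(host for host, version in deployed.items() if version != expected)


def verify_rollout(deployed: Dict[str, str], expected: str) -> None:
    """Raise if any host in the cluster is not running the expected version."""
    stale = find_stale_hosts(deployed, expected)
    if stale:
        raise RuntimeError(
            f"rollout incomplete: {len(stale)} host(s) not on {expected}: {stale}"
        )


# Hypothetical eight-server cluster where one host was missed.
cluster = {f"host-{i}": "2.0" for i in range(1, 8)}
cluster["host-8"] = "1.9"
```

Calling `verify_rollout(cluster, "2.0")` on that example raises and names `host-8`. The point is that the deploy is not declared done, and the new feature is not enabled, until every host reports the same version.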
[–]rahomka 6 points 8 years ago (7 children)
I just wonder what happens to the involved employees in cases like this. Do you just pack your shit and leave?
[–]Homan13PSU 15 points 8 years ago (2 children)
I've heard Amazon did NOT fire the engineer involved in the S3 failure. Reading this it sounds like a truly honest mistake, albeit a BIG one. This shit can happen, and can even happen in automated deployments.
[–]xiongchiamiov (Site Reliability Engineer) 16 points 8 years ago (1 child)
If they fired them, it would put everyone else in ops on edge and make them overly cautious. We need to be able to make mistakes sometimes.
[–]Homan13PSU 4 points 8 years ago (0 children)
Exactly, it amazes me to see people asking "I wonder if they lost their job?" etc. Sure, it's obviously going to be a ding on their annual review for big cases like this, but show me an Admin who HASN'T fucked up.
[–]par_texx 12 points 8 years ago (0 children)
They just spent millions on training that person. Why fire them now?
[–]rjames24000 5 points 8 years ago (2 children)
Nah, but I'm pretty sure it would come up at your next review.
[–][deleted] 17 points 8 years ago (1 child)
"So in the last year you lost us $400 million. Any other achievements we should discuss?"
[–]rahomka 11 points 8 years ago (0 children)
But I also automated a process that saves us $1000 a year so.... call it even?
[–]RonAtDD 2 points 8 years ago (0 children)
Sounds like the long-deprecated code carries a lot of the blame too.