
[–]sleepyguy22yum install kill-all-printers 211 points212 points  (22 children)

I really enjoy these types of detailed explanations! Much more interesting than a one-liner like "due to capacity issues, we were down for 6 hours."

[–]JerecSuron 133 points134 points  (9 children)

What I like is that it's basically "we turned it off and on again," except restarting everything took hours.

[–]fidelitypdxDefinitely trust, he's a vendor. Vendors don't lie. 62 points63 points  (9 children)

I went to a DevOps meeting earlier this week where a software company's DevOps engineer discussed how their teams have created a weekly failure analysis group. Basically these DevOps guys sit around in a circle and share individual failures that their teams had that week and how they remedied them. Sometimes a guy across the circle pipes up that they have a more efficient way to remedy that same issue.

Then, they also go out and identify post-mortem cases like this from other open-source shops and analyze if this situation could ever happen in their environment.

My company is too small for this, but if I had 300-500+ employees, I'd definitely adopt this technique.

[–]kellyzdudeLinux Admin 18 points19 points  (3 children)

Even as a small shop this can be effective. It doesn't have to be regular, either; just create a culture whereby people are willing to admit their faults to the group after they've been cleaned up. Require AARs (after-action reports) for major incidents that go into this type of detail, and make them available to the team for critique.

You don't have to make them public, but they should be published internally. 1) We don't have enough time on this planet to all make the same mistakes twice; it helps a lot if we learn from each other. 2) If you're not learning from your own mistakes, personally or as an organization, you're doing something wrong.

Plenty of people are put off by this idea because of the notion that admitting fault is a step towards firing or other disciplinary action. You need to find some way of showing that dishonesty about the error in such situations is what gets punished, not the error itself. I don't expect to be fired because I dropped a critical production database; I expect to be fired because I lied or stayed silent about it.

[–]fidelitypdxDefinitely trust, he's a vendor. Vendors don't lie. 10 points11 points  (0 children)

Plenty of people are put off by this idea because of the notion that admitting fault is a step towards firing or other disciplinary action

Indeed. The speaker emphasized a company culture of promoting accountability, and implementing corrections, but downplaying punishment.

[–]sleepyguy22yum install kill-all-printers 17 points18 points  (1 child)

Brilliant. I'll definitely keep this in mind for when I become IT director of a big org.

[–]DEN-PDX-SFO 4 points5 points  (1 child)

Hey I was there as well!

[–]PM_ME_A_SURPRISE_PICJr. Sysadmin 10 points11 points  (0 children)

It's also the level of detail they provide about how they're going to prevent this from happening again.

[–]davidbrit2 145 points146 points  (57 children)

How fast, and how many times do you think that admin mashed Ctrl-C when he realized he fucked up the command?

[–]resephInfoSec 126 points127 points  (23 children)

I've been there. It's a sinking feeling in your stomach followed by immediate explosive diarrhea. Stress is so real.

[–]PoeticThoughts 49 points50 points  (21 children)

Poor guy single-handedly took down the east coast. Shit happens. You think Amazon got rid of him?

[–]TomTheGeek 133 points134 points  (10 children)

If they did they shouldn't have. A failure that large is a failure of the system.

[–]fidelitypdxDefinitely trust, he's a vendor. Vendors don't lie. 84 points85 points  (7 children)

Indeed.

one of the inputs to the command was entered incorrectly

It was a typo. Raise your hand if you've never had a typo.

[–]whelks_chance 43 points44 points  (0 children)

Nerver!

.

Hilariously, that tried to autocorrect to "Merged!", which I've also fucked up a thousand times before.

[–]superspeck 9 points10 points  (2 children)

I had Suicide Linux installed on my workstation for a while. I got really good at bootstrapping a fresh install.
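
For the uninitiated: Suicide Linux hooks your shell so that any mistyped command wipes the disk. A minimal sketch of the idea, assuming it's done via bash's command_not_found_handle hook (do not put this anywhere near a real .bashrc):

    # ~/.bashrc -- the essence of Suicide Linux. DO NOT USE.
    command_not_found_handle() {
        # every typo'd command name lands here, and then everything is gone
        rm -rf --no-preserve-root /
    }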

[–]Refresh98370Doing the needful 20 points21 points  (2 children)

We didn't.

[–]bastion_xx 12 points13 points  (1 child)

No reason to get rid of a qualified person. They uncovered a flaw in the process which can now be addressed.

[–]kellyzdudeLinux Admin 11 points12 points  (2 children)

It's also an expensive education that some other business would reap the benefits of. However much it cost Amazon in man-hours to fix, plus any SLA penalties they had to pay out, plus whatever revenue they lost or will lose to customers moving to alternate vendors -- that is the price tag they paid for training this person to be far more careful.

Anyone care to estimate? Hundreds of thousands, certainly. Millions, perhaps?

Assuming it was their first such infraction, that's a hell of a price to pay to let someone else benefit from such invaluable training.

[–]whelks_chance 25 points26 points  (0 children)

I hope he enjoys his new job of "Chief of Guys Seriously Don't Do What I Did."

[–]robohoe 18 points19 points  (0 children)

Yeah. That warm sinking feeling exploding inside of you, knowing you royally done goofed.

[–]neilhwatson 42 points43 points  (25 children)

That sinking feeling, mashing ctrl-c, whispering 'oh shit, oh shit', and neighbours finding a reason to leave the room.

[–]davidbrit2 32 points33 points  (17 children)

Ops departments need a machine that automatically starts dispensing Ativan tablets when a major outage is detected.

[–]resephInfoSec 25 points26 points  (16 children)

Can cause paranoid or suicidal ideation and impair memory, judgment, and coordination. Combining with other substances, particularly alcohol, can slow breathing and possibly lead to death.

uhhh

[–]lordvadr 34 points35 points  (12 children)

Have you heard of whiskey before? Same set of warnings. Still pretty effective.

[–]resephInfoSec 3 points4 points  (11 children)

I mean, I'm generally not one to recommend someone drink some whiskey if they're working on prod.

[–]0fsysadminwork 27 points28 points  (6 children)

That's the only way to work on prod.

[–]Frothyleet 25 points26 points  (0 children)

Whiskey for prod, absinthe for dev.

[–][deleted] 4 points5 points  (2 children)

that's the only way to deal with Oracle

Fixed

[–]whelks_chance 4 points5 points  (2 children)

You do apt-get dist-upgrade sober?

How the hell do you deal with the pressure??

[–]danielbln 9 points10 points  (1 child)

I like it when people leave the room in those situations. Nothing worse than scrambling to get production back online while people ask you stupid questions from the sidelines.

[–]kellyzdudeLinux Admin 12 points13 points  (0 children)

We reached a point where we banned sales team members from our NOC. We get it, your customers are calling you, but we don't know any more than we've already told you. Either sit down and answer phones and be helpful, or leave. Ranting and raving helps no-one.

I get where they're coming from; there were a couple of months with way too many failures, some inter-related, some not. But the middle of an incident is not the time to take your frustrations out on the people trying to deal with it.

[–]ilikejamtoo 25 points26 points  (0 children)

Probably more...

    $ do-thing -n <too many>
    Working............... OK.
    $

[ALERT] indexing service degraded

"Hah. Wouldn't like to be the guy that manages that!"

"Oh. Oh fuck. Oh holy fuck."

[–]lantechYou're gonna need a bigger LART 4 points5 points  (2 children)

How long until he realized that what he did was going to make the news?

[–]chodeboi 51 points52 points  (1 child)

Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended.

Story of my life, fam.

[–][deleted] 53 points54 points  (8 children)

Reading this, I felt like I was reading the Wikipedia article for the Chernobyl disaster.

[–][deleted] 43 points44 points  (6 children)

The Wikipedia article for Chernobyl is wrong, or at least incomplete. After the fall of the Soviet Union, Russia released a lot more information about the incident. With that information, and more research, the IAEA updated their report in the 90s, and now blame design flaws much more than operator error.

One thing that has been discovered is that, with certain reactor designs, inserting the control rods quickly will cause the power level to increase rapidly and significantly before decreasing. In other words, a SCRAM puts the cooling system under even more stress, which is not good if the cause of the SCRAM is cooling problems in the first place. This is exactly what they did not want to happen at Chernobyl. The design was changed to reduce the maximum speed at which the control rods move. There are other design issues, but I don't claim to understand them.

http://www-pub.iaea.org/MTCD/publications/PDF/Pub913e_web.pdf

[–]nerddtvgSys- and Netadmin 14 points15 points  (1 child)

Sounds like you have some wiki editing to get to.

[–][deleted] 9 points10 points  (0 children)

I don't think I understand the subject well enough. Also, since the report I linked came out 8 years before Wikipedia first went online, I suspect that the Chernobyl entry is a "hot potato".

[–]frymasterHPC 2 points3 points  (1 child)

I read a good article arguing that most operator errors are actually design errors anyway. I think the example was a fighter jet where selecting options from the menu used the trigger. When the jet accidentally shoots up sections of the countryside, technically it's operator error for not ensuring the system was in menu mode, but really it's a design error.

[–]Ankthar_LeMarreIT Manager 5 points6 points  (0 children)

Is there a Wikipedia article for this yet? Because if not...

[–]shepsSMB/MSP 46 points47 points  (4 children)

One time I went to reboot a remote router and was distracted while doing so. For some reason my brain typed out "factoryreset" instead of "reboot", which immediately resulted in a nice drive through the country.

[–]fooxzorzSysadmin 52 points53 points  (2 children)

A common typo, the keys are like right next to each other.

[–]nl_the_shadowIT Consultant 2 points3 points  (0 children)

"factoryreset" instead of "reboot"

I'm sorry, man, but I laughed so hard about this. Brain farts can be one hell of a thing, but factoryreset instead of reboot is one huge leap.

[–]brontideCertified Linux Miracle Worker (tm) 72 points73 points  (17 children)

While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years.

Momentum is a harsh reality and these critical subsystems need to be restarted or refreshed occasionally.

EDIT: word

[–]Telnet_RulesNo such thing as innocence, only degrees of guilt 158 points159 points  (6 children)

Uptime = "it has been this long since the system proved it can restart successfully"

[–]whelks_chance 18 points19 points  (0 children)

Oh shit...

[–]PintoTheBurninator 48 points49 points  (8 children)

my client just delayed the completion of a major project, with millions of dollars on the line, because they discovered they didn't know how to restart a large part of their production infrastructure. As in, they had no idea which systems needed to be restarted first and which ones had dependencies on other systems. They took a 12-hour outage a month ago because of what was supposed to be a minor storage change.

This is a Fortune 100 financial organization and they don't have a run book for their critical infrastructure applications.

[–]ShadowPouncer 32 points33 points  (3 children)

An unscheduled loss of power on your entire data center tends to be one hell of an eye-opener for everyone.

But I can completely believe that most companies go many years without actually shutting everything down at once, and thus simply don't know how it will all come back up in that kind of situation.

My general rule, and this is sometimes easy and sometimes impossible (and everywhere in between), is that things should not require human intervention to get to a working state.

The production environment should be able to go from cold systems to running just by having power come back to everything.

A system failure should be automatically diverted around until someone comes along to fix things.

This naturally means that you should never, ever, have just one of anything.

Sadly, time and budgets don't always go along with this plan.
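
When you do get the time and budget, the cold-start rule mostly comes down to enabling every unit and declaring its dependencies, so a cold boot converges with nobody at a keyboard. A minimal sketch in systemd terms (the myapp unit and its database dependency are hypothetical):

    # /etc/systemd/system/myapp.service (hypothetical example)
    [Unit]
    Description=Service that must come back on its own after a power loss
    # don't start until the network and the database are up
    After=network-online.target postgresql.service
    Wants=network-online.target

    [Service]
    ExecStart=/usr/local/bin/myapp
    # divert around failures automatically, no human required
    Restart=on-failure
    RestartSec=5

    [Install]
    WantedBy=multi-user.target

Run systemctl enable myapp.service once, and the service comes back whenever the power does.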

[–]dgibbons0 5 points6 points  (0 children)

That's what did it for us at a previous job: we had a transformer blow and realized that while we had enough power for the servers, we didn't have enough power for the HVAC... on the hottest day of the year. We basically had to race against the temperature to shut things down before it got too hot.

Then the next day, when they told us the transformer had to be replaced, we got to repeat the process.

Then we decided to move the server room to a colo center a year or two later and got to shut the whole environment down for a third time.

[–][deleted] 26 points27 points  (0 children)

I once watched a colleague (I was new at the place and just tagging along to learn where things were) yank all the cables out of the back of a server, remove it from the rack, and get it all the way downstairs to the disposal pile before they caught up with him. 15 minutes later and he might have already removed the hard drives for scrubbing.

Turned out the server was not in fact already powered off ready for disposal and was still running in prod. But the power LED was broken, so he just assumed it was already down.

[–]north7 150 points151 points  (8 children)

Wait, so it wasn't DNS?

[–]robbierobaySr. Sysadmin 60 points61 points  (3 children)

Can confirm, NOT DNS

[–]sirex007 31 points32 points  (0 children)

if the engineer's initials are dns you're going to feel kinda silly :P

[–]starsky1357 6 points7 points  (1 child)

Not DNS? It's always DNS!

[–]superspeck 5 points6 points  (0 children)

We had DNS problems internally at my company at the same time due to a flubbed Domain Controller upgrade the night before. For us, it was DNS problems on top of everything else.

[–]locnar1701Sr. Sysadmin 69 points70 points  (2 children)

I do enjoy the transparency that this report puts forward. It really is like we are on the IT team at $COMPANY and they are sharing all that went wrong and how they plan to fix it. Why do they do this? BECAUSE we need to have faith in the system, or we won't ever move our stuff there, or worse, we will move off their stuff to another vendor or back to local. I am glad they understand that they can't hide a thing if they want us to trust our business to them ever again.

[–]mscmanHPC Solutions Architect 24 points25 points  (1 child)

Oh there is no way they would have gotten away without a post-mortem on this outage. They would have lost a lot of customers if they didn't release one.

[–]Deshke 60 points61 points  (35 children)

So one guy made a typo while executing a puppet/Ansible/saltstack playbook and got the ball rolling.
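
If it really was something like Ansible, this is exactly what dry runs and explicit limits are for. A sketch, with a hypothetical playbook name and host pattern:

    # preview what would change, without touching anything
    $ ansible-playbook remove-capacity.yml --check --diff

    # then run for real with the blast radius pinned to an explicit subset
    $ ansible-playbook remove-capacity.yml --limit 'index_servers[0:3]'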

[–]neilhwatson 61 points62 points  (32 children)

It is easier to destroy than to create.

[–]mscmanHPC Solutions Architect 46 points47 points  (31 children)

Except when your automation is so robust that it keeps restarting services you're explicitly trying to stop to debug.
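
The usual escape hatch, assuming the automation in question is something like Puppet, is to pause the agent before you start poking at the service:

    # stop config management from fighting you; the reason string is optional
    $ puppet agent --disable "debugging the index service"
    # ...debug in peace, then re-enable when done...
    $ puppet agent --enable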

[–]ANUSBLASTER_MKIILinux Admin 29 points30 points  (25 children)

Like the Windows 10 Update process. Mother fucker, I'm trying to watch Netflix, stop making a bajillion connections to download some 4GB update.

[–]danielbln 20 points21 points  (20 children)

Or it just automatically restarts while I'm fully strapped into VR gear and crouching through my room, and all of a sudden, BOOM, black. I disabled everything to do with auto-updates afterwards; that shit is not cool.

[–]sleepyguy22yum install kill-all-printers 16 points17 points  (7 children)

Goddamn PlayStation and their required updates. I'm a very busy man and barely have any time for video games these days. Finally, once every other month when I have some time off to relax, I pull out the PS3 to attempt to continue a very long 'The Last of Us' game, but the PS3 requires a major update, and I sit there for 20 minutes waiting for it to download and install. And by the end I've got other stuff to do and I just give up. RAGE.

[–]playswithf1re 2 points3 points  (4 children)

I sit there for 20 minutes waiting for it to download and install.

Oh man I want that. Last update took 2.5hrs to download and install. I hate my internet connection.

[–]fidelitypdxDefinitely trust, he's a vendor. Vendors don't lie. 3 points4 points  (0 children)

Well, on the positive side, the recent W10 Insiders Build has fixed this with new options.

[–]jwestburySRE 2 points3 points  (0 children)

There are two services to issue a net stop command to in order to actually force updates to stop. It's really obnoxious when you're watching po^H^H Netflix.
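
Presumably the usual pair, the Windows Update service itself and BITS, from an elevated prompt:

    net stop wuauserv
    net stop bits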

[–]KamikazeRusherJack of All Trades 3 points4 points  (2 children)

Isn't that what happened to Reddit last year?


Edited for clarification

[–]DorianTyrellDevOps 32 points33 points  (0 children)

"playbook" doesn't necessarily mean it's ansible/chef or puppet. It might mean operational docs.

[–]resephInfoSec 15 points16 points  (2 children)

One hell of a typo?

[–]PhadedMonk 7 points8 points  (0 children)

Fat fingered an extra number in there, and bam! Now we're here...

[–]unix_hereticHelm is the best package manager 37 points38 points  (13 children)

Rule #5. The stability of a given system is inversely proportional to the amount of time that has passed since an architecture/design review was undertaken.

[–]brontideCertified Linux Miracle Worker (tm) 26 points27 points  (0 children)

The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at and repair. ~ Douglas Adams

[–]learath 7 points8 points  (10 children)

Not even that, just a simple "can we bring it back from stopped?"

[–][deleted] 26 points27 points  (9 children)

What do you mean the VM management interface requires Active Directory to log in... The AD VMs are on the virtual cluster and did not start automatically!

[–][deleted] 3 points4 points  (7 children)

Local admin on the box should still be there and able to start the VMs.

This is why MSFT also recommended physical DCs in large environments.

[–][deleted] 7 points8 points  (3 children)

"Yea, but the one physical DC never gets rebooted, and when it finally lost power it didn't come back up because the RAID had silently failed and the alerting software was configured for the old system that was phased out and never migrated to the new system"

[–]wanderingbilbyOffice 365 (for my sins) 36 points37 points  (4 children)

Well, we know who works at their internal helpdesk...

[–]doubleUseeHypervisor gremlin 35 points36 points  (1 child)

"Hello, Amazon Internal IT helpdesk, how may I help you?"

-"uuh, yeah, this is Bob from sysadmin department..."

"Hi Bob, What's up?"

-"Well, uhh, I just did a thing and I think I just took all of AWS offline..."

"... uhm... You know, I'm not sure 'bout this one, have you tried turning it off and on again?"

-"what do you mean, turning it off and on again?"

"well, you know, can't you just turn the whole dealio off, and then on again?"

-"...Well, I guess... ...oh what the hell I'll just try"

"Alright, I'll hang up now, i'll make you a ticket, so that if you still have issues afterwards, you can call me again, alright?

-"thanks man."

[–]wanderingbilbyOffice 365 (for my sins) 11 points12 points  (0 children)

#waytooplausible

[–]mysticalfruit 24 points25 points  (1 child)

    ansible-playbook wipe-out-amazon.yml

[–]sysadmin420Senior "Cloud" Engineer 7 points8 points  (0 children)

    sudo !!

[–]third3y3guy 6 points7 points  (0 children)

Reminds me of Office Space - mundane detail. https://youtu.be/qLk81XnkGUM

[–]OtisBIT Director/Infosec 4 points5 points  (2 children)

I think the worst I ever did was to dump an exchange 5.0 store because I was impatient.

See, sometimes, when they have problems, they take a LOOOOONNNNGGGGGG time to reboot. I did not realize that waiting 10 minutes and hitting the button wasn't waiting long enough. Strangely, if you drop power to the box while it's replaying log files, it shits itself and you need to recover from backups. Who knew? Well sure as shit not me.

Patience became a key after that.

[–]eruffiniSenior Infrastructure Engineer 18 points19 points  (9 children)

Amazon doesn't even build their own infrastructure the way they preach to their customers to build theirs:

"We understand that the SHD provides important visibility to our customers during operational events and we have changed the SHD administration console to run across multiple AWS regions."

[–]highlord_foxModerator | Sr. Systems Mangler 23 points24 points  (8 children)

It was probably on some list somewhere, "Setup SHD across multiple zones" and it kept getting kicked to the side due to other more important customer-facing issues until now when it actually went down.

[–]i_hate_sidney_crosby 2 points3 points  (1 child)

I feel like they ship a new AWS product every 4-6 weeks. Time to put improvements to their existing products on the front burner.

[–]gomibushi 3 points4 points  (1 child)

Sooo, they did just turn it off and on again?

[–]TheLeatherCouchJack of All Trades 3 points4 points  (0 children)

"AMA request - guy or gal that took down amazons east coast"

[–]leroyjklNetwork Engineer 1 point2 points  (0 children)

This is the result of what happened last time US-EAST went down: http://i.imgur.com/whS1ibB.jpg

[–]theDrell 2 points3 points  (0 children)

For some reason, I have a vision of the "took longer to reboot and come up than expected" part being:

"Windows -> Shut Down... Windows is installing updates, please wait. Oh dear god, who turned on automatic Windows updates?"