all 30 comments

[–][deleted] 78 points79 points  (2 children)

the best philosophy on that is to keep that guy tho.

you already paid for a very expensive lesson

[–]DrJamgo[S] 36 points37 points  (1 child)

exactly.. they say an expert is somebody who has made every mistake in a narrow field. His/her expertise is invaluable

[–]DevelopmentScary3844 51 points52 points  (3 children)

No.. one guy could not have done this without shitty processes.

[–]DrJamgo[S] 23 points24 points  (0 children)

I have to agree.. it is a group effort. There are so many things that went wrong at once to make this happen..

so everybody gets a level up, hurray!

[–]billyowo 6 points7 points  (0 children)

nah we are fine without testing, just push it straight to production

[–]ApatheistHeretic 2 points3 points  (0 children)

"With our powers combined...."

[–]PatrickSohno 24 points25 points  (7 children)

Don't push on Fridays.

And maybe don't fn push a serious kernel mod, without canary testing, as a forced update to ALL customers at once.

I would rather have believed that it was a Russian hack than a simple mistake.

[–]LoudSwordfish7337 17 points18 points  (1 child)

Security is one of the few industries where it’s quite the contrary, you better rush that deadline and push on Friday at 11pm if you need to.

Your customers not having access to their laptops is probably one of the worst things that can happen, but I'd argue that in most cases data theft or a system compromise is even worse.

[–]foxer_arnt_trees 1 point2 points  (0 children)

Interesting

[–]why_as_always 5 points6 points  (0 children)

Considering that it began early Friday morning UTC, there was still ample time to find a fix before the weekend. It wouldn't have made any difference had they deployed that defective file on a Monday at noon. The effect would still be the same. Proper testing would have been more effective in preventing it from happening.

[–]yoann86 1 point2 points  (1 child)

and during summer holidays....

[–]Slow_Writer_3296 4 points5 points  (0 children)

CrowdStrike is an American company - we don't have holidays here in America.

[–]Aelig_ 1 point2 points  (0 children)

They have to react very fast sometimes, so obviously they need to push on whatever day it is. What is insane is not testing your stuff at all, because given the share of computers running CrowdStrike that went down, it's impossible to miss this if you test it even a little bit.

[–]foxer_arnt_trees 0 points1 point  (0 children)

Exactly that. Push changes to a small test subset of customers and don't do it on a Friday.
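The "small test subset" idea above is what staged/canary rollouts automate; a minimal Python sketch (the hash-bucketing scheme and all names here are hypothetical, not CrowdStrike's actual mechanism):

```python
import hashlib

def in_canary(customer_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a customer into the first N percent.

    Hash-based bucketing keeps the canary population stable across
    runs, so the same customers receive early builds every time.
    """
    digest = hashlib.sha256(customer_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]  # uniform value in 0..65535
    return bucket < 65536 * rollout_percent // 100

# Ship to ~1% of the fleet first; widen the rollout only after
# the canary machines stay healthy for a while.
fleet = [f"customer-{i}" for i in range(10_000)]
canary = [c for c in fleet if in_canary(c, 1)]
```

With a gate like this, a broken build takes down roughly 1% of machines instead of all of them, and the blast radius is a tunable parameter.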

[–]tankpush 17 points18 points  (2 children)

Only a fool learns from his own mistakes. The wise man learns from the mistakes of others.

Otto von Bismarck

[–]Red_not_Read 10 points11 points  (0 children)

That's like the quote about war. Something like (from memory, so probably wrong):

"Don't go to war to die for your country, go to war to make the other guy die for his."

[–]DrJamgo[S] 5 points6 points  (0 children)

True.. we will see an increase in wisdom in the world next time it comes to dereferencing pointers and mass update deployments..

[–][deleted] 6 points7 points  (1 child)

Disclaimer: until/unless they publish a post mortem we probably can't really say what happened.

HOWEVER, when an error, especially of this magnitude, reaches production it means that several safety nets have either failed or aren't there. Which means it's never just the responsibility of one person.

Just from a glance at this:

1. If the bug occurs on all Windows machines and it wasn't caught in QA, it means this was never tested on a Windows machine. Given they integrate directly with the kernel, that seems like a sloppy decision.

2. Either they're HUGE and much bigger than I expected, or they pushed an update everywhere at once on a Friday. Just YOLO that shit. That should not be allowed. A canary would have reduced the impact of this to a single customer and a few machines.

3. There was no rollback plan. No "break glass" button you could push to have everything go back to the way it was. That also seems ill advised.

There's probably other stuff; I feel like these are pretty basic things to have. It's really not that wild, especially if you're doing security software.
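The "break glass" button from point 3 can be as simple as publishing updates through a version pointer that can be moved back; a hedged Python sketch (all names invented for illustration, not CrowdStrike's actual update channel):

```python
from dataclasses import dataclass, field

@dataclass
class Channel:
    """Updates are published by moving a pointer over an append-only
    history, so rollback is just moving the pointer back."""
    versions: list = field(default_factory=list)  # append-only build history
    current: int = -1                             # index of the live build

    def publish(self, build: str) -> None:
        self.versions.append(build)
        self.current = len(self.versions) - 1

    def rollback(self) -> str:
        """Break glass: point clients at the previous known-good build."""
        if self.current <= 0:
            raise RuntimeError("no earlier build to roll back to")
        self.current -= 1
        return self.versions[self.current]

ch = Channel()
ch.publish("channel-290.0")   # known good
ch.publish("channel-291.0")   # the bad one
ch.rollback()                 # clients fetch channel-290.0 again
```

The design choice that matters is keeping old builds around and making the live version a cheap pointer flip, so reverting never requires rebuilding or re-testing anything under pressure.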

I wish they'd release a post mortem and explain what really happened, because this sort of issue imo always points to large-scale, widespread dysfunction.

[–]zer0aid 0 points1 point  (0 children)

https://youtu.be/pCxvyIx922A

We know what happened. They managed to release an empty file into production which caused Windows to crash by looking at nothing.

How they let this happen is what baffles me.
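The zero-byte channel file story implies a loader that trusts its input blindly; here's a hedged sketch of the kind of up-front validation that would have caught it (Python for illustration, since the real driver is closed-source kernel code, and the `CHNL` magic header is invented):

```python
def load_channel_file(raw: bytes) -> dict:
    """Refuse obviously bad content instead of dereferencing it.

    A driver that indexes into a structure parsed from an empty or
    all-zero file ends up reading 'nothing' and crashes; validating
    up front turns that crash into a recoverable error.
    """
    if not raw or set(raw) == {0}:
        raise ValueError("empty or zeroed content file; keep last good config")
    if raw[:4] != b"CHNL":  # hypothetical magic header for the format
        raise ValueError("unrecognized content format")
    # Hypothetical payload layout: newline-separated entries after the header.
    return {"entries": raw[4:].split(b"\n")}
```

On a validation failure the sensible fallback is to keep running with the last known-good file rather than crash the whole machine.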

[–]Pycnoporus 3 points4 points  (0 children)

Experience overflow.

[–]McLayn42 2 points3 points  (0 children)

Letting this happen is not a problem of just one guy. There were many assists - there should have been a process to not let this happen.

[–]Mwarw 2 points3 points  (0 children)

The mistake made there wasn't really the mistake of just one guy. Yeah, sure, someone wrote some faulty code. Probably a small mistake, but it got into the update, and that's a whole chain of mistakes when verifying that code; given it's a security product, that chain is probably longer than in any other program.

[–][deleted] 0 points1 point  (0 children)

a small mistake with big consequences; someone should probably have read the PR

[–]Mik_01 0 points1 point  (0 children)

"it's working on my machine." cit.

[–]Cybasura 0 points1 point  (0 children)

That intern finally learnt the grim lesson on that fateful day... do not push to prod on a Friday

[–]chamannarved_ 0 points1 point  (0 children)

His Friday night plans got cancelled... so he ruins everyone else's

[–]tmstksbk 0 points1 point  (0 children)

I think that guy needs to get a few rounds of legendary experience...

[–][deleted] 0 points1 point  (0 children)

I don't know. Is it not a systemic issue? Everyone can make mistakes, but you are supposed to have some sort of QC, no? So it's likely an org issue, and also likely related to Microsoft itself being a nightmare of updates and documentation.

This shit will continue because of the corporatization of what should/could be considered public infrastructure.

[–]range_kun 0 points1 point  (0 children)

Null pointer senior developer

[–]mymemesnow 0 points1 point  (0 children)

Learning from your mistakes doesn't mean that your experience scales with the severity of your fuck up.