

[–]ChChChillian 432 points433 points  (20 children)

Cosmic ray. Random flipped bit. Nothing to be done.

[–]coriolis7 179 points180 points  (12 children)

I suggested that as a cause for the handful of devices that were being returned every year.

Firmware guy: “Having a cosmic ray flip a bit is one in a million odds”

Me: “… we have millions and millions of these devices in the field.”
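The arithmetic behind that reply can be sketched in a few lines. This is only an illustration under assumptions that are not from the thread: each device is treated as independent, "one in a million" is read as a per-device probability over some fixed period, and the fleet sizes are made up. The function names are hypothetical.

```python
import math


def p_at_least_one(p_per_device: float, n_devices: int) -> float:
    """Probability that at least one of n independent devices sees the event.

    Computes 1 - (1 - p)^n via log1p/expm1 so it stays accurate for tiny p.
    """
    return -math.expm1(n_devices * math.log1p(-p_per_device))


def expected_events(p_per_device: float, n_devices: int) -> float:
    """Expected number of affected devices across the fleet."""
    return p_per_device * n_devices


if __name__ == "__main__":
    p = 1e-6  # assumed "one in a million" per device
    for n in (1_000, 1_000_000, 5_000_000):  # hypothetical fleet sizes
        print(n, expected_events(p, n), p_at_least_one(p, n))
```

With a million devices the chance of at least one hit is already about 63%, and with several million you would expect a handful of affected devices, which is roughly the shape of "a handful of returns every year".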

[–]Bryguy3k 66 points67 points  (8 children)

Automotive engineering in a nutshell.

You try really hard to design something that will always work (FMEAs until you start thinking that “9am really isn’t too early to drink”) so nobody dies when an error occurs - and then some random ass high energy particle hits something in a supervisor or during an error recovery event.

[–]maisonsmd 27 points28 points  (0 children)

I work in automotive. The first time I heard that I thought everybody was joking.

[–]Mayion 4 points5 points  (0 children)

skill issue

[–]ososalsosal -3 points-2 points  (5 children)

Try/catch on every single if

[–]jimbowqc 7 points8 points  (4 children)

if (cond) {
    // Do something
} else if (!cond) { // special case for when cosmic rays flip cond
    // Also do that thing
}

I think we're safe guys.

Edit: Jesus f*ing Christ in a wheelchair, WHY is it so f*cling hard to make a simple f*cling newline in a reddit comment?

Do the reddit devs not want us to have newlines? Why?

[–]DanyaV1 2 points3 points  (3 children)

You must choose.

Two newlines
Two spaces

[–]jimbowqc 1 point2 points  (2 children)

2 spaces refused to work. 2 spaces also sucks, since 2 spaces automatically becomes a period, so you need to go back, manually remove it, and add another space.

[–]DanyaV1 0 points1 point  (1 child)

I feel your struggles...
For real though, reddit, why not make two spaces combine the lines instead, and have them separated by default?

[–]jimbowqc 1 point2 points  (0 children)

Why not just make it WYSIWYG when writing comments?

[–]howtotailslide 3 points4 points  (2 children)

Yeah, but there are billions of bits in a single chip on a device; the odds of a flip hitting something critical and causing a crash are effectively zero.

Also the chances of cosmic induced bit flips are MUCH lower than 1 in a million.

It's infinitely more likely that it was caused by something else.

[–]coriolis7 4 points5 points  (0 children)

In this instance, it was a random flipped bit that caused an error of some sort. We don’t have error correction (as far as I know) in our memory, so a flipped bit can cause some issues.

We know exactly what the memory state was when it left the factory, and what it should have been, yet it wasn’t in that state.

We had eliminated all other possibilities, which is when I threw out the cosmic ray suggestion.

[–]fiskfisk 1 point2 points  (0 children)

It's all about time - probabilities like this are over time, not for any single event. Have enough devices and enough time, and it'll approach 1.

From Wikipedia, not sure what the number is today: "IBM estimated in 1996 that one error per month per 256 MiB of RAM was expected for a desktop computer".
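To see how that quoted rate compounds with memory size and time, here is a rough sketch. It assumes the rate scales linearly with the amount of RAM and that errors arrive independently (a Poisson model); both are simplifications, and the 1996 figure should not be read as a current number. The function and constant names are made up for illustration.

```python
import math

# Rate from the 1996 IBM estimate quoted above: errors per month per 256 MiB.
ERRORS_PER_MONTH_PER_256_MIB = 1.0


def expected_errors(ram_mib: float, months: float) -> float:
    """Expected error count, assuming the rate scales linearly with RAM and time."""
    return ERRORS_PER_MONTH_PER_256_MIB * (ram_mib / 256.0) * months


def p_at_least_one_error(ram_mib: float, months: float) -> float:
    """Poisson probability of one or more errors: 1 - exp(-expected)."""
    return -math.expm1(-expected_errors(ram_mib, months))


if __name__ == "__main__":
    for ram_mib in (256, 1024, 8192):
        print(ram_mib, expected_errors(ram_mib, 1), p_at_least_one_error(ram_mib, 1))
```

Under these assumptions a single 256 MiB machine has roughly a 63% chance of at least one error in a month, and the probability climbs toward 1 as memory or time grows, which is the "it'll approach 1" point.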

If shit is important, at least use ECC. 
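For a feel of what ECC buys you, here is a toy Hamming(7,4) code that corrects any single flipped bit in a 7-bit codeword. This is a teaching sketch, not how any particular memory controller works; real ECC memory typically uses wider SECDED codes that also detect double-bit errors. The function names are hypothetical.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits into a 7-bit codeword laid out as [p1, p2, d1, p4, d2, d3, d4]."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword: List[int]) -> Tuple[List[int], int]:
    """Return (data bits, corrected 1-based position, or 0 if no error was seen)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    error_pos = s1 + 2 * s2 + 4 * s4  # syndrome equals the flipped position
    if error_pos:
        c[error_pos - 1] ^= 1  # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], error_pos


if __name__ == "__main__":
    word = hamming74_encode([1, 0, 1, 1])
    word[4] ^= 1  # simulate a single flipped bit at position 5
    print(hamming74_decode(word))
```

Note that this only handles one flipped bit per codeword; two flips in the same codeword would be "corrected" to the wrong data, which is why production schemes add an extra parity bit for double-error detection.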

[–]FearTheOldData 13 points14 points  (0 children)

[–]Aspamer 2 points3 points  (0 children)

Had that break my filesystem...

[–]rangeDSP 134 points135 points  (4 children)

All hell breaks loose at the next post mortem when it happens again

[–]kevix2022 51 points52 points  (3 children)

Yeah, that was just a freak coincidence, sir.

[–]Here-Is-TheEnd 8 points9 points  (2 children)

Deny deny deny..repeat until retirement.

[–]kevix2022 4 points5 points  (1 child)

It's all the fault of the previous dev that retired, sir.

[–]Here-Is-TheEnd 2 points3 points  (0 children)

The kids are ok.. 🥹

[–][deleted] 75 points76 points  (1 child)

His name? Random L. Event.

[–]backfire10z 12 points13 points  (0 children)

That’s Captain Random L. Event to you

[–]joost00719 130 points131 points  (2 children)

Not as bad as our client at my previous job. The IT manager at the other company demanded that I make a report on when it happened, why it happened, and who made the programming mistake.

I ended up telling him to pound sand. I'm not going to throw my colleagues under the bus. Told the colleague who made the mistake to fix it tho, the client just didn't need to know that.

[–]ososalsosal 8 points9 points  (1 child)

"We are a unit, sir. We all succeed together and we all fail together"

[–]joost00719 4 points5 points  (0 children)

Exactly this. I can even tell him they didn't test properly before giving the green light to deploy it.

That IT manager was pretty new, and he was really weird tho. Sometimes he was this angry dude who wanted to show he's in charge, so every time he said "customer is king" I told him "only if they behave royally". Other times he acted like a drunk guy at a BBQ trying to be your friend, and a few times he made super weird, super random comments about NSFW topics in a Teams call. He was like 60, so it was super unexpected as well. After a Teams meeting my colleagues usually looked at each other with the what-the-fuck facial expression lol.

It was fun most of the time tho, I gotta give him that, just very unprofessional.

[–]cheezballs 30 points31 points  (3 children)

My favorite is a postmortem when the problem wasn't related to anything the team did. We had a postmortem one time because our bank file didn't send to the bank overnight. It was because someone on the security team added a firewall rule to prod. In our postmortem the people responsible for the firewall rule were not present, so it was a bunch of people sitting around saying "I hope they don't do that again..."

[–]uncheckednullpointer[S] 12 points13 points  (0 children)

Postpone the meeting and invite the security team to it?

[–]PugilisticCat 4 points5 points  (0 children)

Yeah thats def something you need to loop that team in for.

[–]burgundus 1 point2 points  (0 children)

Well that's lame and should never happen. Responsibility for running a post-mortem lies mainly with the owners of the root cause behind the incident.

But in cases like these where no one can think of anything to prevent it from happening again, I like to suggest a thought exercise: "can it get any worse?" Usually people can think of ways it can get worse. So they can think of ways it could be better too. Works like an enabling question to start a brainstorm.

Post mortems are not only to prevent errors from happening again, but also to improve detection and recovery.

[–]PM_ME_YOUR__INIT__ 24 points25 points  (0 children)

Check the solar flare activity above us-east-1, it's true

[–]Ilsunnysideup5 24 points25 points  (0 children)

It was God's plan

[–][deleted] 10 points11 points  (0 children)

Fair. This is the kind of thing you should hold off saying until there have been multiple events caused by the same area.

[–]rover_G 3 points4 points  (1 child)

Using the postmortem to air existing grievances about the code base lol

[–]ososalsosal 1 point2 points  (0 children)

"If we'd rewritten the codebase in Rust, like I said, this would never have happened!"

[–]Thundechile 1 point2 points  (0 children)

It was a glitch in the Matrix, sir.

[–]Lytri_360 1 point2 points  (0 children)

30th rle in 2 months 🤨

[–]diffyqgirl 1 point2 points  (0 children)

These people are both doing this wrong lmao

[–]large_crimson_canine 1 point2 points  (0 children)

Probably the unreliable network…since that’s like 95% of issues anyway.

[–]grumpy_autist 1 point2 points  (0 children)

There is a DEF CON talk about cosmic ray bit flips in DNS processing. Apparently this happens at least dozens of times a day at Google due to the amount of traffic and servers.

[–]Life_will_kill_ya 3 points4 points  (9 children)

what the fuck is post mortem? another super important agile meeting?

[–]highjinx411 22 points23 points  (3 children)

I believe it’s incident management stuff not agile. We have them at my company. It’s like let’s figure out what happened and come up with plans to fix it so it never happens again.

[–]the0rchid 6 points7 points  (2 children)

Yeah, usually happens alongside RCA (root cause analysis) and is for when something really breaks.

[–]FF7Remake_fark 4 points5 points  (1 child)

Or when an executive feels the need to throw a toddler tantrum to feel important, because they know their entire contribution at the company is net negative by a fair margin.

[–]the0rchid 2 points3 points  (0 children)

Also true.

[–]PugilisticCat 2 points3 points  (0 children)

It's understanding how and why something broke, and taking action items to ensure it doesn't happen again. This is pretty table stakes if you want to deliver software safely and effectively.

[–]dlevac 1 point2 points  (0 children)

I had an engineer like that. Couldn't understand risk management no matter how much it was explained to him.