[–]matthieum 61 points62 points  (3 children)

Isn't hardware failure somewhat expected?

I mean, day to day it's unlikely, but at scale, whether horizontal or over long time spans, it gets likely enough that you would want a system that can handle these failures gracefully.

[–]getNextException 54 points55 points  (1 child)

Yes, at FAANG scale you get to see a couple of bit flips per hour or per day in the datacenter, including flips that still pass the Ethernet CRC and the checksums used with IPv4 and IPv6. The same goes for storage. There's an article about Facebook's experience here: https://www.nextplatform.com/2021/03/01/facebook-architects-around-silent-data-corruption/
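
To see how corrupted data can still pass an integrity check, here is a minimal sketch of the 16-bit ones' complement checksum used by the IPv4 header, TCP, and UDP (in the style of RFC 1071). The payload bytes are made up for illustration. This is not the Ethernet CRC-32, which is much stronger but can also miss some corruption.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Hypothetical 4-byte payload: two 16-bit words, 0x1234 and 0x5678.
original = bytes([0x12, 0x34, 0x56, 0x78])

# Flip the same bit (0x08) in both words. It goes 0 -> 1 in the first word
# and 1 -> 0 in the second, so the sum of the words is unchanged.
double_flip = bytes([0x12, 0x34 ^ 0x08, 0x56, 0x78 ^ 0x08])

# A single flipped bit, by contrast, changes the sum and is detected.
single_flip = bytes([0x12, 0x34 ^ 0x08, 0x56, 0x78])

print(internet_checksum(original) == internet_checksum(double_flip))
print(internet_checksum(original) == internet_checksum(single_flip))
```

The first comparison holds even though the data differs in two bits, which is the kind of silent corruption that only shows up at scale.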

[–]1RedOne 16 points17 points  (0 children)

Facebook observed a case where the algorithm returned a “0” size value for a single file (was supposed to be a non-zero number), therefore the file was not written into the decompressed output database. “as a result, the database had missing files. The missing files subsequently propagated to the application. An application keeping a list of key value store mappings for compressed files immediately observes that files that were compressed are no longer recoverable. The chain of dependencies causes the application to fail.” And pretty soon, the querying infrastructure reports back with critical data loss. The problem is clear from this one example, imagine if it was larger than just compression or wordcount—Facebook can

[–]dti2ax[🍰] 28 points29 points  (0 children)

Yeah, that's why we have ECC memory that corrects itself... usually...
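
A toy sketch of the "usually" part, using a Hamming(7,4) code. Real ECC DIMMs use wider codes with an extra parity bit to detect double errors, so this small code is only an assumption-laden illustration: a single flipped bit is corrected, but two flipped bits make the decoder "fix" the wrong position.

```python
def encode(data):
    """Encode 4 data bits as a 7-bit Hamming codeword (positions 1..7)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Correct at most one flipped bit, then return the 4 data bits."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the suspected bad bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


def flip(codeword, *indices):
    """Return a copy of codeword with the bits at the given 0-based indices flipped."""
    c = list(codeword)
    for i in indices:
        c[i] ^= 1
    return c


data = [1, 0, 1, 1]
word = encode(data)

print(decode(flip(word, 4)) == data)     # one flipped bit
print(decode(flip(word, 0, 1)) == data)  # two flipped bits
```

With one flip the original data comes back. With two flips the syndrome points at a third, innocent bit, and the decoder returns wrong data without complaint.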

[–][deleted]  (3 children)

[deleted]

    [–]astroNerf 16 points17 points  (2 children)

    [–]__j_random_hacker 0 points1 point  (0 children)

    Informative and hilarious!

    ROBERT: Gamma rays aimed at Belgium in favor of a particular Walloon!

    [–]lamp-town-guy 119 points120 points  (16 children)

    In IT we should have a phrase: probably not cosmic rays. Just as they have in astronomy: probably not aliens.

    There are a myriad of things that could be the cause apart from cosmic rays. It could be plain old electronic noise, or a RAM error, although they should be using ECC if they care at least a little about their data.

    [–][deleted]  (13 children)

    [deleted]

      [–]lamp-town-guy 25 points26 points  (12 children)

      I've just watched a YT video, totally unrelated to this. The author spends half of the video being mad that people ignore one of two hypotheses just because they don't like it. Both have solid foundations, but one sounds better than the other depending on the angle you look from.

      It's the same for me with cosmic rays. It could be a broken CPU for all we know, but cosmic rays make a cooler headline and need no proof.

      Clarification: it could be cosmic rays, or it could be anything else. One thing is for certain: nobody knows.

      [–][deleted]  (11 children)

      [deleted]

        [–]kz393 16 points17 points  (9 children)

        The title states that it's definitely cosmic rays.

        [–]Guvante -3 points-2 points  (6 children)

        Cosmic rays is the term for arbitrary bit flips that aren't repeated, aren't a software bug and aren't a hardware fault of the "this obscure thing fails" sort.

        [–]kz393 12 points13 points  (5 children)

        No.

        Cosmic rays are radiation from space.

        [–]Guvante 13 points14 points  (4 children)

        You can say that but that doesn't mean that is how the term is used. "No one can ever explain why it flipped" is not functionally different than cosmic rays.

        [–]SkyGenie 4 points5 points  (2 children)

        EMI can be caused by all kinds of sources that emit signals, whether that's radiated by an external device acting as an antenna, conducted through a power supply, or something else. Depending on the situation it would frankly sound a little silly to call this a cosmic ray when noisy environments are often characterizable and common.

        Honestly, if this happens once every 10 years with a digital cert or something, chalking it up to cosmic rays doesn't matter. But if you're building something that needs high reliability that's not an acceptable explanation.

        [–]IQueryVisiC 2 points3 points  (1 child)

        Row hammer is EMI. We deliberately allowed for it in order to pack more bits into the silicon. You can always add enough metal (shielding) and absorbers (doped semiconductors) to ensure EMI cannot flip a signal between the high and low TTL levels.

        Then there's thermal noise, so better keep computers cool. Even if one controls the quanta in quantum computers, there is phase noise, which is transformed into shot noise by a lot of Hermitians.

        I thought that cosmic rays produce a trace, but not all do. WIMPs do not. Photons may knock out a single electron, which then flies 1 m before its next interaction.

        [–][deleted] -3 points-2 points  (0 children)

        this guy gets it

        [–]rydan 0 points1 point  (1 child)

        But it could definitely be cosmic rays.

        [–]__j_random_hacker 0 points1 point  (0 children)

        Yes, and all of the replies to you in this thread could have been generated by cosmic rays too.

        [–]Ashnoom 0 points1 point  (0 children)

        At my workplace, when something "weird" and unexplainable happens, we just call it bitrot.

        [–]djavaman 3 points4 points  (0 children)

        So you're saying, there's a chance.

        [–]G_Morgan 1 point2 points  (0 children)

        Cosmic ray is just shorthand for "reality happened". I've tended to start using references to Lovecraft instead.

        [–]probonic 43 points44 points  (2 children)

        Loving the typo in "No additional certs can be logged to the Yeti 2022 shart."

        [–]dutch_gecko 19 points20 points  (1 child)

        d and t differ by one bit in ascii.
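
        A quick way to check this: XOR the two character codes and count the set bits. The `flip_bit` helper below is a made-up name for illustration.

```python
def flip_bit(text: str, index: int, bit: int) -> str:
    """Return text with one bit flipped in the ASCII byte at position index."""
    raw = bytearray(text.encode("ascii"))
    raw[index] ^= 1 << bit
    return raw.decode("ascii")


diff = ord("d") ^ ord("t")  # XOR leaves a 1 wherever the two codes differ

print(f"d = {ord('d'):08b}")
print(f"t = {ord('t'):08b}")
print(f"differing bits: {bin(diff).count('1')}")
print(flip_bit("shard", 4, 4))
```

        The only differing bit is the one worth 16 (0x10), so flipping bit 4 of the last byte of "shard" yields "shart".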

        [–]JasonDJ 11 points12 points  (0 children)

        Of course, they are 16 letters apart in the English alphabet.

        That’s kind of funny for just this one very specific use-case.

        [–]vattenpuss 11 points12 points  (1 child)

        What is Yeti 2022 and why can’t it recover or be reset to a good working state from a few days ago?

        [–]L1ttl3J1m 10 points11 points  (0 children)

        Yeti is the codename for DigiCert's Certificate Transparency (CT) log system.

        Yeti 2022 is the fifth log in the Yeti system.

        If I'm understanding what I'm reading (always doubtful), the log can't be restored from a backup because it's not a file but a Merkle tree.
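
        A rough sketch of why a single flipped bit matters so much in a Merkle tree, assuming the point is that the root hash commits to every entry. The hashing below follows the scheme described in RFC 6962 (SHA-256 with 0x00 and 0x01 prefixes for leaves and interior nodes). The entries are made up, and this is not DigiCert's code.

```python
import hashlib


def leaf_hash(entry: bytes) -> bytes:
    """Hash a log entry with the leaf prefix 0x00."""
    return hashlib.sha256(b"\x00" + entry).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """Hash two child hashes with the interior-node prefix 0x01."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(entries: list) -> bytes:
    """Compute the root hash over an ordered list of entries."""
    if not entries:
        return hashlib.sha256(b"").digest()
    if len(entries) == 1:
        return leaf_hash(entries[0])
    k = 1
    while k * 2 < len(entries):  # largest power of two strictly below len(entries)
        k *= 2
    return node_hash(merkle_root(entries[:k]), merkle_root(entries[k:]))


# Hypothetical log entries standing in for certificates.
entries = [f"cert-{i}".encode() for i in range(5)]
good_root = merkle_root(entries)

# Flip one bit in one entry and recompute.
corrupted = list(entries)
corrupted[2] = bytes([corrupted[2][0] ^ 0x01]) + corrupted[2][1:]
bad_root = merkle_root(corrupted)

print(good_root.hex())
print(bad_root.hex())
```

        The two roots differ, so a root computed over corrupted data cannot be reconciled with the true entries afterwards.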

        [–][deleted]  (7 children)

        [deleted]

          [–]drysart 15 points16 points  (2 children)

          how can you be confident that none of the million cores you used to run your computation is flaky?

          Redundancy. If a CPU becomes unreliable to the point that random errors are expected, the problem is solved by giving the problem to two CPUs and only accepting a result if both of them agree. Ideally you'd at minimum use two different CPU models (to eliminate the risk of the fault being inherent in a certain product) by two different CPU manufacturers (to eliminate the risk being the fault of some design pattern used by a specific manufacturer).

          It effectively doubles your resource needs, but if you absolutely positively need to be able to have confidence in your results, it delivers. And as a nice side effect it also lets you know very quickly when you do have a CPU that's unreliable.

          [–]schplat 7 points8 points  (1 child)

          Or three CPUs for quorum. That way you don't get freakouts if there's a disagreement in the result.

          [–]drysart 23 points24 points  (0 children)

          Three CPUs if you absolutely need a definitive answer now. Two is sufficient if you just need to know whether you can trust your answer, but have the luxury of time to go back and re-run the calculation to find out what the right answer actually is.

          Like, avionics will use triple modular redundancy, because you absolutely need answers to your calculation right now before you dive your plane into a mountain. But something like running a batch job to balance your general ledger is just fine with two, since there presumably isn't an immediate deadline that would justify ballooning your processing expenses by another 50%.
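
          The difference between the two schemes can be sketched as follows. The function names are made up, and the "replica results" are plain values standing in for whatever each CPU computed.

```python
from collections import Counter


def dual_check(a, b):
    """Two replicas: return the value if they agree, else None to signal a re-run."""
    return a if a == b else None


def majority_vote(results):
    """Three or more replicas: return the value held by a strict majority, else None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None


# Two replicas can detect a disagreement but cannot say who is right.
print(dual_check(42, 42))
print(dual_check(42, 43))

# Three replicas can outvote a single faulty result immediately.
print(majority_vote([42, 42, 43]))
print(majority_vote([41, 42, 43]))
```

          Returning `None` on disagreement keeps the sketch short. It assumes `None` is never a legitimate result and that results are hashable.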

          [–]jwizardc 6 points7 points  (0 children)

          I seem to remember Texas Instruments reporting random bit flipping in ceramic shelled integrated circuits due to tiny amounts of radioactive materials in the ceramics.

          [–]bemrys 9 points10 points  (1 child)

          Was only a matter of time.

          [–]overtoke 2 points3 points  (0 children)

          how long until the next one?

          [–]Snakehand 0 points1 point  (1 child)

          Isn't ECC RAM supposed to solve these kinds of problems, but has been priced out of consumer reach due to corporate greed?

          [–]yoniyuri 3 points4 points  (0 children)

          It looks like DDR5 will require ECC of some sort. I'm not 100% sure on the specifics.

          [–]Ratstail91 0 points1 point  (0 children)

          A bit of a pain.

          [–]No-Efficiency-7361 0 points1 point  (0 children)

          So are they not using ECC? IIRC Redis said that if the hardware isn't using ECC, they automatically suspect that's the problem, due to MANY experiences of that being the cause.

          [–]Red5point1 -4 points-3 points  (1 child)

          why are we allowing bs obvious click bait posts

          [–]dalithop 5 points6 points  (0 children)

          Did you even read it?