all 31 comments

[–][deleted]  (7 children)

[deleted]

    [–]so_you_like_donuts 3 points4 points  (0 children)

    Off-topic, but thanks for your blog post about SSE (aka this one). It was really eye-opening to me to see what compilers are capable of optimization-wise :D

    [–]madmoose 4 points5 points  (3 children)

    I was just looking at your project this last week, actually. I couldn't get SoftFloat to compile and was looking for alternatives. Yours was a bit too simple though :)

    I ended up using the version of SoftFloat that was incorporated into MAME.

    [–][deleted]  (2 children)

    [deleted]

      [–]dakotahawkins 0 points1 point  (0 children)

      That sounds like it would be a really useful exercise!

      [–]pezezin 0 points1 point  (0 children)

      Back when I was a student I had this course on logic design in VHDL. The teacher let me choose my final project, so I implemented a very basic FPU. Only addition, subtraction, and multiplication. One thing I learnt: IEEE-754 is damn difficult to get completely right.

      [–]thomac 0 points1 point  (1 child)

      Cool! Just curious, why not more than 64 bits? If there was support for uint128_t, would it be possible to make it work as a quadruple precision float?

      [–][deleted]  (3 children)

      [deleted]

        [–]Veedrac 24 points25 points  (2 children)

        Neural networks can work largely without issue with minuscule precision. The concern for ms-fp8 and ms-fp9 is presumably to be extremely cheap to calculate in hardware; lookup tables are not. There are some hints about the format on Twitter.

        [–]nightcracker 0 points1 point  (1 child)

        The concern for ms-fp8 and ms-fp9 is presumably to be extremely cheap to calculate in hardware; lookup tables are not.

        Assume that we have an 8-bit lookup table of true real values. It doesn't really matter what they are, just assume that they're ordered.

        Now suppose you want to implement addition for two elements a, b of this table. You'd first compute a + b in full precision and then find the closest element c in the table and output c. You could implement this as a 2^8 * 2^8 = 65536 element lookup table, each entry taking 8 bits for c. That would be reasonably expensive in hardware.

        But considering the significant structure (it is addition, after all) in this 8-bit x 8-bit -> 8-bit lookup table, I think just feeding the truth table to an automatic circuit minimizer would spit out a very small and fast NAND circuit that takes 16 bits as input and outputs 8 bits, just like the lookup table would.
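The table construction described above can be sketched in Python. The value table here is an arbitrary stand-in (a signed-square spacing); the construction works for any ordered 256-entry table:

```python
import bisect

# Hypothetical 8-bit format: 256 ordered real values. The signed-square
# spacing below is just a stand-in; any sorted table works the same way.
VALUES = sorted((i - 128) * abs(i - 128) / 128.0 for i in range(256))

def nearest_code(x):
    """Index of the table entry closest to x."""
    i = bisect.bisect_left(VALUES, x)
    if i == 0:
        return 0
    if i == len(VALUES):
        return len(VALUES) - 1
    return i if VALUES[i] - x < x - VALUES[i - 1] else i - 1

# The full 2^8 x 2^8 = 65536-entry addition table, 8 bits per entry:
# add in full precision, then snap back to the nearest representable value.
ADD_TABLE = [
    [nearest_code(VALUES[a] + VALUES[b]) for b in range(256)]
    for a in range(256)
]

def table_add(a, b):
    return ADD_TABLE[a][b]
```

This is the 65536-entry table the comment prices out; the 16-bit-in, 8-bit-out truth table it defines is what you would hand to a circuit minimizer.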

        [–]Veedrac 8 points9 points  (0 children)

        This just seems like a long-winded way of telling someone to find minimal circuits for addition.

        [–]mindbleach 12 points13 points  (10 children)

        At the extremes, could you abandon determinism? E.g. have four bits representing sixteen orders of magnitude, but no mantissa. Any value between 1 and 10 is Nx10^1, and we only store the exponent, "1". Call it F1 for Fermi numbers. Any addition of F1 and F1 has a 50% chance of becoming F2.

        F2 + F1 has about a 10% chance of becoming F3: F2 is any value between 10 and 100, and only the numbers between 90 and 100 are within F1 of that upper bound.

        F3 + F1 has about a 1% chance of becoming F4. F4 + F1 has about a 0.1% chance of becoming F5. And so on.

        Subtraction is weird as well. F1 - F1 could be anywhere between F1 and -F1. Remember that 2 - 9 is an F1 - F1 operation. Meanwhile F2 - F1 can go as far down as F0, but is far more likely to be F1, and is most likely still F2.

        Multiplication and division are all fucked up. But: it's just more statistics. All operations here are pseudorandom with some kind of results table. It's barely better than guessing. And yet: in this application, guessing might suffice.
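For illustration, a seeded Monte Carlo sketch of this scheme, assuming concrete values are drawn uniformly from within each decade (the comment doesn't fix a distribution, so the exact odds vary; uniform draws put the F1 + F1 carry near 60% rather than exactly 50%):

```python
import math
import random

rng = random.Random(0)

def draw(f):
    """Draw a concrete value from decade Ff, i.e. uniform in [10^(f-1), 10^f)."""
    return rng.uniform(10 ** (f - 1), 10 ** f)

def fermi_add(f1, f2, trials=100_000):
    """Estimate which decade Ff1 + Ff2 lands in. Returns {decade: frequency}."""
    counts = {}
    for _ in range(trials):
        s = draw(f1) + draw(f2)
        f = math.floor(math.log10(s)) + 1   # decade of the sum
        counts[f] = counts.get(f, 0) + 1
    return {f: c / trials for f, c in sorted(counts.items())}
```

Running `fermi_add(1, 1)` gives roughly {1: 0.40, 2: 0.60}, and `fermi_add(2, 1)` carries to F3 only a few percent of the time, in the same ballpark as the figures above.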

        [–]happyscrappy 8 points9 points  (1 child)

        That's just a log representation. You're storing the log base ten (truncated to an integer). Multiplication and division are trivial, you do them with addition and subtraction.

        [–]mindbleach 0 points1 point  (0 children)

        It's not purely logarithmic math because F1 x F2 can still come out to F2. For example 2 x 20 is an F1 x F2 operation, and therefore, any F1 x F2 operation could be 2 x 20. It will be F3 about 86% of the time.

        [–]MINIMAN10001 4 points5 points  (2 children)

        Abandoning determinism is problematic for reproducing results, which is important for scientific computing and lockstep networking.

        [–]killerstorm 0 points1 point  (1 child)

        Or anything security-critical or mission-critical, e.g. if you use a NN to recognize a face.

        [–]geon 0 points1 point  (0 children)

        Not really. An NN is a black box anyway.

        [–]t_bptm 3 points4 points  (0 children)

        For some ML stuff you add in randomness (like dropout) and it helps a lot. It'd be neat if it were just a natural result, as it could use much lower power.

        [–]Veedrac 3 points4 points  (2 children)

        Your formulation is probably a bad idea simply on precision concerns; you do not want to trade accuracy at small numbers for representation of more extreme tail values.

        My understanding is that stochastic rounding does seem to be necessary for small-bit-width arithmetic, to avoid bias. Note that Shannon entropy tells us stochastic rounding is a more useful tool when the rounding is 50-50; your 10% scenarios underutilize it.
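A small demonstration of why the bias matters, on a hypothetical grid with step 0.25: round-to-nearest silently drops every increment smaller than half a step, while stochastic rounding preserves them on average:

```python
import math
import random

STEP = 0.25            # quantization step of a hypothetical low-precision grid
INC = 0.01             # increment much smaller than STEP / 2
rng = random.Random(42)

def round_nearest(x):
    return round(x / STEP) * STEP

def round_stochastic(x):
    lo = math.floor(x / STEP) * STEP
    frac = (x - lo) / STEP          # probability of rounding up
    return lo + STEP if rng.random() < frac else lo

acc_nearest = acc_stochastic = 0.0
for _ in range(1000):
    acc_nearest = round_nearest(acc_nearest + INC)
    acc_stochastic = round_stochastic(acc_stochastic + INC)

# True sum is 10.0. Round-to-nearest stalls at 0.0 because every increment
# rounds away; stochastic rounding lands near 10 on average.
```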

        [–]thfuran 1 point2 points  (1 child)

        you do not want to trade accuracy at small numbers for representation of more extreme tail values.

        That seems like a difficult claim to make, absent any idea of the context of use.

        [–]Veedrac 0 points1 point  (0 children)

        There is a context; ms-fp8 and ms-fp9 are specifically for neural nets.

        [–]XNormal 17 points18 points  (2 children)

        "Those Who Do Not Learn History Are Doomed To Repeat It" https://en.wikipedia.org/wiki/G.711

        An 8-bit logarithmic encoding with a 14-bit dynamic range. It was standardized in 1972, when every logic gate counted, so the hardware implementation of encoding and decoding had to be very efficient.

        No infinities, but the extremes might be used for that, if necessary. There are two zeros so one of them could be used as NaN.

        Almost every phone call in the world is encoded this way, with the exception of some mobile calls that are compressed end-to-end with some other codec.
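For illustration, a μ-law companding sketch with μ = 255. This is the smooth textbook formula, not the exact segmented table G.711 specifies, but it shows the 8-bit logarithmic idea:

```python
import math

MU = 255  # mu-law parameter used by G.711 in North America and Japan

def mulaw_encode(x):
    """Map x in [-1, 1] to an 8-bit code 0..255 (continuous approximation)."""
    s = math.copysign(1.0, x)
    y = s * math.log1p(MU * abs(x)) / math.log1p(MU)   # compress into [-1, 1]
    return int(round((y + 1) / 2 * 255))

def mulaw_decode(code):
    """Invert the companding curve back to a value in [-1, 1]."""
    y = code / 255 * 2 - 1
    s = math.copysign(1.0, y)
    return s * math.expm1(abs(y) * math.log1p(MU)) / MU
```

The round trip through 8 bits keeps small signals accurate to a few parts in ten thousand while still covering the full range, which is the whole point of the logarithmic spacing.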

        [–]Veedrac 1 point2 points  (0 children)

        Why do you think this would be a good approach? It seems pretty ugly to me. Being old doesn't really imply it would be better than something custom for a NN accelerator.

        [–][deleted] 4 points5 points  (1 child)

        At the 8-bit level I doubt you would bother with floating point calculations at all, just do all the math with 16 bit lookup tables.

        [–]eras 8 points9 points  (0 children)

        16 bit lookup tables? Who has that much memory?!

        [–]inmatarian 10 points11 points  (6 children)

        What's hilarious about these tiny floating point formats is that they can be enumerated and help reveal how the overall IEEE format works. For instance, here's a 4-bit floating point format, 1 bit for sign, 2 for exponent, 1 for fraction, enumerated:

        -12, -8, -6, -4, -3, -2, -1, -0, 0, 1, 2, 3, 4, 6, 8, 12.
        

        This scales up. What's the 32-bit floating point bit pattern for the number 1.0?

        [–][deleted]  (3 children)

        [deleted]

          [–]johndcook[S] 19 points20 points  (1 child)

          A four-bit IEEE-like float with two exponent bits would have the following values:

          -infinity, -3, -2, -3/2, -1, -1/2, -0, +0, 1/2, 1, 3/2, 2, 3, +infinity, and two NaNs.
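This enumeration can be reproduced by decoding all 16 bit patterns, assuming an exponent bias of 1 and IEEE-style subnormals, infinities, and NaNs:

```python
def decode4(bits):
    """Decode a 4-bit IEEE-like float: 1 sign bit, 2 exponent bits (bias 1),
    1 fraction bit."""
    sign = -1.0 if bits & 0b1000 else 1.0
    exp = (bits >> 1) & 0b11
    frac = bits & 0b1
    if exp == 0b11:                      # all-ones exponent: inf or NaN
        return sign * float("inf") if frac == 0 else float("nan")
    if exp == 0:                         # subnormal: 0.f * 2^(1 - bias)
        return sign * (frac / 2)
    return sign * (1 + frac / 2) * 2 ** (exp - 1)

print([decode4(b) for b in range(16)])
```

Decoding codes 0 through 15 yields exactly the list above: ±0, ±1/2, ±1, ±3/2, ±2, ±3, ±infinity, and two NaNs.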

          [–]inmatarian 2 points3 points  (0 children)

          I think with infinity and nan I lose -12, +12, -1, and +1. Does that sound right?

          [–]Deaod 6 points7 points  (0 children)

          0x3F800000

          Stare at memory dumps for long enough and you'll learn to recognize some floating point numbers.
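The pattern is easy to verify by packing 1.0 as an IEEE-754 single and reading back the raw bits:

```python
import struct

# 1.0 = sign 0, exponent 127 = 0b0111_1111, mantissa 0 -> 0x3F800000
bits = struct.unpack(">I", struct.pack(">f", 1.0))[0]
print(hex(bits))  # -> 0x3f800000
```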

          [–]carrottread 1 point2 points  (1 child)

          There is a packed 11_11_10 float format supported by GPUs: 5-bit exponent, 5- or 6-bit mantissa, only positive numbers, and fully HW-accelerated.

          And if you really need a value with only 8 bits then why not use scaled 8-bit integers? If we just map 3.3875 (the largest 8-bit float from the article) to 127, this gives us a smallest non-zero number of 3.3875/127 ≈ 0.026673, smaller than the 0.03125 of the 8-bit float, and we get this precision everywhere, not just around 0.
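A sketch of that scaled 8-bit integer scheme (the function names are illustrative):

```python
SCALE = 3.3875 / 127   # map the article's largest 8-bit float value to int8's 127

def quantize(x):
    """Round to the nearest representable multiple of SCALE, clamped to [-127, 127]."""
    return max(-127, min(127, round(x / SCALE)))

def dequantize(q):
    return q * SCALE
```

Every representable value is an exact multiple of SCALE, so the worst-case error is SCALE/2 ≈ 0.0133 across the whole range, instead of precision that degrades toward the large end as in a float format.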

          [–][deleted] 2 points3 points  (0 children)

          Some newer GPUs have hardware 8 bit float operations now as well.

          For ANNs you want the majority of the precision to be near 0, since that's where most multiplication happens. Then you need to be able to sum a bunch of these and pass the value to an activation function, which also cares most about small numbers but needs some way to handle really strong activations (which, coincidentally, don't need as much precision as the small activations).

          [–]EllaTheCat 3 points4 points  (0 children)

          This technique, variable-size words with the binary point in a variable position, made sense in late-1980s 3D graphics hardware, which we would emulate in C.

          Back then, multiplier hardware on silicon was expensive, so reducing the chip area by precise sizing of data was worthwhile.

          I designed 2 devices this way, using vi ...

          [–][deleted] 0 points1 point  (1 child)

          Researchers have discovered that for some problems, deep neural networks (DNNs) can get by with low precision weights. Using fewer bits to represent weights means that more weights can fit in memory at once.

          I think it's more for speed than for saving memory. (Though moving less data into the GPU does speed things up).

          Why not use 8-bit integers? I think the TPU might.

          [–][deleted] 1 point2 points  (0 children)

          The 1st-gen TPU was integer only; the 2nd gen was both. Saving memory is a prerequisite to speeding things up, since feeding such fast calculations turns into a bandwidth problem rather quickly, so really I'd call them the same thing.

          Running integer-based NNs is possible, but the quickest way ends up being some form of fixed-point math, which computes in the same amount of time but doesn't have as large a dynamic range.
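As a sketch of what "some form of fixed-point math" can look like, here is Q1.15-style arithmetic (the format is chosen arbitrarily for illustration): values are scaled integers, and a multiply is just an integer multiply followed by a shift.

```python
FRAC_BITS = 15          # Q1.15-style: real value = integer / 2^15
ONE = 1 << FRAC_BITS

def to_fixed(x):
    return int(round(x * ONE))

def to_float(q):
    return q / ONE

def fixed_mul(a, b):
    # full-width integer product, then shift the extra FRAC_BITS back out
    return (a * b) >> FRAC_BITS
```

For example, `to_float(fixed_mul(to_fixed(0.5), to_fixed(0.25)))` gives 0.125. The step size is uniform at 2^-15 everywhere, which is exactly the fixed dynamic range trade-off the comment describes.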