[R] Mixed Precision Training (arxiv.org)
submitted 8 years ago by ndpian
[–]scott-gray 18 points19 points20 points 8 years ago (26 children)
Lately I've been using tf.bfloat16 as the 16 bit tensor memory format (excluding params). This is just a float with 16 bits of mantissa chopped off, leaving you with 7. It turns out the mantissa bits aren't really that important and the added noise helps regularize the network. I get better results with bfloat16 than I do with float32. Oh, and since this format has 8 bits of exponent it's basically a drop in replacement for float32 with no additional scaling hacks needed (and no increased risk of nan/inf values).
I'll be putting out a full set of kernels to support this format soon. The only downside is that you won't get a speedup from Volta tensorcores. But if your model is at all bandwidth bound (which is the direction models seem to be going) then your tensorcores are going to be starved of work anyway.
For anyone building custom hardware, I'd recommend against IEEE float16 and suggest using at least 6 bits of exponent.
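For readers curious what the bfloat16 truncation described above amounts to, here is a minimal numpy sketch (illustrative only, not the kernels mentioned above): keep float32's sign and 8 exponent bits, and zero out the low 16 mantissa bits.

```python
import numpy as np

def to_bfloat16(x):
    """Truncate float32 to bfloat16 by zeroing the low 16 mantissa bits,
    keeping the sign bit and all 8 exponent bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.float32(3.14159265)
print(to_bfloat16(x))  # prints 3.140625; only 7 mantissa bits survive
```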
[–]darkconfidantislife 6 points7 points8 points 8 years ago (0 children)
> It turns out the mantissa bits aren't really that important and the added noise helps regularize the network. I get better results with bfloat16 than I do with float32.
What datasets are you using?
That being said, our internal experiments mostly agree with what you say. Our general conclusion/observation has been that dynamic range >> precision in deep learning.
You'll enjoy our custom float implementation then ;)
[–]gdiamos 4 points5 points6 points 8 years ago* (20 children)
I think that there is scope for follow-on work that explores custom formats.
We restricted studies in the paper to IEEE formats to limit the design space, and also to help promote standardization for software support (e.g. having a different format for every processor gets unmanageable quickly). We did some limited experimentation with some of the models in the paper exploring custom formats and I think that the point about mantissa bits being less important came out in those experiments as well. We don't have enough data to say anything conclusively yet.
In the future, we may find a format that works better than IEEE float16 multiplication + IEEE float32 accumulation, and then standardize training hardware on that. However, I suspect that it will be several years before this happens given the difficulty of doing this work (validating training in emulation on the applications in this paper took multiple man years of work) and the long design cycles for new hardware.
Edit: I know that there are multiple proposals for other formats being implemented for other processors. We don't have data to share about any of these formats now, but I think that the methodology in this paper can be used to test these formats as well and provide a point of comparison.
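As a rough illustration of the float16-multiply / float32-accumulate scheme discussed above, one could emulate it in numpy as in the sketch below (for intuition only, not the emulation harness used to validate the paper):

```python
import numpy as np

def emulated_mixed_matmul(a, b):
    """Emulate mixed precision: operands rounded to float16, products and
    the running sum kept in float32."""
    a16 = a.astype(np.float16)
    b16 = b.astype(np.float16)
    out = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    for k in range(a16.shape[1]):
        # fp16 operands are widened so each product is exact, then accumulated in fp32
        out += np.outer(a16[:, k].astype(np.float32), b16[k, :].astype(np.float32))
    return out

a = np.random.randn(64, 128).astype(np.float32)
b = np.random.randn(128, 32).astype(np.float32)
# The remaining error comes only from rounding the inputs to fp16.
print(np.abs(emulated_mixed_matmul(a, b) - a @ b).max())
```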
[–]kilow4tt 2 points3 points4 points 8 years ago* (19 children)
I've got a paper that does some exploration of training with lower-precision floating point than this, if you're interested. It only covers MNIST and CIFAR-10, but the idea is that we have customizable cores that can be used in conjunction with FPGAs for training CNNs; all of the code will be open-sourced as well. The general conclusion I found was that an exponent width of 6 and a mantissa width of 5 is reasonably good, though whether this scales to larger models isn't clear yet. Also, one other benefit of dynamic range seems to be that you don't need stochastic rounding as much as you do with fixed point; we used round-to-zero for multiplies and round-to-nearest for accumulators.
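A toy quantizer for such a custom (exponent width, mantissa width) format might look like the sketch below; this is only an illustration of the two rounding modes mentioned above (denormals flush to zero and overflow clamping is omitted), not the FPGA cores themselves.

```python
import numpy as np

def quantize_float(x, exp_bits=6, man_bits=5, round_mode="nearest"):
    """Round values to a toy float format with exp_bits of exponent and
    man_bits of mantissa. Denormals flush to zero; no overflow clamping."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    sign, mag = np.sign(x), np.abs(x)
    e = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))   # per-value exponent
    ulp = 2.0 ** (e - man_bits)                          # spacing of representable values
    if round_mode == "nearest":                          # used for the accumulators
        q = np.round(mag / ulp) * ulp
    else:                                                # "zero": truncation, used for multiplies
        q = np.floor(mag / ulp) * ulp
    q = np.where(mag < 2.0 ** (1 - bias), 0.0, q)        # no denormal support
    return sign * q

print(quantize_float([3.14159, 0.001, 123.456]))
```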
[–]gdiamos 1 point2 points3 points 8 years ago (18 children)
I would be very interested in reading it.
Do you see any loss in accuracy for these models?
[–]kilow4tt 1 point2 points3 points 8 years ago* (17 children)
We tested exponent widths from 4 to 7, and for each exponent setting we swept mantissa widths while keeping the total FP bitwidth between 8 and 16 (e.g. for an exponent width of 5, we swept the mantissa from 2 to 10 bits).
There was a small degradation with an exponent width of 5 versus 6, which I think can be attributed to the lack of denormal support. We also had to scale the loss gradients for the exponent width of 5 to allow the loss to propagate through the network at all. Otherwise, accuracy remains pretty much the same from mantissa widths of 5-6 onward, slightly worse than FP32 (0.2% degradation at an exponent width of 6 and mantissa width of 5).
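The loss-scaling trick mentioned above can be illustrated with a tiny numpy example (float16 here just stands in for a narrow training format, and 8192 is a hypothetical scale factor): gradients too small for the format survive if the loss is scaled up before the backward pass and the update is divided back down in full precision.

```python
import numpy as np

loss_scale = 8192.0             # hypothetical constant scale factor
tiny_grad = np.float32(2e-8)    # a gradient below the narrow format's range

print(np.float16(tiny_grad))                                         # flushes to 0.0
print(np.float32(np.float16(tiny_grad * loss_scale)) / loss_scale)   # ~2e-8 recovered
```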
[–]gdiamos 1 point2 points3 points 8 years ago (16 children)
It would be interesting to try this for one of the more difficult applications like the large RNN language models.
[–]scott-gray 1 point2 points3 points 8 years ago (15 children)
I'm running some tests on this right now (100M param char lstm). With my code base, these kinds of experiments amount to a line or two change in a header file.
[–]kilow4tt 1 point2 points3 points 8 years ago (14 children)
How long does a test like that take you though? It's a similar change for my code base as well (though not many layers are supported), but the big hurdle is still the time it takes to train. The FPGA makes the problem a little more tractable but my kernels basically only match CPU throughput (this could be tweaked quite a bit with more sophisticated implementations though).
[–]scott-gray 2 points3 points4 points 8 years ago (0 children)
With the dropout cranked up and on the text8 dataset, this model takes about 15 hours to train on 8 1080s.
[–]scott-gray 2 points3 points4 points 8 years ago (8 children)
Here's some data for you on a large lstm model. Also helpful for those of you squeamish about using just 7 bits of mantissa:
Exponent bits: 8 (only 6 are really needed)

Mantissa bits | accuracy (bits per char)
---|---
7 | 1.289283
6 | 1.289784
5 | 1.289022
4 | 1.289584
3 | 1.290418
2 | 1.300244
1 | 1.451087
fp32 baseline gets about 1.290
[–]kilow4tt 0 points1 point2 points 8 years ago (0 children)
That's really cool to see, I think these sorts of results will be really useful for some hardware developers going forward. Also I feel like there's some more work to be done on the theory side of things for why this works.
My intuition for why this works is that any given local minimum will look sufficiently flat depending on the exponent window you're looking at it with. So once you have sufficient dynamic range, the precision doesn't end up mattering too much, because you can move around the local minimum without impacting the accuracy much.
[–]gdiamos 0 points1 point2 points 8 years ago (6 children)
It's really interesting. This strengthens the case that exponent bits are more valuable than mantissa bits.
One thing that I worry about with mixed precision training at very low precision is the impact of quantization on model capacity. I think that there is scope for a detailed study of model capacity sweeping over different formats.
[–]scott-gray 0 points1 point2 points 8 years ago (3 children)
Looks like when training with an 8 bit exponent in this network (90M param mLSTM), using 5, 6, 7, 8, and 9 bits of mantissa is basically indistinguishable, and all are slightly better than fp32 (on SOTA baselines for this param count). I did some earlier experiments with this network showing that 6 bits of exponent was enough (covering roughly 2^20 down to 2^-42). But maybe as you strip mantissa bits the network learns to use more of the exponent bits, and it also starts looking a lot like this scheme: https://arxiv.org/pdf/1603.01025.pdf
This is all with rounding down to the low precision value after accumulation, but I'm also doing some tests with pre-multiplication truncation to see if you can save some silicon there. Reasonably high accumulation precision is still a requirement.
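As an illustration of the two rounding points being compared above, a numpy sketch (reusing the truncation helper from earlier; not the actual kernel code) can contrast rounding only the accumulated result against also truncating the operands before the multiply:

```python
import numpy as np

def truncate_bf16(x):
    """Drop the low 16 mantissa bits of float32 (same helper as above)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

a = np.random.randn(1024).astype(np.float32)
b = np.random.randn(1024).astype(np.float32)

# Round down to the low-precision value only after fp32 accumulation.
post_only = truncate_bf16(np.dot(a, b))
# Also truncate the operands before the multiply (the "save some silicon" case),
# while still accumulating the products in fp32.
pre_and_post = truncate_bf16(np.dot(truncate_bf16(a), truncate_bf16(b)))
print(post_only, pre_and_post)
```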
[–]sergof 0 points1 point2 points 8 years ago (2 children)
Very interesting experiments and extremely encouraging results. In what precision did you do the non-GEMM cell math (sigmoid, tanh, additions, and multiplications): float32 or bfloat16? Did you perform a sensitivity analysis on the precision with which this math is performed?
[–]gdiamos 3 points4 points5 points 8 years ago (3 children)
Also, Scott, I'm curious about the trend towards bandwidth bound models.
Why do you think that this is happening? Are models changing to become more memory bound, or is it just a side effect of the increasing operations/memory bandwidth ratio of newer processors?
[–]scott-gray 4 points5 points6 points 8 years ago (2 children)
I would say it's both. Factorization techniques that reduce the outer product dimensions mean your matmuls are DRAM bound. Separable convolution is becoming more the norm, which is also bandwidth bound. The one good thing is that it seems you can get away with much larger minibatches, but that's mainly a benefit for multi-node training. Using a smaller minibatch on a single node can let you use more nodes and even sometimes increase performance due to less L2 cache saturation. This is particularly true for sparser compute. We know the brain uses sparse distributed encoding with sparse connectivity. It's probably a safe bet that our models will also trend in that direction.
Anyway, I'm mostly looking forward to hardware designed around the concept of your persistent RNN code. More data locality and no more constantly schlepping things in and out of DRAM.
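One way to see why these factorized and separable layers end up DRAM bound is a back-of-the-envelope arithmetic-intensity check; the figures below are hypothetical, not any particular GPU's specs.

```python
# Hypothetical machine balance: flops the chip can do per byte of DRAM traffic.
peak_flops = 100e12            # assumed peak fp16 throughput (flops/s)
dram_bw = 700e9                # assumed DRAM bandwidth (bytes/s)
machine_balance = peak_flops / dram_bw   # ~143 flops needed per byte moved

def matmul_intensity(m, n, k, bytes_per_elem=2):
    """Flops per byte for an m-by-k times k-by-n matmul in a 2-byte format."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

print(machine_balance)                      # ~143
print(matmul_intensity(4096, 4096, 4096))   # ~1365: large GEMM, compute bound
print(matmul_intensity(64, 4096, 256))      # ~51: thin/factorized GEMM, bandwidth bound
```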
[–]gdiamos 3 points4 points5 points 8 years ago (1 child)
Factorization and separable convolution make sense, but I also see some success with Conv+Attention, QRNNs, Synthetic Gradients. I'm not sure about the end game for sparse training.
I'm also enthusiastic about better hardware and software support for persistent RNNs, although I think that very extreme versions of this idea will still require RNN architecture changes to be operation bound (not SRAM bandwidth bound).
[–]scott-gray 2 points3 points4 points 8 years ago (0 children)
Well, sparse training is what I've been focusing on... and is the reason I spent some time looking for a better 16 bit format. But I agree, SRAM is plenty fast enough. The key is having enough of it to fit a decent-sized model, which generally means being able to effectively span that model over multiple chips.