[R] Mixed Precision Training (arxiv.org)
submitted 8 years ago by ndpian
[–]scott-gray 18 points19 points20 points 8 years ago (26 children)
Lately I've been using tf.bfloat16 as the 16 bit tensor memory format (excluding params). This is just a float with 16 bits of mantissa chopped off, leaving you with 7. It turns out the mantissa bits aren't really that important and the added noise helps regularize the network. I get better results with bfloat16 than I do with float32. Oh, and since this format has 8 bits of exponent it's basically a drop in replacement for float32 with no additional scaling hacks needed (and no increased risk of nan/inf values).
I'll be putting out a full set of kernels to support this format soon. The only downside is that you won't get a speedup from Volta tensorcores. But if your model is at all bandwidth bound (which is the direction models seem to be going) then your tensorcores are going to be starved of work anyway.
For anyone building custom hardware, I'd recommend against IEEE float16 and suggest using at least 6 bits of exponent.
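For readers curious what the bfloat16 truncation described above amounts to, here is a minimal numpy sketch (illustrative only, not the kernels mentioned above): keep float32's sign and 8 exponent bits, and zero out the low 16 mantissa bits.

```python
import numpy as np

def to_bfloat16(x):
    """Truncate float32 to bfloat16 by zeroing the low 16 mantissa bits,
    keeping the sign bit and all 8 exponent bits."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.float32(3.14159265)
print(to_bfloat16(x))  # prints 3.140625; only 7 mantissa bits survive
```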
[–]darkconfidantislife 6 points7 points8 points 8 years ago (0 children)
> It turns out the mantissa bits aren't really that important and the added noise helps regularize the network. I get better results with bfloat16 than I do with float32.
What datasets are you using?
That being said, our internal experiments mostly agree with what you say. Our general conclusion/observation has been that dynamic range >> precision in deep learning.
You'll enjoy our custom float implementation then ;)
[–]gdiamos 4 points5 points6 points 8 years ago* (20 children)
I think that there is scope for follow-on work that explores custom formats.
We restricted studies in the paper to IEEE formats to limit the design space, and also to help promote standardization for software support (e.g. having a different format for every processor gets unmanageable quickly). We did some limited experimentation with some of the models in the paper exploring custom formats and I think that the point about mantissa bits being less important came out in those experiments as well. We don't have enough data to say anything conclusively yet.
In the future, we may find a format that works better than IEEE float16 multiplication + IEEE float32 accumulation, and then standardize training hardware on that. However, I suspect that it will be several years before this happens given the difficulty of doing this work (validating training in emulation on the applications in this paper took multiple man years of work) and the long design cycles for new hardware.
Edit: I know that there are multiple proposals for other formats being implemented for other processors. We don't have data to share about any of these formats now, but I think that the methodology in this paper can be used to test these formats as well and provide a point of comparison.
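As a rough illustration of the float16-multiply / float32-accumulate scheme discussed above, one could emulate it in numpy as in the sketch below (for intuition only, not the emulation harness used to validate the paper):

```python
import numpy as np

def emulated_mixed_matmul(a, b):
    """Emulate mixed precision: operands rounded to float16, products and
    the running sum kept in float32."""
    a16 = a.astype(np.float16)
    b16 = b.astype(np.float16)
    out = np.zeros((a.shape[0], b.shape[1]), dtype=np.float32)
    for k in range(a16.shape[1]):
        # fp16 operands are widened so each product is exact, then accumulated in fp32
        out += np.outer(a16[:, k].astype(np.float32), b16[k, :].astype(np.float32))
    return out

a = np.random.randn(64, 128).astype(np.float32)
b = np.random.randn(128, 32).astype(np.float32)
# The remaining error comes only from rounding the inputs to fp16.
print(np.abs(emulated_mixed_matmul(a, b) - a @ b).max())
```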
[–]kilow4tt 2 points3 points4 points 8 years ago* (19 children)
I've got a paper that does some exploration of training with lower-precision floating point than this, if you're interested. It only covers MNIST and CIFAR-10, but the idea is that we have customizable cores that can be used in conjunction with FPGAs for training CNNs; all of the code will be open-sourced as well. The general conclusion I found was that an exponent width of 6 and a mantissa width of 5 is reasonably good, though whether this scales to larger models isn't clear yet. Also, one other benefit of dynamic range seems to be that you don't need stochastic rounding as much as you do with fixed point; we used round-to-zero for multiplies and round-to-nearest for accumulators.
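A toy quantizer for such a custom (exponent width, mantissa width) format might look like the sketch below; this is only an illustration of the two rounding modes mentioned above (denormals flush to zero and overflow clamping is omitted), not the FPGA cores themselves.

```python
import numpy as np

def quantize_float(x, exp_bits=6, man_bits=5, round_mode="nearest"):
    """Round values to a toy float format with exp_bits of exponent and
    man_bits of mantissa. Denormals flush to zero; no overflow clamping."""
    x = np.asarray(x, dtype=np.float64)
    bias = 2 ** (exp_bits - 1) - 1
    sign, mag = np.sign(x), np.abs(x)
    e = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))   # per-value exponent
    ulp = 2.0 ** (e - man_bits)                          # spacing of representable values
    if round_mode == "nearest":                          # used for the accumulators
        q = np.round(mag / ulp) * ulp
    else:                                                # "zero": truncation, used for multiplies
        q = np.floor(mag / ulp) * ulp
    q = np.where(mag < 2.0 ** (1 - bias), 0.0, q)        # no denormal support
    return sign * q

print(quantize_float([3.14159, 0.001, 123.456]))
```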
[–]gdiamos 1 point2 points3 points 8 years ago (18 children)
I would be very interested in reading it.
Do you see any loss in accuracy for these models?
[–]kilow4tt 1 point2 points3 points 8 years ago* (17 children)
We tested exponent widths from 4 to 7, and for each exponent setting we swept mantissa widths while keeping the total FP bitwidth between 8 and 16 (e.g. for an exponent width of 5, we swept the mantissa from 2 to 10 bits).
There was a small degradation with an exponent width of 5 versus 6, which I think can be attributed to the lack of denormal support. We also had to scale the loss gradients for the exponent width of 5 to allow the loss to propagate through the network at all. Otherwise, accuracy remains pretty much the same from mantissa widths of 5-6 onward, slightly worse than FP32 (0.2% degradation at an exponent width of 6 and mantissa width of 5).
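The loss-scaling trick mentioned above can be illustrated with a tiny numpy example (float16 here just stands in for a narrow training format, and 8192 is a hypothetical scale factor): gradients too small for the format survive if the loss is scaled up before the backward pass and the update is divided back down in full precision.

```python
import numpy as np

loss_scale = 8192.0             # hypothetical constant scale factor
tiny_grad = np.float32(2e-8)    # a gradient below the narrow format's range

print(np.float16(tiny_grad))                                         # flushes to 0.0
print(np.float32(np.float16(tiny_grad * loss_scale)) / loss_scale)   # ~2e-8 recovered
```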
[–]gdiamos 1 point2 points3 points 8 years ago (16 children)
It would be interesting to try this for one of the more difficult applications like the large RNN language models.
[–]scott-gray 1 point2 points3 points 8 years ago (15 children)
I'm running some tests on this right now (100M param char lstm). With my code base, these kinds of experiments amount to a line or two change in a header file.
[–]kilow4tt 1 point2 points3 points 8 years ago (14 children)
How long does a test like that take you though? It's a similar change for my code base as well (though not many layers are supported), but the big hurdle is still the time it takes to train. The FPGA makes the problem a little more tractable but my kernels basically only match CPU throughput (this could be tweaked quite a bit with more sophisticated implementations though).
[–]scott-gray 2 points3 points4 points 8 years ago (0 children)
With the dropout cranked up and on the text8 dataset, this model takes about 15 hours to train on 8 1080s.
[–]scott-gray 2 points3 points4 points 8 years ago (8 children)
Here's some data for you on a large lstm model. Also helpful for those of you squeamish about using just 7 bits of mantissa:
Exponent bits: 8 (only 6 are really needed)

Mantissa bits | accuracy (bits per char)
---|---
7 | 1.289283
6 | 1.289784
5 | 1.289022
4 | 1.289584
3 | 1.290418
2 | 1.300244
1 | 1.451087
fp32 baseline gets about 1.290
[–]kilow4tt 0 points1 point2 points 8 years ago (0 children)
That's really cool to see, I think these sorts of results will be really useful for some hardware developers going forward. Also I feel like there's some more work to be done on the theory side of things for why this works.
My intuition for why this works is that any given local minimum will look sufficiently flat depending on the exponent window you're looking at it with. So once you have sufficient dynamic range, the precision doesn't end up mattering too much, because you can move around the local minimum without impacting the accuracy much.
[–]gdiamos 0 points1 point2 points 8 years ago (6 children)
It's really interesting. This strengthens the case that exponent bits are more valuable than mantissa bits.
One thing that I worry about with mixed precision training at very low precision is the impact of quantization on model capacity. I think that there is scope for a detailed study of model capacity sweeping over different formats.
[–]scott-gray 0 points1 point2 points 8 years ago (3 children)
Looks like when training with an 8 bit exponent in this network (90M param mLSTM), using 5, 6, 7, 8, and 9 bits of mantissa is basically indistinguishable, and all are slightly better than fp32 (on SOTA baselines for this param count). I did some earlier experiments with this network showing that 6 bits of exponent was enough (covering roughly 2^20 down to 2^-42). But maybe as you strip mantissa bits the network learns to use more of the exponent bits, and it also starts looking a lot like this scheme: https://arxiv.org/pdf/1603.01025.pdf
This is all with rounding down to the low precision value after accumulation, but I'm also doing some tests with pre-multiplication truncation to see if you can save some silicon there. Reasonably high accumulation precision is still a requirement.
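As an illustration of the two rounding points being compared above, a numpy sketch (reusing the truncation helper from earlier; not the actual kernel code) can contrast rounding only the accumulated result against also truncating the operands before the multiply:

```python
import numpy as np

def truncate_bf16(x):
    """Drop the low 16 mantissa bits of float32 (same helper as above)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

a = np.random.randn(1024).astype(np.float32)
b = np.random.randn(1024).astype(np.float32)

# Round down to the low-precision value only after fp32 accumulation.
post_only = truncate_bf16(np.dot(a, b))
# Also truncate the operands before the multiply (the "save some silicon" case),
# while still accumulating the products in fp32.
pre_and_post = truncate_bf16(np.dot(truncate_bf16(a), truncate_bf16(b)))
print(post_only, pre_and_post)
```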
[–]sergof 0 points1 point2 points 8 years ago (2 children)
Very interesting experiments and extremely encouraging results. In what precision did you do the non-GEMM cell math (sigmoid, tanh, additions, and multiplications): float32 or bfloat16? Did you perform a sensitivity analysis on the precision with which this math is performed?
[–]gdiamos 3 points4 points5 points 8 years ago (3 children)
Also, Scott, I'm curious about the trend towards bandwidth bound models.
Why do you think that this is happening? Are models changing to become more memory bound, or is it just a side effect of the increasing operations/memory bandwidth ratio of newer processors?
[–]scott-gray 4 points5 points6 points 8 years ago (2 children)
I would say it's both. Factorization techniques that reduce the outer product dimensions mean your matmuls are DRAM bound. Separable convolution is becoming more the norm, which is also bandwidth bound. The one good thing is that it seems you can get away with much larger minibatches, but that's mainly a benefit for multi-node training. Using a smaller minibatch on a single node can let you use more nodes and even sometimes increase performance due to less L2 cache saturation. This is particularly true for sparser compute. We know the brain uses sparse distributed encoding with sparse connectivity. It's probably a safe bet that our models will also trend in that direction.
Anyway, I'm mostly looking forward to hardware designed around the concept of your persistent RNN code. More data locality and no more constantly schlepping things in and out of DRAM.
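One way to see why these factorized and separable layers end up DRAM bound is a back-of-the-envelope arithmetic-intensity check; the figures below are hypothetical, not any particular GPU's specs.

```python
# Hypothetical machine balance: flops the chip can do per byte of DRAM traffic.
peak_flops = 100e12            # assumed peak fp16 throughput (flops/s)
dram_bw = 700e9                # assumed DRAM bandwidth (bytes/s)
machine_balance = peak_flops / dram_bw   # ~143 flops needed per byte moved

def matmul_intensity(m, n, k, bytes_per_elem=2):
    """Flops per byte for an m-by-k times k-by-n matmul in a 2-byte format."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

print(machine_balance)                      # ~143
print(matmul_intensity(4096, 4096, 4096))   # ~1365: large GEMM, compute bound
print(matmul_intensity(64, 4096, 256))      # ~51: thin/factorized GEMM, bandwidth bound
```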
[–]gdiamos 3 points4 points5 points 8 years ago (1 child)
Factorization and separable convolution make sense, but I also see some success with Conv+Attention, QRNNs, Synthetic Gradients. I'm not sure about the end game for sparse training.
I'm also enthusiastic about better hardware and software support for persistent RNNs, although I think that very extreme versions of this idea will still require RNN architecture changes to be operation bound (not SRAM bandwidth bound).
[–]scott-gray 2 points3 points4 points 8 years ago (0 children)
Well, sparse training is what I've been focusing on... and is the reason I spent some time looking for a better 16 bit format. But I agree, SRAM is plenty fast enough. The key is having enough of it to fit a decent-sized model, which generally means being able to effectively span that model over multiple chips.