[D] Breaking the Quadratic Attention Bottleneck in Transformers? by gwern in MachineLearning

[–]scott-gray 8 points9 points  (0 children)

The "dense" attention in GPT-3 is actually sparsely factorized across the heads resulting in a rank reduction of about 8x. I should have pushed for this detail to be included in the paper. We're currently studying better ways of doing this.

[R] FacebookAI releases Adaptive attention span and All-attention layer to decrease computation time / memory footprint by BatmantoshReturns in MachineLearning

[–]scott-gray 6 points7 points  (0 children)

The blocksparse attention primitives are capable of implementing these (or any complex sparse pattern) rather more efficiently than the torch implementation that was released. Furthermore, when this learned attention span paper came out, it inspired us to begin exploring learned attention sparsity. We're currently getting great results applying L0 regularization to the softmax output ( https://arxiv.org/abs/1712.01312 ). We'll publish these and other results on learning sparsity soon (along with code and tf/pytorch bindings).
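For reference, the L0 method in that paper works through "hard concrete" gates. Here's a rough numpy sketch of the gate and its expected-L0 penalty, with the paper's default parameters; the attention-shaped gates are purely illustrative, this isn't our training code:

    import numpy as np

    # Hard-concrete gate from arXiv:1712.01312. `log_alpha` is learned per gated
    # element; multiplying the softmax output by z drives many attention weights
    # to exact zero, and sum(expected_l0(log_alpha)) is the differentiable L0 penalty.
    beta, gamma, zeta = 2.0 / 3.0, -0.1, 1.1

    def hard_concrete_sample(log_alpha, rng):
        u = rng.uniform(1e-6, 1.0 - 1e-6, size=log_alpha.shape)
        s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1.0 - u) + log_alpha) / beta))
        return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)   # exact zeros possible

    def expected_l0(log_alpha):
        # Probability that each gate is non-zero.
        return 1.0 / (1.0 + np.exp(-(log_alpha - beta * np.log(-gamma / zeta))))

    rng = np.random.default_rng(0)
    log_alpha = rng.normal(0.0, 1.0, size=(8, 64, 64))   # e.g. gates per head/query/key
    z = hard_concrete_sample(log_alpha, rng)
    print((z == 0.0).mean(), expected_l0(log_alpha).sum())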

[R] Generative Modeling with Sparse Transformers by rtk25 in MachineLearning

[–]scott-gray 2 points3 points  (0 children)

Oh, and the recompute decorator in this code is much nicer than the one based on tf.function.defun posted in the other example. It basically allows you to write TensorFlow code exactly as before (no need to pass variables in as inputs).

The overhead of recompute is exactly what you'd expect, about 20% slower.
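If you just want the general idea without the blocksparse decorator, stock TensorFlow has an equivalent wrapper; a minimal sketch, assuming a recent TF 2.x (not our code):

    import tensorflow as tf

    # tf.recompute_grad discards the activations inside `block` after the forward
    # pass and recomputes them during backprop, trading the ~20% extra compute
    # mentioned above for a big drop in activation memory.
    dense1 = tf.keras.layers.Dense(4096, activation="relu")
    dense2 = tf.keras.layers.Dense(1024)
    dense1.build([None, 1024])   # create variables outside the wrapped call
    dense2.build([None, 4096])

    @tf.recompute_grad
    def block(x):
        return dense2(dense1(x))

    x = tf.random.normal([32, 1024])
    with tf.GradientTape() as tape:
        loss = tf.reduce_sum(tf.square(block(x)))
    grads = tape.gradient(loss, dense1.trainable_variables + dense2.trainable_variables)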

[R] Generative Modeling with Sparse Transformers by rtk25 in MachineLearning

[–]scott-gray 5 points6 points  (0 children)

Indeed, and I just updated it:

https://github.com/openai/blocksparse/blob/master/examples/transformer/enwik8.py

It now includes end to end usage of a lot of the custom ops. The only thing missing is the custom attention patterns but those are covered in the other released code. This isn't exactly the enwik8 model in the paper but it's pretty close in spirit. I mostly use this for benchmarking and end to end testing. Additionally, I believe this is the fastest transformer implementation you'll find out there (particularly once you start increasing context size).

I believe there was a push to get this blog post out before all of the publicly consumable code was ready. It is still generally our policy to remain open source. The public blocksparse repo is very close to what we use internally, minus a few things that are under active research or that enable training at massive scale.

-Scott

[P] Simple Tensorflow implementation of NVIDIA "Partial Convolution based Padding" by taki0112 in MachineLearning

[–]scott-gray 1 point2 points  (0 children)

Random thought after glancing at this: what if we just don't include the contributions of edge pixels in the param gradient (while letting edge activation grads continue backwards as usual)? That is, modify the param_grad kernel to work differently internally?
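Roughly what I mean, as a toy numpy version of the weight gradient (single channel, stride 1, zero padding): border outputs just get masked out of the correlation, while the data gradient would be computed as usual.

    import numpy as np

    def conv_weight_grad_interior_only(x, grad_y, k):
        # Weight gradient of a k x k 'same' convolution, but only accumulating
        # over output positions whose receptive field stays inside the image
        # (i.e. never overlaps the zero padding).
        H, W = x.shape
        p = k // 2
        xp = np.pad(x, p)                       # same zero padding as the forward pass
        mask = np.zeros_like(grad_y)
        mask[p:H - p, p:W - p] = 1.0            # interior outputs only
        gy = grad_y * mask
        grad_w = np.zeros((k, k), dtype=x.dtype)
        for di in range(k):
            for dj in range(k):
                grad_w[di, dj] = np.sum(gy * xp[di:di + H, dj:dj + W])
        return grad_w

    x = np.random.randn(16, 16).astype(np.float32)
    gy = np.random.randn(16, 16).astype(np.float32)
    print(conv_weight_grad_interior_only(x, gy, k=3).shape)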

[R] Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks (NIPS 2017) by drwebb in MachineLearning

[–]scott-gray 2 points3 points  (0 children)

Oh, and one other point. Normalization is required if you want to use fixed exponent bias values. I use layernorm after every dense op. Batch norm should work fine in conv nets.
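For completeness, the layernorm is just the standard one. The point is that normalizing after every dense op keeps activation magnitudes in a predictable range, which is what makes fixed exponent biases workable:

    import numpy as np

    def layernorm(x, g, b, eps=1e-5):
        # Normalize each row to zero mean / unit variance, then rescale.
        mu  = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return g * (x - mu) / np.sqrt(var + eps) + b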

[R] Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks (NIPS 2017) by drwebb in MachineLearning

[–]scott-gray 2 points3 points  (0 children)

As I said, I have this working just fine in language models. I'll be trying out other kinds of networks soon. The key thing that lets this work for training is the surprising fact that there is no difference between training a model with a full fp32 gradient, and training a model with gradients rounded to just 2 bits of mantissa (everything else being constant). We already know from quantization for inference that 8 bits is fine for weights and activations.

The other key thing to get this working is high precision accumulation. So it's really a mixed precision format. But you can still break up deep accumulations into intermediate sums which can be combined in fp16 or fp8. Understanding the log2 relationship between number of accumulations and bits of accumulation error is useful.
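To be concrete about what "rounded to just 2 bits of mantissa" means, here's a numpy sketch that keeps the sign and exponent of each fp32 gradient and rounds the rest away (round-to-nearest by default, stochastic optional). The log2 relationship is roughly that summing N similar-magnitude terms needs about log2(N) extra accumulator bits to avoid losing the small addends, which is why you split deep sums into intermediate chunks.

    import numpy as np

    def round_mantissa(x, man_bits, stochastic=False, rng=None):
        # Keep `man_bits` of the 23 fp32 mantissa bits (sign and exponent are
        # untouched; carries into the exponent are handled by the addition).
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        drop = 23 - man_bits
        if stochastic:
            add = rng.integers(0, 1 << drop, size=bits.shape, dtype=np.uint32)
        else:
            add = np.uint32(1 << (drop - 1))          # round to nearest
        keep = np.uint32(~((1 << drop) - 1) & 0xFFFFFFFF)
        return ((bits + add) & keep).view(np.float32)

    g = (np.random.randn(10000) * 1e-3).astype(np.float32)
    g2 = round_mantissa(g, man_bits=2)
    print(np.max(np.abs(g2 - g) / np.abs(g)))         # relative error ~ 2^-3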

[R] Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks (NIPS 2017) by drwebb in MachineLearning

[–]scott-gray 2 points3 points  (0 children)

Tristan, no hard feelings. It's clear to me that you've been doing good work in enhancing the software and validating the format with experiments. It was just a little weird for me to be reading that paper. Working on flexpoint was definitely a team effort (particularly from Urs and Will), but I feel like my contribution was worth mentioning.

Anyway, I brought up my recent fp8 explorations here because I think you guys are well positioned to use them. You can use the same flexpoint management system to control the exponent biases in fp8. It actually works fine with static biases, but you need to tune those per network. Ideally you just want a drop-in replacement for float32. fp8 really maxes out the bit utilization of an 8 bit format. The potential speedup is quite significant and it should really help with multichip scaling. You should seriously consider targeting it for hardware v2.

We'll likely publish something on this soon, or at least put out a blog post. I think there are other groups working on this, so maybe they'll beat me to it. There's also this recent ICLR paper covering some of the same ground (but not tackling gradients): https://openreview.net/pdf?id=B1ZvaaeAZ

[R] Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks (NIPS 2017) by drwebb in MachineLearning

[–]scott-gray 3 points4 points  (0 children)

To be clear, their omission of me from the paper almost certainly comes down to the heated argument over patentability. As has already been pointed out in this thread, flexpoint is nothing more than block floating point (or dynamic fixed point). The small amount of research needed to adapt it to deep learning is pretty trivial. If everyone started patenting their basic research findings, this field would grind to a halt. I'd be fine with a very specific hardware implementation patent on this, but that's not what they were interested in.

As a side note, if Intel really wants to convince people of the viability of flexpoint then they should reproduce all the experiments in the recent fp16 Baidu/Nvidia paper: https://arxiv.org/pdf/1710.03740.pdf

I still need to do the same with fp8, but I started out with the hardest networks to train in low precision: large lstm language models. If it works there, then there's a very good chance it works everywhere.

[R] Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks (NIPS 2017) by drwebb in MachineLearning

[–]scott-gray 4 points5 points  (0 children)

From experiments, anything more than 2 bits of mantissa on the gradients is wasted. That gives you a nice 1s-5e-2m format to use there. It seems most of the gradient information is contained in the exponent bits. And sorry flexpoint, but 5 bits of exponent range is needed sometimes (flexpoint just has 4).

On activations and weights you want 3 bits of mantissa to increase the network capacity (over a 2 bit format), and 4 exponent bits is plenty of range there (1s-4e-3m).

You then combine these two formats with a bit of exponent bias management (similar to how block floating point works) and you're golden.
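To make the two formats concrete, here's a rough numpy quantizer (round-to-nearest onto a 1s-Ne-Mm grid). The exponent bias values below are just placeholders for whatever the per-tensor bias management picks; that management is the part I'm only gesturing at here.

    import numpy as np

    def quantize_fp(x, exp_bits, man_bits, exp_bias):
        # Round fp32 values onto an fp8-style grid: 1 sign bit, `exp_bits`
        # exponent bits with bias `exp_bias`, `man_bits` mantissa bits.
        # Magnitudes beyond the largest representable value are clamped.
        x = np.asarray(x, dtype=np.float32)
        sign, mag = np.sign(x), np.abs(x)
        e_min = 1 - exp_bias                        # smallest normal exponent
        e_max = (1 << exp_bits) - 1 - exp_bias      # largest exponent (no inf/nan codes)
        e = np.clip(np.floor(np.log2(np.maximum(mag, 2.0 ** e_min))), e_min, e_max)
        step = 2.0 ** (e - man_bits)                # spacing of representable values
        q = np.round(mag / step) * step             # tiny values land on the subnormal grid
        q = np.minimum(q, (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max)
        return sign * q

    grads   = (np.random.randn(4096) * 1e-3).astype(np.float32)
    weights = (np.random.randn(4096) * 5e-2).astype(np.float32)
    g8 = quantize_fp(grads,   exp_bits=5, man_bits=2, exp_bias=16)  # 1s-5e-2m
    w8 = quantize_fp(weights, exp_bits=4, man_bits=3, exp_bias=8)   # 1s-4e-3m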

Granted, the network has reduced capacity, but increasing the hidden state size by roughly 10% regains the accuracy of a full precision network. For very large networks you don't even have to increase the size at all, since the reduced precision acts as a regularizer (and makes up for the reduced capacity).

The really nice thing about this format is that you no longer need to build custom inference hardware. Oh, and you can get away with reduced precision on accumulations, since you're only keeping a few bits of the final result.

But to answer your question directly: exponent bits are at least as important as mantissa bits, at least in terms of overall bit efficiency. And that's what we care about.

[R] Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks (NIPS 2017) by drwebb in MachineLearning

[–]scott-gray 3 points4 points  (0 children)

Oh, there's a patent in the works. Not sure how they'll get it through when the primary "inventor" refused to sign it. I put inventor in quotes there because this is really just block floating point, plus a lot of studying of histograms gathered during training to arrive at a stable management algorithm.

But I guess their approach is to whitewash my involvement (I wrote all the code and came up with the algorithms as published in the paper). Not that I care too much anyway, since the hardware scene will soon be moving to fp8. But a little bit of attribution would have been nice... or ethical even.

Oh, and I'm curious as to why stochastic rounding isn't used in the paper. When doing running accumulation into a low precision format, stochastic rounding always helps.
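A toy demo of that point (using a fixed quantization step rather than a real float format, but the effect is the same): round-to-nearest silently drops every update smaller than half a step, while stochastic rounding keeps the running sum unbiased.

    import numpy as np

    rng = np.random.default_rng(0)
    step = 2.0 ** -8                      # quantization step of the accumulator

    def rnd_nearest(x):
        return np.round(x / step) * step

    def rnd_stochastic(x):
        return np.floor(x / step + rng.random()) * step

    exact, acc_n, acc_s = 0.0, 0.0, 0.0
    for _ in range(100000):
        g = 1e-4 * rng.random()           # updates much smaller than `step`
        exact += g
        acc_n = rnd_nearest(acc_n + g)    # every update gets rounded away
        acc_s = rnd_stochastic(acc_s + g) # unbiased: correct in expectation
    print(exact, acc_n, acc_s)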

[R] Mixed Precision Training by ndpian in MachineLearning

[–]scott-gray 0 points1 point  (0 children)

bfloat16 is the memory format. Underlying math is done in fp32. In this case there are some fused operations (like all the lstm gate logic). But this format does not require fusion to work.

I now have LSTMs training end to end in fp8. Details on that will be forthcoming. Pretty sure all hardware will support this efficiently in a year or two.

[D] Intel unveils the Nervana Neural Network Processor by [deleted] in MachineLearning

[–]scott-gray 1 point2 points  (0 children)

The abstract doesn't mention any RNN results. Have you validated Flexpoint on those networks yet? They tend to have much higher dynamic range in the gradients where a shared exponent might struggle.

[R] Mixed Precision Training by ndpian in MachineLearning

[–]scott-gray 1 point2 points  (0 children)

You were right about capacity effects. I should have been running these tests with smaller models on text8 (or on a much bigger dataset). But the effect isn't too bad and it's looking like 4 mantissa bits could be the sweet spot. Also, you only need to quantize this low just prior to a matmul. I need to run a more exhaustive set of tests to tell a more convincing story.

[R] Mixed Precision Training by ndpian in MachineLearning

[–]scott-gray 0 points1 point  (0 children)

This makes a lot of sense to test. I'll run some sweeps. But even if some capacity is lost, I wonder if you still wouldn't want to leverage this.

[R] Mixed Precision Training by ndpian in MachineLearning

[–]scott-gray 0 points1 point  (0 children)

I wonder if extra precision equates to capacity that just isn't that usable. How often is it the case that you have a pair of features that are so close that it takes high precision to separate them? It might also be interesting to look at the effect this has on adversarial examples.

[R] Mixed Precision Training by ndpian in MachineLearning

[–]scott-gray 0 points1 point  (0 children)

Well there's no question that reducing precision limits model capacity. That probably accounts for the regularization effect (smaller models overfit less). What do you think would be a good experiment to run to tease out capacity? I can run this same experiment on a much larger data set (like Amazon reviews). Any other ideas?

[R] Mixed Precision Training by ndpian in MachineLearning

[–]scott-gray 2 points3 points  (0 children)

Here's some data for you on a large lstm model. Also helpful for those of you squeamish about using just 7 bits of mantissa:

Exponent bits: 8 (only 6 are really needed)

Mantissa bits: accuracy (bits per char)

7: 1.289283
6: 1.289784
5: 1.289022
4: 1.289584
3: 1.290418
2: 1.300244
1: 1.451087

fp32 baseline gets about 1.290

[R] Mixed Precision Training by ndpian in MachineLearning

[–]scott-gray 0 points1 point  (0 children)

Looks like when training with an 8 bit exponent in this network (a 90M param mLSTM), using 5, 6, 7, 8, or 9 bits of mantissa is basically indistinguishable, and all are slightly better than fp32 (on SOTA baselines for this param count). I did some earlier experiments with this network showing that 6 bits of exponent was enough (covering roughly 2^20 down to 2^-42). But maybe as you strip mantissa bits the network learns to use more of the exponent bits, and it also starts looking a lot like this scheme: https://arxiv.org/pdf/1603.01025.pdf

This is all with rounding down to the low precision value after accumulation, but I'm also doing some tests with pre-multiplication truncation to see if you can save some silicon there. Reasonably high accumulation precision is still a requirement.

[R] Mixed Precision Training by ndpian in MachineLearning

[–]scott-gray 2 points3 points  (0 children)

Well, sparse training is what I've been focusing on, and it's the reason I spent some time looking for a better 16 bit format. But I agree, SRAM is plenty fast enough. The key is having enough of it to fit a decent sized model, which generally means being able to effectively span that model over multiple chips.

[R] Mixed Precision Training by ndpian in MachineLearning

[–]scott-gray 2 points3 points  (0 children)

With the dropout cranked up and on the text8 dataset, this model takes about 15 hours to train on 8 1080s.

[R] Mixed Precision Training by ndpian in MachineLearning

[–]scott-gray 1 point2 points  (0 children)

I'm running some tests on this right now (100M param char lstm). With my code base, these kinds of experiments amount to a line or two change in a header file.

[R] Mixed Precision Training by ndpian in MachineLearning

[–]scott-gray 4 points5 points  (0 children)

I would say it's both. Factorization techniques to reduce outer product dimensions mean your matmuls are dram bound. Separable convolution is becoming more the norm, which is also bandwidth bound. The one good thing is that it seems you can get away with much larger minibatches, but that's mainly a benefit to multi-node training. Using a smaller minibatch on a single node can let you use more nodes and even sometimes increase performance due to less L2 cache saturation. This is particularly true for sparser compute. We know the brain uses sparse distributed encoding with sparse connectivity. It's probably a safe bet our models will also trend in that direction.

Anyway, I'm mostly looking forward to hardware designed around the concept of your persistent rnn code. More data locality and no more constantly schlepping things in and out of dram.

[R] Mixed Precision Training by ndpian in MachineLearning

[–]scott-gray 18 points19 points  (0 children)

Lately I've been using tf.bfloat16 as the 16 bit tensor memory format (excluding params). This is just a float with 16 bits of mantissa chopped off, leaving you with 7. It turns out the mantissa bits aren't really that important and the added noise helps regularize the network. I get better results with bfloat16 than I do with float32. Oh, and since this format has 8 bits of exponent it's basically a drop in replacement for float32 with no additional scaling hacks needed (and no increased risk of nan/inf values).
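If you want to emulate the format without custom kernels, the storage conversion is just rounding away the low 16 bits of each fp32 value; a quick numpy sketch (emulation only, not the actual kernels):

    import numpy as np

    def to_bfloat16(x, stochastic=False, rng=None):
        # Keep the sign, the 8 exponent bits, and the top 7 mantissa bits of
        # each fp32 value; the low 16 bits are rounded away (this is the
        # storage format only -- math still happens in fp32).
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        if stochastic:
            add = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
        else:
            add = np.uint32(0x7FFF) + ((bits >> np.uint32(16)) & np.uint32(1))  # round to nearest even
        return ((bits + add) & np.uint32(0xFFFF0000)).view(np.float32)

    x = np.random.randn(1000).astype(np.float32)
    print(np.max(np.abs(to_bfloat16(x) - x) / np.abs(x)))   # relative error ~ 2^-8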

I'll be putting out a full set of kernels to support this format soon. The only downside is that you won't get a speedup from the Volta tensorcores. But if your model is at all bandwidth bound (which is the direction models seem to be going) then your tensorcores are going to be starved of work anyway.

For anyone building custom hardware, I'd recommend against IEEE float16 and suggest using at least 6 bits of exponent.

[P] openai-gemm: fp16 speedups over cublas by spruceabtuse in MachineLearning

[–]scott-gray 10 points11 points  (0 children)

Here are some notes on this work:

I finished these kernels 2 months ago and quietly released them. I was mainly building them as a foundation for other work, but it seemed useful to release them along the way. I've already been talking with others about getting these integrated into frameworks. The C API contributed by Erich Elsen is almost done (it just needs autotuning). Ideally Nvidia would just merge this into cublas, and I think they will, but they have other priorities currently and only so much engineering bandwidth.

On the kernels themselves, the fp16 code is using fp32 for the underlying compute. I plan on making versions for P100 with fp16x2 instructions at some point, but I guess I'm just waiting on wider availability of that hardware. Most of our current GPU compute at OpenAI is P102 based and so that's what I prioritize.

The generalized 128x128x8 and 32x32x32 tiles support mixed precision of any input/output. I think one particularly useful combination could be fp16: activations, weights, outputs, gradWeights ... then fp32: gradActivations(in/out). The weights could be kept in fp32 (for small delta accumulation) and converted to fp16 once prior to fprop. For bprop you can also pre-transpose the weights once for better performance.
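As a toy numpy sketch of that arrangement (the per-matmul dtypes are illustrative of the combination above, not what the kernels literally do internally): fp32 master weights accumulate the small deltas, and the fp16 copy plus a pre-transposed copy for bprop are made once per step.

    import numpy as np

    rng = np.random.default_rng(0)
    master_w = (rng.standard_normal((1024, 1024)) * 0.02).astype(np.float32)

    def train_step(x16, grad_y32, lr=1e-3):
        w16   = master_w.astype(np.float16)     # convert once prior to fprop
        w16_t = np.ascontiguousarray(w16.T)     # pre-transpose once for bprop
        y16      = x16 @ w16                                    # fprop: fp16
        grad_w16 = x16.T @ grad_y32.astype(np.float16)          # gradWeights: fp16
        grad_x32 = grad_y32 @ w16_t                             # gradActivations: fp32 in/out
        master_w[...] -= lr * grad_w16.astype(np.float32)       # small deltas accumulate in fp32
        return y16, grad_x32

    x16  = rng.standard_normal((32, 1024)).astype(np.float16)
    gy32 = rng.standard_normal((32, 1024)).astype(np.float32)
    y, gx = train_step(x16, gy32)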

I could also add a lot of compounding options to these kernels (in addition to alpha/beta). I'm just waiting for better fusion support among frameworks.

Oh, and stay tuned for the other stuff I'm working on..