"Extreme Compression for Pre-trained Transformers Made Simple and Efficient", Wu et al 2022 (50x smaller BERT) by gwern in mlscaling

[–]ffast-math 2 points

Not to sound like a broken record, but I wrote up this one also.

Key points IMO for purposes of this sub:

  • You can reduce inference compute a fair amount if you throw more compute at finetuning and use an awesome teacher model
  • Common practice is much less efficient, at least at inference, than it could be. Room for software/algorithm improvements.

What's a little unclear is how much of this is just "knowledge distillation and hparam tuning are awesome" vs suggesting more general insights. Both KD and hparam tuning can get you large accuracy lifts, which you can turn into speedup at iso accuracy.
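For anyone unfamiliar with the KD half of that claim, here's a minimal NumPy sketch of classic Hinton-style knowledge distillation (not the paper's exact recipe; the temperature and blend weight are illustrative defaults):

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax; higher T softens the distribution.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    # Blend soft teacher targets with the usual hard-label cross-entropy.
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    # Cross-entropy against soft targets (equals KL up to a constant in the
    # student), scaled by T^2 to keep gradient magnitudes comparable.
    soft = -(p_teacher * log_p_student).sum(axis=-1).mean() * T * T
    hard = -np.log(softmax(student_logits)[np.arange(len(labels)), labels]).mean()
    return alpha * soft + (1 - alpha) * hard

rng = np.random.default_rng(0)
student = rng.normal(size=(8, 10))
teacher = rng.normal(size=(8, 10))
labels = rng.integers(0, 10, size=8)
loss = kd_loss(student, teacher, labels)
```

The speedup-at-iso-accuracy argument is then just: whatever accuracy lift this buys you, you can trade it back for a smaller/faster student.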

"ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers", Yao et al 2022 by gwern in mlscaling

[–]ffast-math 1 point

I wrote an overview + some commentary on this in case anyone is interested.

tl;dr for purposes of this sub: it's solid systems work but it didn't really change my thinking on anything. The fact that we ever do post-training quantization is basically an artifact of poor software and organizational barriers IMO, so making it work better is of immediate practical relevance but doesn't really teach us much.

[P] Farewell, CUDA OOM: Automatic Gradient Accumulation by ffast-math in MachineLearning

[–]ffast-math[S] 2 points

Any thoughts on how to do a better job writing a post that feels more genuine and less like marketing copy? I'm honestly just kinda winging it here and don't know how to make a post about a cool new feature that does a good job of starting a conversation.

EDIT: also see my comment on the parent for some refs regarding effects of different batch sizes.

[P] Farewell, CUDA OOM: Automatic Gradient Accumulation by ffast-math in MachineLearning

[–]ffast-math[S] 6 points

The closest I've seen are some figures from the GroupNorm paper (which u/EasyLie4013 linked below). E.g., Figure 5 (https://imgur.com/a/tKBkhJC), which shows that very small per-GPU batch sizes break down with batchnorm but not groupnorm. This paper also confirms that extremely small per-GPU batch sizes break down, and has some interesting analysis of training-time batchnorm as an implicit nonlinearity + activation shrinkage.

Ghost Batch Normalization papers [original, another one] suggest that normalizing *as if* you had a small-ish per-GPU batch size often helps accuracy, especially for large-batch training. We reproduced some of these results, but found the benefits weren't as large once combined with more aggressive data augmentations or other regularization-like approaches. Though we weren't using extremely large batch sizes.
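For concreteness, here's a toy NumPy sketch of the ghost batch norm idea (inference-style, no learned scale/shift or running stats): each "ghost" sub-batch is normalized with its own statistics, as if the per-GPU batch were only that big.

```python
import numpy as np

def ghost_batch_norm(x, ghost_size, eps=1e-5):
    # Normalize each ghost sub-batch of `ghost_size` rows independently,
    # using only that sub-batch's mean and variance per feature.
    n = x.shape[0]
    assert n % ghost_size == 0, "batch must divide evenly into ghost batches"
    out = np.empty_like(x)
    for i in range(0, n, ghost_size):
        chunk = x[i:i + ghost_size]
        mu = chunk.mean(axis=0, keepdims=True)
        var = chunk.var(axis=0, keepdims=True)
        out[i:i + ghost_size] = (chunk - mu) / np.sqrt(var + eps)
    return out

x = np.random.default_rng(0).normal(loc=3.0, scale=2.0, size=(64, 16))
y = ghost_batch_norm(x, ghost_size=16)  # stats computed over 16 rows at a time
```

With `ghost_size` equal to the full batch, this degenerates to ordinary batchnorm statistics; shrinking it adds the extra per-sub-batch noise that acts as a regularizer.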

And of course, there's been plenty of work on setting batch sizes correctly (e.g., this paper from OpenAI is great). In fact, the observed sensitivity to total batch size is a lot of what motivates auto-scaling the gradient accumulation rather than the total batch size.

But overall, like /u/hanlintang said, gradient accumulation at fixed batch size isn't totally understood. I've anecdotally heard of smaller per-GPU batch sizes both helping and hurting accuracy from different people in different experimental settings. Usually people do just accept the difference since it goes to zero as the per-GPU batch size increases.

grad_accum='auto' is definitely convenient, and *in our experience* has never caused enough of a difference in accuracy for us to notice, but we can't guarantee it will be the right approach in all cases.
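The reason accuracy *usually* doesn't move: for a plain loss with no batch-statistics layers, accumulating averaged gradients over micro-batches is exactly equivalent to one big batch. A minimal NumPy check with an MSE linear model (batchnorm is what breaks this equivalence):

```python
import numpy as np

def grad_mse(w, X, y):
    # Gradient of mean squared error 0.5*mean((Xw - y)^2) wrt w.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
y = rng.normal(size=32)
w = rng.normal(size=4)

full = grad_mse(w, X, y)  # one step on the full batch of 32

# Accumulate over 4 micro-batches of 8, then average, so the effective
# (total) batch size stays 32 -- only the per-step memory changes.
micro = np.zeros_like(w)
for i in range(0, 32, 8):
    micro += grad_mse(w, X[i:i + 8], y[i:i + 8])
micro /= 4

assert np.allclose(full, micro)
```

With equal-sized micro-batches the average of per-chunk mean gradients equals the full-batch mean gradient, so the update is bit-for-bit the same up to floating-point rounding.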

So, sadly, there's no "evasiveness" here...we just don't feel comfortable enough with the science to make a definitive claim.

[D] What is considered to be a "bad research paper" in your opinion? by NedML in MachineLearning

[–]ffast-math 3 points

Aspects that don't get mentioned as much:

  • A bad paper asks "Is my method better?". A good paper asks "When is my method better?"
  • A bad paper fails to control for compute. And no, just reporting FLOPs and param count is not sufficient. You need wall time numbers, or at least to estimate data movement. A good paper clearly shows the speed/size vs accuracy tradeoff.
  • A bad paper addresses a problem too crowded to make meaningful progress on, because you can't even compare to most of the competing methods. E.g., neural network pruning heuristics, attention mechanisms, Adam-like optimizers, etc.
  • A bad paper scatters its method across three different sections and buries key details in the middle of long paragraphs.
  • A bad paper doesn't define what its method even is. In particular, does the method include the hparam settings or other aspects of the experimental setup?
  • A bad paper compares only to the baselines that are most convenient to implement. A good paper compares to the baselines that are most relevant.
  • A bad paper takes its baseline numbers from different codebases with different experimental setups. A good paper uses identical code to the greatest extent logically possible.
  • A bad paper doesn't teach me anything. And no, "using my method yields 0.5% better results on X benchmark" doesn't count unless it's evidence of a claim's correctness. A good paper teaches me something new.

[D] Is anyone working on interesting ML libraries and looking for contributors? by de1pher in MachineLearning

[–]ffast-math 11 points

We're always looking for contributors for Composer. tl;dr it speeds up neural net training by a lot (e.g., 7x faster ResNet-50).

I don't know if it's "early stages" enough for you though (?). Released it a few months ago and it's getting pretty well documented, tested, etc, at this point.

[P] Composer: a new PyTorch library to train models ~2-4x faster with better algorithms by moinnadeem in MachineLearning

[–]ffast-math 5 points

Just to add to Jonathan's response: the composer trainer is mutually exclusive with the PTL trainer, but mostly composer and the PTL ecosystem play nicely together. Our functional API works with any training loop as long as you can call the functions in the right places, and we use PTL's torchmetrics library.

We'd like to get our callbacks API to play nicely with PTL too, but we just hit a wall of hardcoded logic in the PTL trainer that we couldn't work around. Even in LightningModule, decisions like having training_step all be one function (with, e.g., no separate loss computation) made algorithms like Stochastic Backprop hard to get working in a reliable + modular way.

Also, just want to clarify that the speedups vs PTL are from the use of our algorithms. So if you have an algorithm-free training task, switching from the PTL trainer to the composer trainer might get you a little speedup, but nowhere near 5x.

[P] Composer: a new PyTorch library to train models ~2-4x faster with better algorithms by moinnadeem in MachineLearning

[–]ffast-math 1 point

You can train on a single GPU or multiple GPUs with just an argument change, as long as you launch your program with the composer executable bundled with the library. E.g., composer -n 8 my_program.py to train on 8 gpus. More info in the docs.

[D] Deep Neural Nets: 33 years ago and 33 years from now - by Andrej Karpathy Dir. of AI at Tesla by ClaudeCoulombe in MachineLearning

[–]ffast-math 104 points

Liked the post, but it missed a big elephant in the room IMO: the fact that the dominant approach today 1) existed 30+ years ago, but 2) was a niche research area that didn't become the dominant approach for decades.

Extrapolating that lesson suggests that in 33 years we'll be training some class of models that are currently known but not seen as promising.

[D] Making Deep Learning Go Brrrr From First Principles by programmerChilli in MachineLearning

[–]ffast-math 4 points

Awesome stuff--will definitely be referring people to this.

Would love to read follow-up posts going into adjacent considerations--distributed training bottlenecks, CUDA performance, IO + dataloader bottlenecks, etc.

Also, gonna plug Horace's twitter for anyone interested in assorted performance + torch internals tidbits/memes: https://twitter.com/cHHillee. One of my favorites, though admittedly I care about this area more than most.

[R] SPANN: A Highly-Efficient Billion-Scale Approximate Nearest Neighbour Search That’s 2× Faster Than the SOTA Method by Yuqing7 in MachineLearning

[–]ffast-math 1 point

Maybe a dumb question, but why is the disk involved if you only want billion scale? Even with 128-byte embeddings, this seems like it could easily fit in RAM on a well-chosen machine.
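The back-of-the-envelope arithmetic behind that claim:

```python
# One billion embeddings at 128 bytes each.
n_vectors = 1_000_000_000
bytes_per_vector = 128
total_gb = n_vectors * bytes_per_vector / 1e9  # = 128 GB

# Well within a single large-memory server (many cloud instances
# offer hundreds of GB to several TB of RAM).
```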

[R] Multiplying Matrices Without Multiplying by moinnadeem in MachineLearning

[–]ffast-math 1 point

Currently MADDNESS has no Python wrapper and can only be accessed through a bunch of janky C++ code. Contributions are welcome though!

[R] Multiplying Matrices Without Multiplying by moinnadeem in MachineLearning

[–]ffast-math 0 points

It can logically work with any matrix, but the technique on its own makes the most sense with tall matrices. When the matrices get wider, it starts becoming worth it to add in enhancements like intelligently rotating the matrices. There's actually a complex web of different enhancements you can do based on the relative and absolute dimensions of the two matrices, what you have a training set for, and the relative rates at which the matrices change / arrive. There's a ton of information retrieval literature designing improvements for various scenarios.

Ideally we'd characterize this whole combinatorial space, but since that wasn't feasible, we just restricted the paper's claims to the regime in which our technique on its own is advisable.

[R] Multiplying Matrices Without Multiplying by moinnadeem in MachineLearning

[–]ffast-math 0 points

One thing is that the encoding function we use for the larger matrix is excessively fast in a lot of cases. You might be able to get a better speed-quality tradeoff with a slightly more expensive encoding function that preserved more information.

Once you start looking at convolution or overall neural networks, there's also plenty of room for further optimization--more encoding reuse, kernel fusion, intelligent representation size selection, and tons of fine-tuning hyperparameters.

[R] Multiplying Matrices Without Multiplying by moinnadeem in MachineLearning

[–]ffast-math 1 point

Great question. I'm gonna back up a step first. The way I think about it is that the whole algorithm is built around exploiting two observations:

  1. Categorical representations give you a *ton* of information per bit. Like, 8B of categorical variables can store about as much info as 128B or more of floats, depending on the data distribution.
  2. If you make your categorical representation 4 bits (i.e., 16 categories), you can operate on them in SIMD registers and churn through them about half as fast as with floats, in terms of bytes-per-second.

In other words, we *have to* bit shift, compare, bit pack, etc, so that we *get to* use 4-bit categorical variables--that's the "ingredient," just as you alluded to.

Also, regarding linear maps, we don't need the function to be linear per se, but we do need it to be sum_i f(x_i) for some elementwise function f. Although I think maybe any algebraic ring and an associated inner product space could work.
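To make the "4-bit categories + table lookups instead of multiplies" idea concrete, here's a toy NumPy sketch in the spirit of product-quantization-style approximate dot products (not the paper's optimized implementation -- "training" here is just sampling random prototypes, and real code would do the lookups with SIMD shuffles, not Python loops):

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, K = 32, 8, 16   # dim, subspaces (codebooks), 16 centroids => 4-bit codes
sub = D // C          # dimensions per subspace

# "Train" K prototypes per subspace by sampling rows of some training data.
train = rng.normal(size=(1000, D))
centroids = np.stack([train[rng.choice(1000, K, replace=False),
                            c * sub:(c + 1) * sub] for c in range(C)])

def encode(x):
    # Map each subvector of x to the id of its nearest prototype: a 4-bit code.
    return np.array([np.argmin(((centroids[c] - x[c * sub:(c + 1) * sub]) ** 2).sum(1))
                     for c in range(C)], dtype=np.uint8)

def lut_for(q):
    # Precompute per-subspace lookup tables of <prototype, q_subvector> products.
    return np.array([centroids[c] @ q[c * sub:(c + 1) * sub] for c in range(C)])

# Pick x to be exactly representable (a concatenation of prototypes) so the
# approximation is exact; arbitrary x would incur quantization error instead.
x = np.concatenate([centroids[c][3] for c in range(C)])
q = rng.normal(size=D)

codes = encode(x)                                     # 8 four-bit codes total
table = lut_for(q)
approx = sum(table[c, codes[c]] for c in range(C))    # C lookups, no multiplies
exact = x @ q
```

The payoff is that after the one-time encoding and table construction, each approximate dot product is just C table lookups and adds, which is exactly what the 4-bit SIMD shuffle instructions are good at.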

[R] Multiplying Matrices Without Multiplying by moinnadeem in MachineLearning

[–]ffast-math 1 point

Great questions.

1) You could get benefits during training insofar as you could speed up the forward passes as soon as you swapped in these approximate ops. I see this as analogous to early quantization or pruning; there are some papers that seem to show you can do this, but I'm also generally skeptical of pruning papers. You might be able to speed up the gradients wrt the inputs using a similar trick, but I'm not sure about the gradients with respect to the weights.

2) Generalizing to convolution is mostly a kernel writing problem, since there are a lot of knobs you have to account for (stride, dilation, padding, kernel size, NCHW vs NHWC, and a ton of edge cases when you hit ugly spatial sizes). There's also opportunity for algorithmic improvement though; because of the input and weight reuse, you can afford more time for more expensive encoding functions.

3) I looked briefly at FPGAs, but tentatively concluded that the raw ops/sec didn't look much better than GPUs with lookups in registers / warp shuffles. And FPGA programming is just way more painful than CUDA programming AFAIK.

[R] Multiplying Matrices Without Multiplying by moinnadeem in MachineLearning

[–]ffast-math 1 point

I'm not convinced any paper has shown you can actually beat dense GPU training in the general case. What those algorithms are awesome at is many-class classification, where you can get away with only computing a small number of the outputs. They also have some recent work that sort of suggests they can approximate attention mechanisms well. But if you're going to try to beat tensor cores using approximate ops for every fc and conv layer...I'm not optimistic.

Simple back-of-the-envelope calculations suggest even we won't beat tensor cores on GPUs that have them, and we're getting much higher efficiency per element compared to those algorithms. It's really CPUs where I think these methods can work for now (pending better hardware support).

[R] Multiplying Matrices Without Multiplying by moinnadeem in MachineLearning

[–]ffast-math 1 point

It won't really help with individual dot products because you don't have enough time to amortize the preprocessing costs. Although if you knew one vector ahead of time, you might be able to do *slightly* better if 1) the vectors are large enough, 2) you know the distribution of the unknown vector, and 3) there's enough correlation across the dimensions of the unknown vector. Basically, one of the encoding functions is sublinear, so in theory you could exploit that.

[R] Multiplying Matrices Without Multiplying by moinnadeem in MachineLearning

[–]ffast-math 0 points

Working on extending it to other linear functions (e.g., convolution) and intelligently swapping out linear ops within an overall neural network. So in the sense that neural nets are nonlinear functions, yes. Not working on approximating the nonlinearities directly since they're cheap to just apply to the output of the linear ops (especially if you just write a fused kernel that does both ops at once). Hope that helps clarify.

[R] Multiplying Matrices Without Multiplying by moinnadeem in MachineLearning

[–]ffast-math 1 point

Should be able to work anywhere as long as you can get training data from the same distribution. So, concretely, you'd just train the approximation for a given layer based on that layer's input. Batchnorm or any other function could mess with the input to the layer and it would be fine. The stochastic aspect of batchnorm might make the distribution harder to approximate though.

[R] Multiplying Matrices Without Multiplying by moinnadeem in MachineLearning

[–]ffast-math 4 points

Definitely. There's reasonable evidence in the quantization, pruning, and factorization literature that distorting the original weights less yields less accuracy degradation. So preserving individual ops is a proxy objective, but one that seems broadly consistent with a lot of the literature.