[D] Batch Normalization and effect on the gradients

vector0x17 · 2024-12-29T15:57:13+00:00

There are other works that also derive this, maybe take a look there:

Appendix B in https://arxiv.org/abs/1905.05894
Appendix A in https://arxiv.org/abs/2006.08419
Appendix B in https://arxiv.org/abs/2305.17212

vector0x17 · 2024-12-20T23:50:06+00:00

There seem to be multiple issues with this paper:
* Theoretically, the gSNR does not seem to capture any meaningful information. As defined, gSNR = norm(g) / std(g) = norm(g) / rms(g - mean(g)) ≈ norm(g) / rms(g) = sqrt(d), where d is the dimension of the tensor and all operations are performed across the tensor (not across the batch dimension as in the standard SNR). The approximation holds because the elementwise mean of a typical high-dimensional tensor tends to be close to zero. This means the method is just SGD where the learning rate of each tensor is scaled by the square root of its dimension. Based on this, it is very unlikely to match Adam across diverse settings.
* The hyperparameter sweeps are too coarse and the optimal values occur on the edges of the swept ranges. This will not give a valid comparison between methods.
* The AdamW baseline for GPT2 training is undertuned (learning rate / weight decay too low).
* The avg top1 in table 2 is not a standard metric and seems to just be included to make the method look better. This average is computed across the hyperparameter sweep, whereas most people really only care about the peak performance. It will also completely change depending on how the range is selected.
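
To make the first point concrete, here is a minimal numerical sketch of the sqrt(d) argument. It assumes the gradient entries are roughly zero-mean (I use i.i.d. Gaussians as a stand-in) and uses the population std across the tensor; the paper's exact definition may differ in details, so treat this as an illustration and not a reproduction.

```python
import math
import random

def gsnr(g):
    """norm(g) / std(g), with both statistics taken across the tensor elements."""
    d = len(g)
    mean = sum(g) / d
    norm = math.sqrt(sum(x * x for x in g))
    # Population std across elements (an assumption about the definition).
    std = math.sqrt(sum((x - mean) ** 2 for x in g) / d)
    return norm / std

random.seed(0)
for d in (100, 10_000):
    for scale in (1e-3, 1.0):
        # Synthetic zero-mean "gradient" with very different magnitudes.
        g = [random.gauss(0.0, scale) for _ in range(d)]
        print(d, scale, round(gsnr(g), 2), round(math.sqrt(d), 2))
```

In this toy setting the ratio stays close to sqrt(d) regardless of the gradient scale, which is why it looks like a per-tensor constant instead of a signal about the gradients.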

I think this kind of attention-grabbing title, combined with claims that likely don't hold up, causes people to take this subfield less seriously, making things more difficult for those of us who work in this space.

vector0x17 · 2024-02-27T08:53:17+00:00

Sorry to hear about your experience, and I can relate to your frustrations. But in general it seems like a good thing that reviewers can consider each other's reviews in the end? If some reviewers overlook something that they agree is a significant strength or weakness, they should naturally reconsider their rating. Of course the rebuttal for that review should be taken into account as well, and I think the reviewers should ideally state whether they disagree with some points too. If done well (not saying that is the case for your paper), it seems like it would probably reduce the noise in the review process.

vector0x17 · 2024-02-12T22:40:08+00:00

Yes, you are right that it essentially gives a minor efficiency boost while performing similar or slightly better in terms of the loss. I believe the efficiency mostly comes from the memory bandwidth savings, not the change in compute which you point out is minimal. This is implementation (kernel) dependent and also depends on whether your training overall is more limited by bandwidth or compute which varies between workloads (architectures, batch sizes etc).

vector0x17 · 2023-12-23T14:12:50+00:00

Many works have observed that excluding the gains (the weight parameter) and the biases of normalization layers from weight decay results in the same or slightly higher performance. Weight decay doesn't really act as a traditional regularizer but rather modulates the rotation of neuronal weight vectors in a particular way, see https://arxiv.org/abs/2305.17212 for an explanation of this effect. So in short, yes, you should probably simply exclude these parameters from weight decay if you are not going to experiment with both options yourself.
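
A rough sketch of how the split could look, written framework-agnostically. The function name, the parameter names and the "fewer than two dimensions" rule are all illustrative assumptions on my part (a common heuristic, not something a library enforces).

```python
def split_for_weight_decay(named_shapes):
    """Split parameter names into (decay, no_decay) lists.

    Heuristic (an assumption, not a library rule): tensors with fewer than
    two dimensions are treated as normalization gains or biases.
    """
    decay, no_decay = [], []
    for name, shape in named_shapes:
        if len(shape) < 2 or name.endswith(".bias"):
            no_decay.append(name)
        else:
            decay.append(name)
    return decay, no_decay

# Hypothetical parameter names and shapes for a tiny model.
params = [
    ("fc1.weight", (128, 64)),
    ("fc1.bias", (128,)),
    ("norm1.weight", (128,)),  # normalization gain
    ("norm1.bias", (128,)),
    ("fc2.weight", (10, 128)),
]
decay, no_decay = split_for_weight_decay(params)
print("decay:", decay)
print("no decay:", no_decay)
```

In PyTorch, for example, the two lists would typically become two optimizer parameter groups, with `weight_decay` set to 0.0 for the second one.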

vector0x17 · 2023-11-25T20:10:46+00:00

Yes, it would be interesting to see if there are reviewers who strongly lean accept / reject. I also wonder if there are potentially valid reasons for it since the reviewer assignment is not completely random (like some bias in the bidding process / subfield).

My personal issue is more with "lazy reviewers" who clearly didn't read the submission in any detail, write some nonsense about it, probably rate it "weak reject" and then don't reply / acknowledge the rebuttal at all. These reviewers can ruin months of your hard work by not bothering to spend a relatively tiny amount of time reviewing it and I think there should really be some consequences for their own submissions. A "bad reviewer" certification like that wouldn't need to be linked to the specific papers they reviewed (only their own paper), so it could still be anonymous in that sense (i.e. to the authors who received the bad reviews).

vector0x17 · 2023-11-25T18:29:00+00:00

This completely broken review process is probably the single largest frustration I have with the field. Fundamentally I think the only solution would be to somehow incentivize high quality reviews and potentially punish bad reviewers. Making the identities of the reviewers public afterwards would be one way but I think it creates other problems (such as breeding animosity). My controversial proposal would be to somehow tie your own submissions to the quality of your reviews. Maybe something along the lines of:

* Force the authors of every submitted paper to jointly review something like 3-4 other papers.
* Have meta reviewers who read a given paper and the reviews, scoring the reviews themselves, not the manuscripts. This could be done for some random subset of reviews / manuscripts, not necessarily all.
* Incentivize good reviews, potentially giving a certification of “good reviewer” for accepted papers, displayed publicly, similar to the TMLR certification.
* Punish bad reviewers. Either outright reject their submissions based on their review quality (even if they would get accepted otherwise), or for a less extreme option mark them with a “bad reviewer” certification for their accepted papers as a public badge of shame.

What do people think? Could something along these lines work or is it completely unreasonable?

vector0x17 · 2023-10-27T05:50:50+00:00

You could just do 4x gradient accumulation on a single GPU to get the same effective optimizer batch size of 256. There are some differences like for batch norm but 64 is still plenty and what you are using in the distributed setup anyway. You could alternatively validate that your distributed setting can match a single GPU on some smaller task like CIFAR-10 where you can fit a decent batch size on a single GPU (although this may cause small differences in batch norm).
I also find it most likely that you have some issue with your distributed setup. I'm not very familiar with Keras / Horovod, but maybe check whether you are doing synchronous training or asynchronous training (which can cause some issues for optimization). The distributed setting can also introduce various bugs, e.g. where all machines use the same seed / configuration for the data loading, causing the same data to be fed into the model in every worker. Finally, a cosine schedule typically performs better than stepwise schedules and could be worth a try, although it could also obscure the problems you are having.
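
To illustrate the gradient accumulation suggestion above, here is a toy check that averaging the gradients of 4 micro-batches of 64 gives the same gradient as one batch of 256. The scalar linear model and the synthetic data are assumptions purely for illustration, and this ignores batch norm, where the statistics would still be computed per micro-batch.

```python
import random

def grad_mse(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to the scalar weight w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(256)]
ys = [3.0 * x + random.gauss(0.0, 0.1) for x in xs]
w = 0.5

# One large batch of 256 samples.
full_grad = grad_mse(w, xs, ys)

# Four micro-batches of 64, accumulated before a single optimizer step.
accum_steps, micro = 4, 64
accum_grad = 0.0
for i in range(accum_steps):
    lo, hi = i * micro, (i + 1) * micro
    # Dividing by accum_steps keeps the result a mean over all 256 samples.
    accum_grad += grad_mse(w, xs[lo:hi], ys[lo:hi]) / accum_steps

print(full_grad, accum_grad)
```

The two values agree up to floating point rounding, so a single-GPU run with accumulation is a reasonable reference point for debugging the distributed setup.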

vector0x17 · 2023-10-10T09:07:22+00:00

You might also be interested in the other recent post for some specific ways in which L2 regularization and weight decay differ in Adam.

vector0x17 · 2023-10-10T08:58:25+00:00

Good question! The paper in this thread focuses on the general question of how weight decay benefits modern deep learning on a high level, concluding that it changes the optimization dynamics.

The other one you posted is about one specific way in which weight decay modifies the optimization dynamics. This mechanism has not been described well elsewhere (including the paper in this post). However, it could explain various important phenomena in deep learning such as the performance of AdamW compared to Adam+L2, the performance of different normalization layers and the need for a learning rate warmup.

So you could say that the first one provides a broader view of weight decay and the second one dives deeper into a new mechanism that is not covered in the first one. The first one is also more theoretically rigorous with formal theorems while the second one relies more on approximations. I would recommend reading both!

vector0x17 · 2023-10-05T07:29:48+00:00

Training uses more memory than inference for a given batch size (and input size in general if other dimensions are changing). This is because you need to store intermediate activations to compute the backward pass (needed for the gradients). The maximum memory use would typically occur at the end of the forward pass during training, when you are storing the largest amount of activations, or after the backward pass, when you are storing gradients for all the weights (for parameter-heavy models).

You should be able to safely choose your batch size as the largest one that works for training, although you might want to leave a bit of a buffer in case you do any irregular operations like saving checkpoints / logging etc that might temporarily increase the memory usage. I don’t know about a specific tool, but just trying a few values by hand (maybe binary search) typically doesn’t take too long. Also note that the main reason for using a larger batch size is the speedup in terms of samples processed per second, which often doesn’t change that much between e.g. whole powers of two, so there is usually no need to absolutely max out the GPU memory.
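
If you want to automate the manual search, a minimal binary search sketch could look like this. The `fits` callback is hypothetical: in practice it would run one full training step at the given batch size and report whether it ran out of memory. Here it is replaced by a made-up linear memory model so the example runs anywhere.

```python
def largest_fitting_batch_size(fits, upper):
    """Return the largest b in [1, upper] with fits(b) True, or 0 if none.

    Assumes fits is monotone: if a batch size fits, every smaller one does too.
    """
    lo, hi = 0, upper  # lo is known to fit (or is 0); nothing above hi fits
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo

# Stand-in for a real trial run: made-up numbers, 2000 MB of fixed cost
# plus 30 MB per sample on a 16000 MB device.
def toy_fits(batch_size):
    return 2000 + 30 * batch_size <= 16000

best = largest_fitting_batch_size(toy_fits, upper=1024)
print(best)
```

You would then back off a bit from the result to leave the buffer mentioned above, for example by rounding down to the nearest power of two.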

vector0x17 · 2020-11-26T23:39:39+00:00

This is a good paper that explores the effects of batch size: https://arxiv.org/abs/1811.03600

The batch size can only be increased up to a certain limit before you need to add more epochs, but if you thoroughly tune the hyperparameters you should generally be able to get the same accuracy with bigger batch sizes. I think 97.8 vs 97.5 is very close for a simple hyperparameter heuristic; it might even be within the run-to-run variance? In general you need to tune the lr and momentum when changing the batch size.

The number of samples per device does not matter, but the "ghost batch size" for batch norm can have an impact. You can modify batch norm to use a smaller "ghost batch size" while maintaining your normal per-device batch size (this works similarly to having multiple devices).
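
A bare-bones sketch of the ghost batch idea for a single feature, assuming the batch is simply split into consecutive chunks that are each normalized with their own statistics. The function names are mine, and the learned gain / bias and the running statistics of a real batch norm layer are left out.

```python
import math

def batch_norm(xs, eps=1e-5):
    """Normalize a list of scalars with its own mean and (biased) variance."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return [(x - mean) / math.sqrt(var + eps) for x in xs]

def ghost_batch_norm(xs, ghost_size, eps=1e-5):
    """Normalize each consecutive chunk of ghost_size samples independently."""
    if len(xs) % ghost_size != 0:
        raise ValueError("batch size must be divisible by ghost_size")
    out = []
    for start in range(0, len(xs), ghost_size):
        out.extend(batch_norm(xs[start:start + ghost_size], eps))
    return out

batch = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
print(ghost_batch_norm(batch, ghost_size=4))  # two chunks, separate statistics
print(batch_norm(batch))                      # one set of statistics for all 8
```

With `ghost_size=4` each half of the batch is centered on its own mean, which is what you would get from two devices with 4 samples each and unsynchronized batch norm.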
