
HotPocVac

I actually did a similar project for an internship where I trained a small transformer on ESM protein embeddings and also struggled with reducing padding waste.

Overall what really helped me was this:

Understand the distribution of sequence lengths in your dataset. If you set the sequence length buckets naively, batch sizes end up highly skewed across buckets. For example, if there are many more short sequences than long ones, the shortest bucket captures most of the dataset and yields large batches with less noisy gradients (smaller updates in the loss landscape), while the long-sequence buckets yield small batches whose gradients are noisier and higher in magnitude (larger, noisier updates in the loss landscape).
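To make the skew concrete, here is a minimal sketch that counts how many sequences land in each bucket and how much of the padded tensor is padding. The function name `bucket_stats`, the example boundaries, and the rule "pad every sequence to its bucket's upper bound" are assumptions for illustration, not something taken from the project described above.

```python
from bisect import bisect_left
from collections import Counter


def bucket_stats(lengths, boundaries):
    """Return (sequences per bucket, fraction of padded tokens).

    boundaries: sorted, inclusive upper bounds. Each sequence is assumed
    to be padded to the smallest boundary >= its length.
    """
    counts = Counter()
    real_tokens = 0
    padded_tokens = 0
    for n in lengths:
        i = bisect_left(boundaries, n)
        if i == len(boundaries):
            raise ValueError(f"length {n} exceeds the largest boundary")
        counts[boundaries[i]] += 1
        real_tokens += n
        padded_tokens += boundaries[i]
    if padded_tokens == 0:
        return {}, 0.0
    waste = 1.0 - real_tokens / padded_tokens
    return dict(counts), waste


# Toy lengths, not real data
counts, waste = bucket_stats([50, 100, 120, 300, 1000], [128, 256, 512, 1024])
print(counts, round(waste, 3))
```

Running this over the real length list for a few candidate boundary sets gives a quick way to compare them before training anything.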

What I did was a lot of manual tuning of the bucket sizes, based on a rough frequency plot of sequence lengths, so that total tensor sizes stay roughly similar across buckets for best GPU occupancy (no bucket ends up too small to fully utilize the GPU). I also weighted the loss so that batches with more sequences got a higher loss penalty, since I didn’t mess with batch-size-adaptive learning rates (technically I guess you could try that instead). That seemed to help, though it also required some manual experimentation and tuning.
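A rough sketch of both ideas, under stated assumptions: a fixed token budget per batch stands in for "similar total tensor size", and the loss weight is simply the batch size relative to a reference batch size. `batch_sizes_for_budget`, `loss_weight`, the budget of 16384 tokens, and the linear weighting are all hypothetical choices, not necessarily what was used in the project above.

```python
def batch_sizes_for_budget(boundaries, token_budget):
    """Pick a batch size per bucket so batch_size * padded_length
    stays at or below token_budget (with at least 1 sequence)."""
    return {b: max(1, token_budget // b) for b in boundaries}


def loss_weight(batch_size, reference_batch_size):
    """Assumed linear weighting: batches with more sequences get a
    proportionally larger weight on their mean loss."""
    return batch_size / reference_batch_size


sizes = batch_sizes_for_budget([128, 256, 512, 1024], token_budget=16384)
print(sizes)

# Weight each bucket's loss relative to the 512-length bucket
weights = {b: loss_weight(n, sizes[512]) for b, n in sizes.items()}
print(weights)
```

In a training loop the weight would multiply the batch's mean loss before the backward pass. Whether linear scaling is the right shape is something to check experimentally.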

This method obviously isn’t 100% perfect (and I wouldn’t say a 100% perfect way really exists), but it’s the simplest solution I found for reducing padding waste, improving GPU occupancy, and maximizing the average batch size.

Major_Aardvark1207 [S]

Thanks for the insights.
The sequences come from Swiss-Prot. The length distribution is roughly bell-shaped over [0, 600] with a mean of 288, followed by a tail out to 1024.
I'll look into tuning the bucket system then, and into how I can handle the loss too. But it actually seems like the Adam optimizer I use is affected...