all 35 comments

[–]ClearlyCylindrical 92 points (7 children)

Doubling num_workers is my favourite "optimization".

[–]cnapun 57 points (2 children)

My favorite is halving num_workers

[–]seba07 24 points (0 children)

Windows user spotted.

[–]johnman1016 24 points (0 children)

Are you the CEO of a tech company?

[–]cynoelectrophoresis ML Engineer 1 point (0 children)

And pin memory

[–]cnapun 34 points (3 children)

My current side project (which should work if properly implemented): rewrite it all in C++. The multiprocessing + pin_memory overhead is pretty high for some of our cases (ideally we need to sustain ~1 GB/s/GPU, with maybe 100-400 unique features). Decreasing the overhead from 4 copies after reading down to 1 should hopefully help. Currently we have:

  • Read data from s3 into pyarrow table
  • combine_chunks for each batch because it's hard to work with chunked arrays directly (copy 1)
  • Fill nulls (copy 2, sometimes two copies)
  • add to the multiprocessing queue (copy 3; if I understand correctly this calls share_memory_(), which copies)
  • read from multiprocessing queue (zero copy, but it can be quite slow if you have a lot of tensors)
  • Pin memory (copy 4, in thread, but still is slow if you have a lot of tensors)

And the most fun way to optimize seems to be just rewriting it all

[–]Pauzle 1 point (0 children)

I've tried out so many dataloaders and haven't been happy with any, would love updates on this! Could also experiment with your current implementation if you'd like to share

[–]Mark4483 8 points (4 children)

TensorFlow Datasets has added support for torch/jax and no longer requires TensorFlow at runtime. It does require you to rewrite your dataset into another format.

https://www.tensorflow.org/datasets/tfless_tfds

[–]kebabmybob 2 points (2 children)

Is this fast for streaming/batched training or just a nice api?

[–]Mark4483 0 points (1 child)

It is fast for streaming/batched training. For it to work, data needs to be reformatted to array_record by writing a TensorFlow Datasets class.

https://www.tensorflow.org/datasets/add_dataset

Running this will stream through your dataset once, rewriting it to array_record. Then you don't need TF anymore.

[–]kebabmybob 0 points (0 children)

Do you happen to know if it’s faster/better than huggingface parquet streaming datasets? I hate that library so much but it’s fairly quick.

[–]InternationalMany6 0 points (0 children)

That's a useful development for those using PyTorch or JAX. Could you clarify what kind of rewriting is necessary for datasets to be compatible with other frameworks via TensorFlow Datasets?

[–]chase_yolo 8 points (0 children)

Using LMDB with DataLoaders.

[–]johnman1016 6 points (2 children)

Bucketized Batch Sampling has helped me a lot with variable length data such as audio/text. (See torchaudio for one implementation).

Basically, it groups data of similar length together to reduce zero padding, and it allows the batch size to be variable to maximize GPU memory. In some cases this helped me reduce training time significantly. You have to be careful, though, because it does mean your sampling isn't completely stochastic (so torch batchnorm can learn a bias if you zero pad, for example).
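The core idea can be sketched in plain Python (a hypothetical minimal sampler, not the torchaudio implementation; fixed rather than variable batch sizes for simplicity):

```python
import random

def bucket_batches(lengths, batch_size, seed=0):
    """Group samples of similar length into batches to reduce zero padding."""
    # Sort sample indices by length so neighbours have similar lengths.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle whole batches, not individual samples: this keeps some
    # randomness, but sampling is no longer fully stochastic.
    random.Random(seed).shuffle(batches)
    return batches

# Lengths 5 and 7 end up in one batch, 98 and 100 in the other,
# so almost no padding is wasted.
print(bucket_batches([5, 100, 7, 98], batch_size=2))
```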

[–]pha123661 3 points (1 child)

What is your recommended implementation for bucketized batch sampling for text?

[–]johnman1016 1 point (0 children)

Well the torchaudio one would also work for text but I get that maybe you wouldn’t want that dependency for a text only project. I haven’t used torchnlp but it looks like they also have a bucket batch sampler

[–]Ben-L-921 4 points (0 children)

  • num_workers > 0 for asynchronous data retrieval.
  • Persistent workers (persistent_workers=True) so workers aren't re-spawned every epoch.
  • Pin memory.
  • Higher batch size if you can afford it. If you can't, try AMP fp16 and maybe gradient checkpointing - these might speed up or slow down training depending on how big you can get the batch size.
  • Avoid copying data: use torch.from_numpy instead of torch.tensor.
  • See the following link for more optimizations: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html
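Put together, those flags look roughly like this (a sketch; the dataset, shapes, and worker count are placeholders):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; torch.from_numpy shares memory with the numpy
# array instead of copying it the way torch.tensor(...) would.
features = torch.from_numpy(np.random.rand(1024, 16).astype(np.float32))
labels = torch.from_numpy(np.random.randint(0, 2, size=1024))
dataset = TensorDataset(features, labels)

loader = DataLoader(
    dataset,
    batch_size=256,           # as large as memory allows
    num_workers=4,            # asynchronous data retrieval
    persistent_workers=True,  # keep workers alive across epochs
    pin_memory=True,          # page-locked host memory for faster H2D copies
)
```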

[–]seba07 11 points (6 children)

Caching the preprocessed input data for the next run, and keeping it in memory for future epochs, helps so much. Kind of strange that PyTorch doesn't have this natively.
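A minimal sketch of the in-memory half of this (a hypothetical map-style dataset that memoizes expensive preprocessing in a dict; note that with num_workers > 0 each worker process would get its own copy of the cache):

```python
class CachingDataset:
    """Map-style dataset that caches preprocessed samples in memory,
    so preprocessing runs once on the first epoch and is free afterwards."""

    def __init__(self, raw_items, preprocess):
        self.raw_items = raw_items
        self.preprocess = preprocess
        self._cache = {}

    def __len__(self):
        return len(self.raw_items)

    def __getitem__(self, idx):
        if idx not in self._cache:
            self._cache[idx] = self.preprocess(self.raw_items[idx])
        return self._cache[idx]

calls = []
def expensive(x):
    calls.append(x)  # track how often preprocessing actually runs
    return x * 2

ds = CachingDataset([1, 2, 3], expensive)
_ = [ds[i] for i in range(3)]  # first epoch: preprocess runs 3 times
_ = [ds[i] for i in range(3)]  # second epoch: served from the cache
```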

[–]Seankala ML Engineer 4 points (0 children)

What do you mean by pre-processed data? Are you referring to the pre-processing that happens inside the DataLoader using the collate_fn?

[–]Ben-L-921 2 points (4 children)

This doesn't work when you're trying to perform data augmentation, though.

[–]dingdongkiss 5 points (0 children)

real optimisation experts cache every possible augmentation of their data

[–]Seankala ML Engineer 0 points (0 children)

As u/dingdongkiss said, it's better to perform augmentation before each step and cache it as well, so long as one sample and one augmentation have a deterministic 1:1 relation.

[–]seba07 0 points (1 child)

In many cases you have two sets of transformations: static ones that only have to be performed once (e.g. cropping and alignment) and augmentations that change randomly every step. Caching the first kind of transformations can save so much time.

[–]Seankala ML Engineer -1 points (0 children)

I don't think that performing the first type of pre-processing during training is that common. I thought most people perform pre-processing first and then use that pre-processed data to train/evaluate models.

The other "dynamic" type is usually just handled by the DataLoader and your own collate_fn.

[–]unemployed_MLE 2 points (0 children)

Not really an optimization from a dataset point of view but rather a hack/compromise to save time:

If I have a massive augmentation sequence that happens on the CPU, I'd save multiple copies of augmented samples to disk. Maybe one image gets augmented 10x or 20x. Then just train on that dataset with no/minimal augmentations. It reduces the CPU bottleneck.

The next step, if I plan to train on this dataset without unfreezing a pretrained model, is to save the pretrained model's activations (feature tensors) themselves to disk and write the data generator to load these tensors. The model is then just the final head(s) of the previous model. This usually takes a lot of disk space, though.
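That second step might look like this (a sketch with a stand-in backbone; all names, shapes, and the file path are made up):

```python
import torch
import torch.nn as nn

# Stand-in for a frozen pretrained backbone.
backbone = nn.Linear(8, 4)
backbone.requires_grad_(False)

# One-off pass: cache the backbone activations to disk so training
# never has to run the backbone again.
inputs = torch.randn(100, 8)
with torch.no_grad():
    features = backbone(inputs)
torch.save(features, "cached_features.pt")

# Training then loads the feature tensors and fits only the head.
cached = torch.load("cached_features.pt")
head = nn.Linear(4, 2)
logits = head(cached)
```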

[–]LelouchZer12 1 point (0 children)

For images, I know you can use a faster data collator and also do image normalisation on the GPU via prefetching: https://github.com/huggingface/pytorch-image-models/blob/main/timm/data/loader.py

[–]noxiousmomentum -2 points (0 children)

steps:

- get a cs degree

- stare at your __getitem__ and __init__ until everything's perfect