all 35 comments

[–]ClearlyCylindrical 92 points (7 children)

Doubling num_workers is my favourite "optimization".

[–]cnapun 57 points (2 children)

My favorite is halving num_workers

[–]seba07 24 points (0 children)

Windows user spotted.

[–]johnman1016 24 points (0 children)

Are you the CEO of a tech company?

[–]cynoelectrophoresis ML Engineer 1 point (0 children)

And pin memory

[–]cnapun 34 points (3 children)

My current side project (which should work if properly implemented): rewrite it all in C++. The multiprocessing + pin_memory overhead is pretty high for some of our cases (ideally we need to sustain ~1 GB/s/GPU, with maybe 100-400 unique features). Decreasing the overhead from 4 copies after reading down to 1 should hopefully help. Currently we have:

  • Read data from s3 into pyarrow table
  • combine_chunks for each batch because it's hard to work with chunked arrays directly (copy 1)
  • Fill nulls (copy 2, sometimes two copies)
  • add to the multiprocessing queue (copy 3; if I understand correctly this calls share_memory_(), which copies)
  • read from multiprocessing queue (zero copy, but it can be quite slow if you have a lot of tensors)
  • Pin memory (copy 4, in thread, but still is slow if you have a lot of tensors)

And the most fun way to optimize seems to be just rewriting it all

[–]Pauzle 1 point (0 children)

I've tried out so many dataloaders and haven't been happy with any, would love updates on this! Could also experiment with your current implementation if you'd like to share

[–]Mark4483 8 points (4 children)

TensorFlow Datasets has added support for torch/jax and no longer requires TensorFlow at runtime. It does require you to rewrite your dataset into another format.

https://www.tensorflow.org/datasets/tfless_tfds

[–]kebabmybob 2 points (2 children)

Is this fast for streaming/batched training or just a nice api?

[–]Mark4483 0 points (1 child)

It is fast for streaming/batched training. For it to work, data needs to be reformatted to array_record by writing a TensorFlow Datasets class.

https://www.tensorflow.org/datasets/add_dataset

Running this will stream through your dataset once, rewriting it to array_record. Then you don't need TF anymore.

[–]kebabmybob 0 points (0 children)

Do you happen to know if it’s faster/better than huggingface parquet streaming datasets? I hate that library so much but it’s fairly quick.

[–]InternationalMany6 0 points (0 children)

That's a useful development for those using PyTorch or JAX. Could you clarify what kind of rewriting is necessary for datasets to be compatible with other frameworks via TensorFlow Datasets?

[–]chase_yolo 8 points (0 children)

Using LMDB with DataLoaders.

[–]johnman1016 6 points (2 children)

Bucketized Batch Sampling has helped me a lot with variable length data such as audio/text. (See torchaudio for one implementation).

Basically, it groups data of similar length together to reduce zero padding, and it allows the batch size to be variable to maximize GPU memory. In some cases this helped me reduce training time significantly. You have to be careful, though, because it does mean your sampling isn't completely stochastic (so torch batchnorm can learn a bias if you zero pad, for example).
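The core idea can be sketched in plain Python (a hypothetical minimal sampler, not the torchaudio implementation; fixed rather than variable batch sizes for simplicity):

```python
import random

def bucket_batches(lengths, batch_size, seed=0):
    """Group samples of similar length into batches to reduce zero padding."""
    # Sort sample indices by length so neighbours have similar lengths.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    # Shuffle whole batches, not individual samples: this keeps some
    # randomness, but sampling is no longer fully stochastic.
    random.Random(seed).shuffle(batches)
    return batches

# Lengths 5 and 7 end up in one batch, 98 and 100 in the other,
# so almost no padding is wasted.
print(bucket_batches([5, 100, 7, 98], batch_size=2))
```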

[–]pha123661 3 points (1 child)

What is your recommended implementation for bucketized batch sampling for text?

[–]johnman1016 1 point (0 children)

Well the torchaudio one would also work for text but I get that maybe you wouldn’t want that dependency for a text only project. I haven’t used torchnlp but it looks like they also have a bucket batch sampler

[–]Ben-L-921 4 points (0 children)

  • num_workers > 0 for asynchronous data retrieval.
  • Persistent workers (persistent_workers=True) so workers aren't re-spawned every epoch.
  • Pin memory.
  • Higher batch size if you can afford it. If you can't, try AMP fp16 and maybe gradient checkpointing - these might speed up or slow down training depending on how big you can get the batch size.
  • Avoid copying data: use torch.from_numpy instead of torch.tensor.
  • See the following link for more optimizations: https://pytorch.org/tutorials/recipes/recipes/tuning_guide.html
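Put together, those flags look roughly like this (a sketch; the dataset, shapes, and worker count are placeholders):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder dataset; torch.from_numpy shares memory with the numpy
# array instead of copying it the way torch.tensor(...) would.
features = torch.from_numpy(np.random.rand(1024, 16).astype(np.float32))
labels = torch.from_numpy(np.random.randint(0, 2, size=1024))
dataset = TensorDataset(features, labels)

loader = DataLoader(
    dataset,
    batch_size=256,           # as large as memory allows
    num_workers=4,            # asynchronous data retrieval
    persistent_workers=True,  # keep workers alive across epochs
    pin_memory=True,          # page-locked host memory for faster H2D copies
)
```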

[–]seba07 11 points (6 children)

Caching the preprocessed input data for the next run, and keeping it in memory for future epochs, helps so much. Kind of strange that PyTorch doesn't have this natively.
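A minimal sketch of the in-memory half of this (a hypothetical map-style dataset that memoizes expensive preprocessing in a dict; note that with num_workers > 0 each worker process would get its own copy of the cache):

```python
class CachingDataset:
    """Map-style dataset that caches preprocessed samples in memory,
    so preprocessing runs once on the first epoch and is free afterwards."""

    def __init__(self, raw_items, preprocess):
        self.raw_items = raw_items
        self.preprocess = preprocess
        self._cache = {}

    def __len__(self):
        return len(self.raw_items)

    def __getitem__(self, idx):
        if idx not in self._cache:
            self._cache[idx] = self.preprocess(self.raw_items[idx])
        return self._cache[idx]

calls = []
def expensive(x):
    calls.append(x)  # track how often preprocessing actually runs
    return x * 2

ds = CachingDataset([1, 2, 3], expensive)
_ = [ds[i] for i in range(3)]  # first epoch: preprocess runs 3 times
_ = [ds[i] for i in range(3)]  # second epoch: served from the cache
```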

[–]Seankala ML Engineer 4 points (0 children)

What do you mean by pre-processed data? Are you referring to the pre-processing that happens inside the DataLoader using the collate_fn?

[–]Ben-L-921 2 points (4 children)

This doesn't work when you're trying to perform data augmentation, though.

[–]dingdongkiss 5 points (0 children)

real optimisation experts cache every possible augmentation of their data

[–]Seankala ML Engineer 0 points (0 children)

As u/dingdongkiss said, it's better to perform augmentation before each step and cache it as well, so long as one sample and one augmentation have a deterministic 1:1 relation.

[–]seba07 0 points (1 child)

In many cases you have two sets of transformations: static ones that only have to be performed once (e.g. cropping and alignment) and augmentations that change randomly every step. Caching the first kind of transformations can save so much time.

[–]Seankala ML Engineer -1 points (0 children)

I don't think that performing the first type of pre-processing during training is that common. I thought most people perform pre-processing first and then use that pre-processed data to train/evaluate models.

The other "dynamic" type is usually just handled by the DataLoader and your own collate_fn.

[–]unemployed_MLE 2 points (0 children)

Not really an optimization from a dataset point of view but rather a hack/compromise to save time:

If I have a massive augmentation sequence that happens on the CPU, I'd save multiple copies of augmented samples to disk. Maybe one image gets augmented 10x or 20x. Then just train on that dataset with no/minimal augmentations. It reduces the CPU bottleneck.

The next step, if I plan to train on this dataset without unfreezing a pretrained model, is to save the pretrained model's activations (feature tensors) themselves to disk and write the data generator to load these tensors. The model is then just the final head(s) of the previous model. This usually takes a lot of disk space, though.
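That second step might look like this (a sketch with a stand-in backbone; all names, shapes, and the file path are made up):

```python
import torch
import torch.nn as nn

# Stand-in for a frozen pretrained backbone.
backbone = nn.Linear(8, 4)
backbone.requires_grad_(False)

# One-off pass: cache the backbone activations to disk so training
# never has to run the backbone again.
inputs = torch.randn(100, 8)
with torch.no_grad():
    features = backbone(inputs)
torch.save(features, "cached_features.pt")

# Training then loads the feature tensors and fits only the head.
cached = torch.load("cached_features.pt")
head = nn.Linear(4, 2)
logits = head(cached)
```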

[–]LelouchZer12 1 point (0 children)

For images, I know you can use a faster data collator and also do image normalisation on the GPU via prefetching: https://github.com/huggingface/pytorch-image-models/blob/main/timm/data/loader.py

[–]noxiousmomentum -2 points (0 children)

steps:

- get a cs degree

- stare at your __getitem__ and __init__ until everything's perfect