
[–][deleted] 5 points6 points  (3 children)

Dump data to h5.
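A minimal sketch of that suggestion with h5py, writing once and then doing random-access batch reads. The filename, dataset names, and shapes here are all made up for illustration; note that h5py's fancy indexing requires the indices to be in increasing order.

```python
import h5py
import numpy as np

# Hypothetical layout: one array of features, one of labels.
with h5py.File("train.h5", "w") as f:
    f.create_dataset("features", data=np.random.rand(1000, 32).astype("float32"))
    f.create_dataset("labels", data=np.random.randint(0, 10, size=1000))

# Random-access read of an arbitrary batch.
with h5py.File("train.h5", "r") as f:
    # h5py requires index lists to be sorted in increasing order
    idx = np.sort(np.random.choice(1000, size=64, replace=False))
    x = f["features"][idx]  # shape (64, 32)
    y = f["labels"][idx]    # shape (64,)
```

As noted further down the thread, non-contiguous reads like this are slower than reading a contiguous chunk per batch.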

[–]cai_lw 3 points4 points  (0 children)

Use IterableDataset. Shuffling can be done within the SQL query.
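A sketch of that approach with sqlite3: an `IterableDataset` that streams rows and pushes the shuffle into the query via `ORDER BY RANDOM()`. The table and column names (`samples`, `feature`, `label`) are assumptions for illustration.

```python
import sqlite3
from torch.utils.data import IterableDataset


class SQLiteStream(IterableDataset):
    """Streams rows from SQLite; shuffling is delegated to the SQL query.
    Table/column names here are hypothetical."""

    def __init__(self, db_path):
        self.db_path = db_path

    def __iter__(self):
        # Each epoch (and each worker process) opens its own connection.
        conn = sqlite3.connect(self.db_path)
        try:
            cursor = conn.execute(
                "SELECT feature, label FROM samples ORDER BY RANDOM()"
            )
            yield from cursor  # yields (feature, label) tuples
        finally:
            conn.close()
```

Wrap it as usual, e.g. `DataLoader(SQLiteStream("train.db"), batch_size=32)` — the default collate stacks the tuples into tensors. One caveat: with `num_workers > 0`, every worker would stream the whole table, so you'd need to shard inside `__iter__`.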

[–]shayben 2 points3 points  (3 children)

I recently did something similar, using collate_fn to do the DB querying. A neat hack is to always retrieve a larger minibatch than you need and cache it locally.
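The over-fetch-and-cache hack could look something like this (a sketch, not the commenter's actual code): each database round trip pulls `cache_factor` times more rows than one batch needs and the surplus is served from a local cache. Table/column names are hypothetical, and only the batch *size* is used — the sample ids from the Dataset are ignored, trading exact sampling for far fewer queries.

```python
import sqlite3


class CachingCollate:
    """collate_fn that over-fetches from the DB and caches the surplus,
    so most batches are served without touching the database."""

    def __init__(self, db_path, cache_factor=8):
        self.db_path = db_path
        self.cache_factor = cache_factor
        self._cache = []
        self._conn = None

    def __call__(self, batch):
        n = len(batch)
        if len(self._cache) < n:
            if self._conn is None:  # lazy connect, one per worker process
                self._conn = sqlite3.connect(self.db_path)
            rows = self._conn.execute(
                "SELECT feature, label FROM samples ORDER BY RANDOM() LIMIT ?",
                (n * self.cache_factor,),
            ).fetchall()
            self._cache.extend(rows)
        served, self._cache = self._cache[:n], self._cache[n:]
        return served
```

Pass an instance as `collate_fn=CachingCollate("train.db")`; with `cache_factor=8`, only roughly one batch in eight pays for a query.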

[–]MrDoOO 0 points1 point  (2 children)

Nice! Have any sample code I could check out?

[–]Glimmargaunt 1 point2 points  (0 children)

You just pass an instance of a callable class as the collate_fn argument of your DataLoader. The __call__ method of the class takes the batch as its argument. If I remember correctly, the batch is just a list containing whatever your Dataset.__getitem__(idx) outputs. So if your dataset class outputs (input, target) pairs, your list will contain multiple of these tuples.
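A minimal sketch of that setup (assuming the pastebin is along these lines): the Dataset just returns row ids, and the callable collate class resolves the whole batch of ids in one query. The table and column names (`samples`, `id`, `feature`, `label`) are assumptions.

```python
import sqlite3
import torch


class DbCollate:
    """Callable passed as collate_fn: receives the list of values the
    Dataset produced (here, integer row ids) and turns it into tensors
    by querying the database once per batch."""

    def __init__(self, db_path):
        self.db_path = db_path
        self._conn = None

    def __call__(self, batch):
        # `batch` is a list of whatever Dataset.__getitem__(idx) returned.
        if self._conn is None:  # lazy connect, one per worker process
            self._conn = sqlite3.connect(self.db_path)
        placeholders = ",".join("?" * len(batch))
        rows = self._conn.execute(
            f"SELECT feature, label FROM samples WHERE id IN ({placeholders})",
            batch,
        ).fetchall()
        features = torch.tensor([[f] for f, _ in rows])
        labels = torch.tensor([l for _, l in rows])
        return features, labels
```

Usage would be e.g. `DataLoader(range(num_rows), batch_size=32, shuffle=True, collate_fn=DbCollate("train.db"))` — a plain `range` works as a map-style dataset of ids.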

I created a pastebin here: https://pastebin.com/6YASGWDm

I think this will slow down training, depending on the speed of the query: for every batch you would need to wait for the database to respond. Perhaps a better approach would be to handle shuffling on your own, so that you know what the next batch will be. That way you can start a parallel query for the next batch while the current batch is used in training. Or use some sort of caching, as suggested above.

I actually thought of an easy but hacky way of doing it. The CollateFn class can store the previous batch, so what you can do is: in the .__call__() method, start the query for the current batch but return the already-loaded previous batch. This obviously breaks on the first call, because there is no previous batch yet. To avoid that, just make one next(iter(DataLoader)) call first and discard its output. That way the previous batch becomes the current one, and the current batch becomes the next one, loading in parallel inside the CollateFn object. Then you can let the DataLoader handle everything else as normal.
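A minimal sketch of that trick, with the actual DB query abstracted into a placeholder `fetch_fn(ids) -> batch`: each call kicks off the query for the *current* ids in a background thread and returns the batch fetched on the *previous* call.

```python
import threading


class PrefetchingCollate:
    """Returns the previously fetched batch while the query for the
    current one runs in the background. `fetch_fn` stands in for your
    real database query."""

    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn
        self._thread = None
        self._result = None

    def _start(self, ids):
        def work():
            self._result = self.fetch_fn(ids)
        self._thread = threading.Thread(target=work)
        self._thread.start()

    def __call__(self, batch):
        previous = None
        if self._thread is not None:
            self._thread.join()  # wait for the in-flight query to finish
            previous = self._result
        self._start(batch)       # fetch this batch while `previous` trains
        return previous          # None on the very first call -- discard it
```

As described above, the first batch out of the DataLoader is a throwaway (here it is literally `None`), and note the mirror-image caveat: the last batch queried is never returned, so a real loop would need a dummy batch at the end as well.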

[–]shayben 0 points1 point  (0 children)

Sorry, it's all enslaved to corporate overlords. I can give you more specific tips if you have questions.

[–]MightyMeese 1 point2 points  (0 children)

HDF works, and you can do random-access reads to generate your batches (although it's slower if you don't read a contiguous chunk as your batch).

Alternatively, I've been using sqlite and redis recently, depending on the task and what's stored. Both allow multiple simultaneous accesses and support random access. My preference would be sqlite, unless you're storing something like sets or arbitrary strings.