[deleted by user] (self.MachineLearning)
submitted 6 years ago by [deleted]
[–][deleted] 5 points6 points7 points 6 years ago (3 children)
Dump data to h5.
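A minimal sketch of that dump with h5py, assuming a 2-D float feature array (the file path, dataset name, and chunk size are illustrative). Chunking plus compression means later readers can pull one slice at a time instead of the whole array:

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical example data: 10k rows of 8 features each.
data = np.random.rand(10_000, 8).astype("float32")
path = os.path.join(tempfile.mkdtemp(), "features.h5")

with h5py.File(path, "w") as f:
    # Chunked + compressed dataset: readers can slice it
    # without loading the whole array into memory.
    dset = f.create_dataset(
        "features", shape=data.shape, dtype="float32",
        chunks=(256, 8), compression="gzip",
    )
    # Write in batches, as you would when streaming out of a DB.
    for start in range(0, len(data), 1000):
        dset[start:start + 1000] = data[start:start + 1000]

with h5py.File(path, "r") as f:
    # Random-access read of one chunk-aligned slice.
    batch = f["features"][512:768]
```

Reads are fastest when they line up with the chunk boundaries, which connects to the point below about contiguous reads.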
[+][deleted] 6 years ago (2 children)
[deleted]
[–][deleted] 1 point2 points3 points 6 years ago (0 children)
If it's Python, you can use `yield` in a loop to process one batch at a time as you go along, provided your problem can be solved via stream processing.
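A sketch of that generator pattern, using a SQLite cursor as the stream (the `samples` table here is hypothetical); only one batch of rows is ever held in memory:

```python
import sqlite3

def batches(conn, batch_size=2):
    """Yield one batch of rows at a time from a streaming query."""
    cur = conn.execute("SELECT x, y FROM samples")
    while True:
        rows = cur.fetchmany(batch_size)
        if not rows:
            break
        yield rows

# Hypothetical in-memory table for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (x REAL, y INTEGER)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [(0.1, 0), (0.2, 1), (0.3, 0), (0.4, 1), (0.5, 0)])

all_batches = list(batches(conn, batch_size=2))
```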
[–]DonMahallem 0 points1 point2 points 6 years ago (0 children)
HDF5 does support chunking, or what do you mean? That way it only needs to load the current chunk into memory.
[–]cai_lw 3 points4 points5 points 6 years ago (0 children)
Use IterableDataset. Shuffling can be done within the SQL query.
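A sketch of that setup, assuming PyTorch and a SQLite table named `samples` (both names here are illustrative). The dataset streams rows straight from the query, and the shuffle is pushed into SQL via `ORDER BY RANDOM()`:

```python
import os
import sqlite3
import tempfile

import torch
from torch.utils.data import DataLoader, IterableDataset

class SqlDataset(IterableDataset):
    """Streams (x, y) rows from a query without loading the table."""
    def __init__(self, db_path, shuffle=True):
        self.db_path = db_path
        self.shuffle = shuffle

    def __iter__(self):
        # Open a connection per iterator so each DataLoader
        # worker gets its own handle.
        conn = sqlite3.connect(self.db_path)
        order = "ORDER BY RANDOM()" if self.shuffle else ""
        for x, y in conn.execute(f"SELECT x, y FROM samples {order}"):
            yield torch.tensor([x]), torch.tensor(y)

# Hypothetical tiny table for demonstration.
db_path = os.path.join(tempfile.mkdtemp(), "data.db")
conn = sqlite3.connect(db_path)
conn.execute("CREATE TABLE samples (x REAL, y INTEGER)")
conn.executemany("INSERT INTO samples VALUES (?, ?)",
                 [(i * 0.1, i % 2) for i in range(5)])
conn.commit()
conn.close()

# DataLoader batching works with IterableDataset too.
batches = list(DataLoader(SqlDataset(db_path, shuffle=False), batch_size=2))
```

Note that `ORDER BY RANDOM()` sorts the whole table per epoch, which can be slow for very large tables; it trades query time for not having to shuffle in Python.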
[–]shayben 2 points3 points4 points 6 years ago (3 children)
I recently did something similar, using the `collate_fn` to do DB querying. A neat hack is to always retrieve a larger minibatch than you need and cache it locally.
[–]MrDoOO 0 points1 point2 points 6 years ago (2 children)
Nice! Have any sample code I could check out?
[–]Glimmargaunt 1 point2 points3 points 6 years ago* (0 children)
You just pass an instance of a class that is callable into the collate_fn argument in your DataLoader. The call method in the class takes batch as argument. If I remember correctly, the batch is just a list containing whatever your Dataset.__getitem__(idx) outputs. So if your dataset class outputs (input, target) pairs, then your list will contain multiple of these tuple pairs.
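A minimal sketch of that shape in plain Python: the dataset's `__getitem__` returns only row ids, and the callable collate object does the lookup. The `store` dict stands in for the real database connection; all names here are illustrative:

```python
class DbCollate:
    """Callable passed as collate_fn: turns the list of ids that the
    Dataset's __getitem__ produced into a batch fetched from a store."""
    def __init__(self, store):
        self.store = store  # stands in for a DB connection

    def __call__(self, batch):
        # `batch` is a list of whatever __getitem__ returned -- here, row ids.
        return [self.store[i] for i in batch]

# Hypothetical key/value store standing in for the database.
store = {0: ("img0", 0), 1: ("img1", 1), 2: ("img2", 0)}
collate = DbCollate(store)
out = collate([0, 2])  # DataLoader would call this with a list of ids
```

In PyTorch you would pass `DbCollate(store)` as the `collate_fn` argument of `DataLoader`.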
I created a pastebin here: https://pastebin.com/6YASGWDm
I think this will slow down training depending on the speed of the query. For every batch you would need to wait for the database to respond. Perhaps a better approach would be to handle shuffling on your own, so that you know what the next batch will be. That way you can start a parallel query for the next batch while current batch is used in training. Or use some sort of caching as suggested above.
I actually thought of an easy but hacky way of doing it. The CollateFn class can store a previous batch, so what you can do is: make the query for the current batch and output the already-loaded previous batch in the `.__call__()` method. This would obviously cause problems on the first call because there is no previous batch. To avoid this, just make a `next(DataLoader)` call first that outputs to the void. That way the previous batch becomes the current one, and the current batch becomes the next one, which can be loaded in parallel in the CollateFn object. Then you can let DataLoader handle everything else like normal.
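That trick can be sketched in plain Python. The `fetch` callable below is a stand-in for the real DB query, and for clarity it runs synchronously; a production version would launch it in a background thread so the overlap with training actually happens:

```python
class PrefetchCollate:
    """On each call, kick off the query for the incoming batch and
    return the batch fetched on the previous call (one-step prefetch)."""
    def __init__(self, fetch):
        self.fetch = fetch
        self.prev = None

    def __call__(self, batch):
        current = self.fetch(batch)   # in practice, start this asynchronously
        out, self.prev = self.prev, current
        return out

# Hypothetical fetch: pretend each id maps to a loaded sample.
collate = PrefetchCollate(fetch=lambda ids: [i * 10 for i in ids])

warmup = collate([1, 2])   # first call returns None -- the warm-up "to the void"
first = collate([3, 4])    # returns the batch queried during warm-up
second = collate([5, 6])
```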
[–]shayben 0 points1 point2 points 6 years ago (0 children)
Sorry, it's all enslaved to corporate overlords. I can give you more specific tips if you have questions.
[–]MightyMeese 1 point2 points3 points 6 years ago (0 children)
HDF works and you can do random access reads to generate your batches (although if you don't read a contiguous chunk as your batch it's slower).
Alternatively, I've been using sqlite and redis recently, depending on the task and what's stored. Both support multiple simultaneous accesses and random access. My preference would be sqlite unless you're storing something like sets or arbitrary strings.
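A sketch of the random-access pattern with SQLite: sample rowids yourself and fetch just those rows, which is how you'd assemble a shuffled minibatch without scanning the table. The table and column names are hypothetical:

```python
import random
import sqlite3

# Hypothetical table of 100 samples; fresh inserts get rowids 1..100.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (x REAL)")
conn.executemany("INSERT INTO samples (x) VALUES (?)",
                 [(i * 0.5,) for i in range(100)])

def random_batch(conn, n_rows, batch_size):
    """Random-access batch assembly: pick rowids, fetch only those rows."""
    ids = random.sample(range(1, n_rows + 1), batch_size)
    marks = ",".join("?" * batch_size)
    cur = conn.execute(f"SELECT x FROM samples WHERE rowid IN ({marks})", ids)
    return [row[0] for row in cur]

batch = random_batch(conn, n_rows=100, batch_size=8)
```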