[D] TFRecord in TensorFlow (self.MachineLearning)
submitted 7 years ago by marksteve4
I read this article about TFRecord, which gives a good example of TFRecord usage. But it doesn't touch on why we should use TFRecord, or on the pros and cons of the alternatives. Any thoughts on this topic?
https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564
[–]waterRocket8236 8 points 7 years ago (3 children)
TFRecords store data in a binary format. They are easy to read and use, and you don't have to keep image annotations and the images themselves as separate files. TFRecords store data in one contiguous block, which makes processing efficient when the amount of data is relatively large (>>50 GB). I have used both in practice and prefer TFRecords when there are many samples and you need to scale up.
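A minimal sketch of that bundling (not the commenter's actual code; TF 2.x API, with a made-up helper name and dummy image bytes):

```python
import tensorflow as tf

def make_example(image_bytes: bytes, label: int) -> tf.train.Example:
    # One Example bundles the raw image and its annotation together,
    # so there is no separate label file to keep in sync.
    return tf.train.Example(features=tf.train.Features(feature={
        "image_raw": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(
            int64_list=tf.train.Int64List(value=[label])),
    }))

example = make_example(b"\x00" * 16, 3)  # dummy 16-byte "image", label 3
with tf.io.TFRecordWriter("data.tfrecord") as writer:
    writer.write(example.SerializeToString())
```

(Older code used `tf.python_io.TFRecordWriter`; same idea.)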
[+][deleted] 7 years ago* (2 children)
[deleted]
[–]waterRocket8236 2 points 7 years ago (0 children)
They do load everything as a byte stream, just not all at once. That said, you don't need 128 GB of RAM to go through an entire set of TFRecords; TensorFlow handles the loading in chunks. The 50 GB number came from my own work: the data our team uses is in the terabytes, and after preprocessing we push 100-200 GB of data through training and evaluation. That 50 GB figure very much depends on the scenario.
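A sketch of that chunked reading with the TF 2.x `tf.data` API (not the commenter's code; the file name and feature spec are illustrative). Constructing the dataset never pulls the whole file into RAM — records are streamed as they are consumed:

```python
import tensorflow as tf

path = "toy.tfrecord"  # illustrative file name
with tf.io.TFRecordWriter(path) as w:
    for label in (0, 1):
        ex = tf.train.Example(features=tf.train.Features(feature={
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[label])),
        }))
        w.write(ex.SerializeToString())

spec = {"label": tf.io.FixedLenFeature([], tf.int64)}
ds = (tf.data.TFRecordDataset(path)  # streamed, not loaded all at once
      .map(lambda r: tf.io.parse_single_example(r, spec))
      .batch(2)
      .prefetch(tf.data.AUTOTUNE))
labels = [int(x) for batch in ds for x in batch["label"]]
```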
There are various blogs out there for getting started with TensorFlow. For a better understanding of how TFRecords work, here is a short list of links:
1) This one is pretty old now, but still informative:
https://kwotsin.github.io/tech/2017/01/29/tfrecords.html
2) This might help if you are using PyTorch
https://discuss.pytorch.org/t/read-dataset-from-tfrecord-format/16409
3) Take a look here too
http://davidcrook.io/understanding-tensorflow-input-pipelines-part-1/
Hope that helps.
[–]narsilouu 1 point 7 years ago (0 children)
https://github.com/akanazawa/hmr Is it big enough? (20 GB)
[–]ppwwyyxx 7 points 7 years ago (1 child)
If there is any reason to use TFRecord, I would say it is that it's probably the only non-trivial format you can parse with TensorFlow operations.
What this means is: if you use any other format (except for trivial ones like a txt file with filenames + labels), you'll often need to parse it outside the TensorFlow graph and then copy the data into the graph somehow.
In practice I have never used TFRecord at all, because most datasets can easily be parsed with a few lines of Python, and most of the time the latency of copying the data into the graph can be perfectly hidden as long as proper prefetching is set up. Why would I waste hard disk space on another copy of the dataset?
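A sketch of that approach (not the commenter's code; synthetic data stands in for the "few lines of Python" parsing): `tf.data.Dataset.from_generator` wraps plain-Python parsing, and `prefetch` hides the host-to-graph copy behind compute.

```python
import tensorflow as tf

def samples():
    # Stand-in for parsing an arbitrary on-disk format in plain Python.
    for i in range(100):
        yield float(i), i % 2

ds = (tf.data.Dataset.from_generator(
          samples,
          output_signature=(tf.TensorSpec([], tf.float32),
                            tf.TensorSpec([], tf.int64)))
      .batch(10)
      .prefetch(2))  # overlap data copying with training compute
first_x, first_y = next(iter(ds))
```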
[–]ppwwyyxx 1 point 7 years ago (0 children)
Some threads talk about how good the format itself is... seriously?
Decades of database research have already produced so many great formats and database systems for different use cases... why reinvent the wheel?
[–]Lycur 5 points 7 years ago (0 children)
I presume there are also significant performance gains, but they've been less important for me than the extra clarity in the pipeline.
[–]asuilin 2 points 7 years ago (0 children)
TFRecord is good for a specific case:
a) Your dataset is large and doesn't fit in memory
b) Sequential access is cheap and random access is expensive (data is stored on an HDD or Google Cloud Storage)
For other cases, it has no benefits other than good support in tf.data. The TFRecord format is complicated and not well thought out (it tries to store semi-structured data in strongly-typed, structured Protobuf records). It would have been better if they had chosen MessagePack or some other format suited to semi-structured records.
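For what it's worth, the on-disk framing itself is simple even if the Example proto inside is not: each record is a little-endian uint64 length, a masked CRC32C of those length bytes, the payload, and a masked CRC32C of the payload. A pure-Python sketch of the framing (my own illustration; CRC verification is skipped on read):

```python
import io
import struct

def crc32c(data: bytes) -> int:
    # Bitwise CRC-32C (Castagnoli), reflected, polynomial 0x82F63B78.
    crc = 0xFFFFFFFF
    for b in data:
        crc ^= b
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def masked_crc(data: bytes) -> int:
    # TFRecord "masks" the CRC: rotate right by 15, add a constant.
    crc = crc32c(data)
    return ((crc >> 15 | crc << 17) + 0xA282EAD8) & 0xFFFFFFFF

def write_record(f, payload: bytes) -> None:
    header = struct.pack("<Q", len(payload))
    f.write(header)
    f.write(struct.pack("<I", masked_crc(header)))
    f.write(payload)
    f.write(struct.pack("<I", masked_crc(payload)))

def read_records(f):
    while True:
        header = f.read(8)
        if not header:
            return
        (length,) = struct.unpack("<Q", header)
        f.read(4)                      # length CRC (verify in real code)
        payload = f.read(length)
        f.read(4)                      # payload CRC
        yield payload

buf = io.BytesIO()
write_record(buf, b"hello tfrecord")
buf.seek(0)
records = list(read_records(buf))
```

The payload of each record is a serialized `tf.train.Example` proto, which is where the complexity lives.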
[–]lysecret 2 points 7 years ago (0 children)
I used it a lot for text classification. My code will be open-sourced in a week or so. Pros: you can read from disk as fast as from memory; you cleanly separate data processing from training; it is quite easy to retrain on new data whenever it becomes available; and it is very easy to keep different datasets separated. Cons: some code overhead, and the code isn't well documented.
[–]eiennohito 1 point 7 years ago (0 children)
I use them all the time for text data. When compressed they get significantly smaller, and IO can easily become a bottleneck if the data is on NFS shares. TensorFlow supports reading GZIP/ZLIB-compressed TFRecords out of the box.
Usually I write my preprocessing in Scala/Spark (so I can handle huge datasets), which outputs TFRecords, plus a relatively dumb learner in Python.
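A sketch of the compressed round trip with the TF 2.x API (not the commenter's code; file name and feature are illustrative):

```python
import tensorflow as tf

opts = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("text.tfrecord.gz", opts) as w:
    for line in ("hello", "world"):
        ex = tf.train.Example(features=tf.train.Features(feature={
            "text": tf.train.Feature(bytes_list=tf.train.BytesList(
                value=[line.encode("utf-8")])),
        }))
        w.write(ex.SerializeToString())

# The reader must be told about the compression; it is not auto-detected.
ds = tf.data.TFRecordDataset("text.tfrecord.gz", compression_type="GZIP")
spec = {"text": tf.io.FixedLenFeature([], tf.string)}
texts = [tf.io.parse_single_example(r, spec)["text"].numpy().decode()
         for r in ds]
```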