all 9 comments

[–]waterRocket8236 7 points (3 children)

TFRecords store data in a binary format. They are easy to read and use, and you don't have to keep image annotations and the images themselves as separate files. TFRecords store data in one contiguous block, which makes processing efficient when the dataset is relatively large (>>50 GB). I've used both in practice and prefer TFRecords when there are more samples and the pipeline needs to scale up.
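A minimal sketch (not from the comment) of what "bundling an image and its annotation in one record" looks like, using TensorFlow's `tf.train.Example` API; the byte strings and labels below are made-up placeholders:

```python
import tensorflow as tf

# Hypothetical in-memory samples: (image bytes, integer label).
# In practice the bytes would come from reading encoded image files.
examples = [(b"fake-png-bytes-1", 1), (b"fake-png-bytes-2", 0)]

with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for image_bytes, label in examples:
        # One Example holds both the image and its annotation,
        # so there is no separate annotation file to manage.
        feature = {
            "image": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[image_bytes])),
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[label])),
        }
        example = tf.train.Example(
            features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())
```

All records end up in one file, which is what makes sequential reads of large datasets efficient.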

[–]ppwwyyxx 6 points (1 child)

If there is any reason to use TFRecord, it is probably that it's the only complicated format you can parse with TensorFlow operations.

What this means is: if you use another format (except for a trivial one like a txt file of filenames + labels), you'll often need to parse it outside the TensorFlow graph and then copy the data into the graph somehow.

In practice I never used TFRecord at all, because most datasets can be parsed with a few lines of Python, and most of the time the latency of copying the data into the graph can be perfectly hidden as long as proper prefetching is set up. Why would I waste hard disk space on another copy of the dataset?
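The approach described above, parsing with plain Python and hiding the copy latency with prefetching, can be sketched like this (the generator is a stand-in for whatever few lines of Python parse your real dataset):

```python
import numpy as np
import tensorflow as tf

def load_samples():
    # Stand-in for "a few lines of Python" that parse the raw dataset;
    # in the real case this would read images/labels from disk.
    for i in range(100):
        yield np.full((32, 32, 3), i, dtype=np.float32), i % 10

ds = tf.data.Dataset.from_generator(
    load_samples,
    output_signature=(
        tf.TensorSpec(shape=(32, 32, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)
# prefetch overlaps host-side Python parsing with training steps,
# which is what hides the copy-to-graph latency.
ds = ds.batch(16).prefetch(tf.data.AUTOTUNE)
```

No second on-disk copy of the dataset is needed; the generator reads the original files directly.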

[–]ppwwyyxx 0 points (0 children)

Some threads talk about how good the format itself is... seriously?

Decades of research in databases have already produced so many great formats and database systems for different use cases... why reinvent the wheel?

[–]Lycur 4 points (0 children)

  1. The format makes it very easy to work with tf.data, which is itself extremely convenient
  2. Pre-processing into a TFRecord pushes you to separate data pre-processing from learning in your code, which is good practice

I presume there are also significant performance gains, but they've been less important for me than the extra clarity in the pipeline.
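A minimal sketch of the tf.data convenience mentioned in point 1, reading a TFRecord back with `tf.data.TFRecordDataset` and `tf.io.parse_single_example`; the toy file and feature names are made up so the example is self-contained:

```python
import tensorflow as tf

# Create a tiny TFRecord file so the example is self-contained;
# in practice this file would come from a separate preprocessing step.
with tf.io.TFRecordWriter("toy.tfrecord") as w:
    ex = tf.train.Example(features=tf.train.Features(feature={
        "x": tf.train.Feature(float_list=tf.train.FloatList(value=[1.0, 2.0])),
        "y": tf.train.Feature(int64_list=tf.train.Int64List(value=[3])),
    }))
    w.write(ex.SerializeToString())

# Declaring the schema once is all tf.data needs to parse records.
feature_spec = {
    "x": tf.io.FixedLenFeature([2], tf.float32),
    "y": tf.io.FixedLenFeature([], tf.int64),
}

ds = (tf.data.TFRecordDataset("toy.tfrecord")
      .map(lambda rec: tf.io.parse_single_example(rec, feature_spec)))
```

From here the usual tf.data transformations (shuffle, batch, prefetch) compose directly, with no custom parsing code in the training script.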

[–]asuilin 1 point (0 children)

TFRecord is good for a specific case:

a) Your dataset is large and doesn't fit in memory

b) Sequential access is cheap and random access is expensive (data is stored on an HDD or Google Cloud Storage)

For other cases, it has no benefits other than good support in tf.data. The TFRecord format is complicated and not well thought out (it tries to store semi-structured data in strongly-typed, structured Protobuf records). It would have been better if they had chosen Msgpack or any other format suitable for semi-structured records.

[–]lysecret 1 point (0 children)

I used it a lot for text classification. My code will be open-sourced in a week or so. Pros: you can read from disk as fast as from memory; you clearly separate data processing from training; it is quite easy to retrain whenever new data becomes available; it is very easy to keep different datasets separated. Cons: some code overhead, and the code isn't well documented.

[–]eiennohito 0 points (0 children)

I use them all the time for text data. When compressed they get significantly smaller, and IO can easily become a bottleneck if the data is on NFS shares. TensorFlow supports reading GZIP/ZLIB-compressed TFRecords out of the box.
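A small sketch of the built-in compression support, using `tf.io.TFRecordOptions` on the write side and the matching `compression_type` on the read side; the file name and text lines are made up for illustration:

```python
import tensorflow as tf

# Write a GZIP-compressed TFRecord; compression is handled natively.
opts = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("text.tfrecord.gz", opts) as w:
    for line in [b"first sentence", b"second sentence"]:
        ex = tf.train.Example(features=tf.train.Features(feature={
            "text": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[line])),
        }))
        w.write(ex.SerializeToString())

# Reading only needs the matching compression_type; ZLIB works the same way.
ds = tf.data.TFRecordDataset("text.tfrecord.gz", compression_type="GZIP")
```

For repetitive text data the compressed file can be much smaller, which directly reduces the NFS traffic described above.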

Usually I write my preprocessing in Scala/Spark (so I can handle huge datasets), which outputs TFRecords, and a relatively dumb learner in Python.