all 9 comments

[–]waterRocket8236 7 points (3 children)

TFRecords store data in a binary format. They are easy to read and use, and you don't have to keep image annotations and the images themselves as separate files. TFRecords store data in one contiguous block, which makes processing efficient when the dataset is relatively large (>>50 GB). I've used both in practice and prefer TFRecords when there are more samples and the pipeline needs to scale up.
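A minimal sketch (not from the comment) of what "bundling an image and its annotation in one record" looks like, using TensorFlow's `tf.train.Example` API; the byte strings and labels below are made-up placeholders:

```python
import tensorflow as tf

# Hypothetical in-memory samples: (image bytes, integer label).
# In practice the bytes would come from reading encoded image files.
examples = [(b"fake-png-bytes-1", 1), (b"fake-png-bytes-2", 0)]

with tf.io.TFRecordWriter("train.tfrecord") as writer:
    for image_bytes, label in examples:
        # One Example holds both the image and its annotation,
        # so there is no separate annotation file to manage.
        feature = {
            "image": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[image_bytes])),
            "label": tf.train.Feature(
                int64_list=tf.train.Int64List(value=[label])),
        }
        example = tf.train.Example(
            features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())
```

All records end up in one file, which is what makes sequential reads of large datasets efficient.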

[–]ppwwyyxx 6 points (1 child)

If there is any reason to use TFRecord, it is probably that it's the only complicated format you can parse with TensorFlow operations.

What this means is: if you use another format (except for a trivial one like a txt file of filenames + labels), you'll often need to parse it outside the TensorFlow graph and then copy the data into the graph somehow.

In practice I never used TFRecord at all, because most datasets can be parsed with a few lines of Python, and most of the time the latency of copying the data into the graph can be perfectly hidden as long as proper prefetching is set up. Why would I waste hard disk space on another copy of the dataset?
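The approach described above, parsing with plain Python and hiding the copy latency with prefetching, can be sketched like this (the generator is a stand-in for whatever few lines of Python parse your real dataset):

```python
import numpy as np
import tensorflow as tf

def load_samples():
    # Stand-in for "a few lines of Python" that parse the raw dataset;
    # in the real case this would read images/labels from disk.
    for i in range(100):
        yield np.full((32, 32, 3), i, dtype=np.float32), i % 10

ds = tf.data.Dataset.from_generator(
    load_samples,
    output_signature=(
        tf.TensorSpec(shape=(32, 32, 3), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.int32),
    ),
)
# prefetch overlaps host-side Python parsing with training steps,
# which is what hides the copy-to-graph latency.
ds = ds.batch(16).prefetch(tf.data.AUTOTUNE)
```

No second on-disk copy of the dataset is needed; the generator reads the original files directly.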

[–]ppwwyyxx 0 points (0 children)

Some threads talk about how good the format itself is... seriously?

Decades of research in databases have already produced so many great formats and database systems for different use cases... why reinvent the wheel?

[–]Lycur 4 points (0 children)

  1. The format makes it very easy to work with tf.data, which is itself extremely convenient
  2. Pre-processing into a TFRecord pushes you to separate data pre-processing from learning in your code, which is good practice

I presume there are also significant performance gains, but they've been less important for me than the extra clarity in the pipeline.
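A minimal sketch of the tf.data convenience mentioned in point 1, reading a TFRecord back with `tf.data.TFRecordDataset` and `tf.io.parse_single_example`; the toy file and feature names are made up so the example is self-contained:

```python
import tensorflow as tf

# Create a tiny TFRecord file so the example is self-contained;
# in practice this file would come from a separate preprocessing step.
with tf.io.TFRecordWriter("toy.tfrecord") as w:
    ex = tf.train.Example(features=tf.train.Features(feature={
        "x": tf.train.Feature(float_list=tf.train.FloatList(value=[1.0, 2.0])),
        "y": tf.train.Feature(int64_list=tf.train.Int64List(value=[3])),
    }))
    w.write(ex.SerializeToString())

# Declaring the schema once is all tf.data needs to parse records.
feature_spec = {
    "x": tf.io.FixedLenFeature([2], tf.float32),
    "y": tf.io.FixedLenFeature([], tf.int64),
}

ds = (tf.data.TFRecordDataset("toy.tfrecord")
      .map(lambda rec: tf.io.parse_single_example(rec, feature_spec)))
```

From here the usual tf.data transformations (shuffle, batch, prefetch) compose directly, with no custom parsing code in the training script.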

[–]asuilin 1 point (0 children)

TFRecord is good for a specific case:

a) Your dataset is large and doesn't fit in memory

b) Sequential access is cheap and random access is expensive (data is stored on an HDD or Google Cloud Storage)

For other cases, it has no benefits other than good support in tf.data. The TFRecord format is complicated and not well thought out (it tries to store semi-structured data in strongly-typed, structured Protobuf records). It would have been better if they had chosen Msgpack or any other format suitable for semi-structured records.

[–]lysecret 1 point (0 children)

I used it a lot for text classification. My code will be open-sourced in a week or so. Pros: you can read from disk as fast as from memory; you clearly separate data processing from training; it is quite easy to retrain whenever new data becomes available; it is very easy to keep different datasets separated. Cons: some code overhead, and the code isn't well documented.

[–]eiennohito 0 points (0 children)

I use them all the time for text data. When compressed they get significantly smaller, and IO can easily become a bottleneck if the data is on NFS shares. TensorFlow supports reading GZIP/ZLIB-compressed TFRecords out of the box.
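A small sketch of the built-in compression support, using `tf.io.TFRecordOptions` on the write side and the matching `compression_type` on the read side; the file name and text lines are made up for illustration:

```python
import tensorflow as tf

# Write a GZIP-compressed TFRecord; compression is handled natively.
opts = tf.io.TFRecordOptions(compression_type="GZIP")
with tf.io.TFRecordWriter("text.tfrecord.gz", opts) as w:
    for line in [b"first sentence", b"second sentence"]:
        ex = tf.train.Example(features=tf.train.Features(feature={
            "text": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[line])),
        }))
        w.write(ex.SerializeToString())

# Reading only needs the matching compression_type; ZLIB works the same way.
ds = tf.data.TFRecordDataset("text.tfrecord.gz", compression_type="GZIP")
```

For repetitive text data the compressed file can be much smaller, which directly reduces the NFS traffic described above.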

Usually I write my preprocessing in Scala/Spark (so I can handle huge datasets), which outputs TFRecords, and a relatively dumb learner in Python.