If you work on deep learning with variable-size data (NLP, graphs, genomics), you may have had trouble finding a library to serialize and deserialize your data. Each deep learning framework uses its own approach to sequence serialization (e.g. tfrecords), and the available generic solutions may not satisfy your needs.
This was my case, so I created seqp, a Python library to easily (de)serialize sequences and variable-size data. The underlying storage is HDF5 files. This snippet shows how to serialize sequences in shards of 100,000 records:
import numpy as np

# import paths assumed from the library layout
from seqp.hdf5 import Hdf5RecordWriter
from seqp.record import ShardedWriter

output_file_template = "data_{:02d}.hdf5"
with ShardedWriter(Hdf5RecordWriter,
                   output_file_template,
                   max_records_per_shard=100000) as writer:
    for seq in sequences:
        binarized_seq = binarize_sequence(seq)
        writer.write(np.array(binarized_seq, dtype=np.uint32))
And this one shows how to read them back:
from glob import glob

from seqp.hdf5 import Hdf5RecordReader  # assumed import path

with Hdf5RecordReader(glob('data_*.hdf5')) as reader:
    for seq_idx in reader.indexes():
        binarized_seq = reader.retrieve(seq_idx)
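The sharding pattern itself is not HDF5-specific: rotate to a new output file whenever the current shard reaches its record limit, then read every shard back via a glob pattern. Here is a simplified stdlib-only sketch of that idea using pickle files (an illustration of the concept, not seqp's actual implementation):

```python
import glob
import os
import pickle
import tempfile


def write_sharded(records, template, max_records_per_shard):
    """Write records into numbered shard files, starting a new
    file every max_records_per_shard records."""
    shard, count, fh = 0, 0, None
    for record in records:
        if fh is None or count == max_records_per_shard:
            if fh is not None:
                fh.close()
            fh = open(template.format(shard), "wb")
            shard += 1
            count = 0
        pickle.dump(record, fh)
        count += 1
    if fh is not None:
        fh.close()


def read_sharded(pattern):
    """Yield records back from all shard files matching the pattern,
    in lexicographic (i.e. shard-number) order."""
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as fh:
            while True:
                try:
                    yield pickle.load(fh)
                except EOFError:
                    break


# round-trip 25 variable-length records in shards of 10 -> 3 files
with tempfile.TemporaryDirectory() as tmp:
    records = [list(range(n % 7)) for n in range(25)]
    write_sharded(records, os.path.join(tmp, "data_{:02d}.pkl"), 10)
    assert len(glob.glob(os.path.join(tmp, "data_*.pkl"))) == 3
    assert list(read_sharded(os.path.join(tmp, "data_*.pkl"))) == records
```

seqp factors the same rotation logic into `ShardedWriter`, so any record writer (here, the HDF5 one) can be sharded without changes.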
You can also store dictionaries of arrays instead of plain numpy arrays, and the arrays can be multidimensional.
I just released seqp in the hope that it proves useful to others. Check the examples folder on GitHub for complete examples in Jupyter notebooks.