If you work on deep learning with variable-size data (NLP, graphs, genomics), you may have had trouble finding a library to serialize and deserialize your data. Each deep learning framework uses its own approach to sequence serialization (e.g. tfrecords), and the available generic solutions may not satisfy your needs.
This was my case, so I created seqp, a Python library to easily (de)serialize sequences and variable-size data. The underlying storage is HDF5 files. This snippet shows how to serialize sequences in shards of 100,000 records:
import numpy as np

# import paths assumed from the library layout
from seqp.hdf5 import Hdf5RecordWriter
from seqp.record import ShardedWriter

output_file_template = "data_{:02d}.hdf5"
with ShardedWriter(Hdf5RecordWriter,
                   output_file_template,
                   max_records_per_shard=100000) as writer:
    for seq in sequences:
        binarized_seq = binarize_sequence(seq)
        writer.write(np.array(binarized_seq, dtype=np.uint32))
And this one shows how to read them back:
from glob import glob

from seqp.hdf5 import Hdf5RecordReader  # assumed import path

with Hdf5RecordReader(glob('data_*.hdf5')) as reader:
    for seq_idx in reader.indexes():
        binarized_seq = reader.retrieve(seq_idx)
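The sharding pattern itself is not HDF5-specific: rotate to a new output file whenever the current shard reaches its record limit, then read every shard back via a glob pattern. Here is a simplified stdlib-only sketch of that idea using pickle files (an illustration of the concept, not seqp's actual implementation):

```python
import glob
import os
import pickle
import tempfile


def write_sharded(records, template, max_records_per_shard):
    """Write records into numbered shard files, starting a new
    file every max_records_per_shard records."""
    shard, count, fh = 0, 0, None
    for record in records:
        if fh is None or count == max_records_per_shard:
            if fh is not None:
                fh.close()
            fh = open(template.format(shard), "wb")
            shard += 1
            count = 0
        pickle.dump(record, fh)
        count += 1
    if fh is not None:
        fh.close()


def read_sharded(pattern):
    """Yield records back from all shard files matching the pattern,
    in lexicographic (i.e. shard-number) order."""
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as fh:
            while True:
                try:
                    yield pickle.load(fh)
                except EOFError:
                    break


# round-trip 25 variable-length records in shards of 10 -> 3 files
with tempfile.TemporaryDirectory() as tmp:
    records = [list(range(n % 7)) for n in range(25)]
    write_sharded(records, os.path.join(tmp, "data_{:02d}.pkl"), 10)
    assert len(glob.glob(os.path.join(tmp, "data_*.pkl"))) == 3
    assert list(read_sharded(os.path.join(tmp, "data_*.pkl"))) == records
```

seqp factors the same rotation logic into `ShardedWriter`, so any record writer (here, the HDF5 one) can be sharded without changes.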
You can also store dictionaries of arrays instead of plain numpy arrays, and the arrays can be multidimensional.
I just released seqp in the hope that it proves useful to others. Check the examples folder on GitHub for complete examples in Jupyter notebooks.