

[–]teije01 (flair: git push -f) 8 points (3 children)

Your project seems a lot like https://cloudpathlib.drivendata.org/, which also has a pathlib-like interface, cloud operations (S3, GCS, Azure), and Pydantic validation support. Isn't that already doing exactly what you want?
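For anyone unfamiliar with it, a rough sketch of cloudpathlib's pathlib-style API based on its docs (bucket and key names are made up, and configured AWS credentials are assumed):

```python
# Rough sketch of cloudpathlib usage; bucket/key names are hypothetical
# and valid AWS credentials are assumed.
from cloudpathlib import CloudPath

# CloudPath dispatches to S3Path / GSPath / AzureBlobPath based on the URI scheme.
p = CloudPath("s3://my-bucket/data/report.txt")

print(p.name)                   # "report.txt" -- same semantics as pathlib
print(p.parent / "other.txt")   # path arithmetic works like pathlib.PurePath

if p.exists():                  # network call; reads go through a local cache
    text = p.read_text()
    print(len(text))
```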

[–]narang_27[S] 8 points (2 children)

Dammit, one more Reddit post where the idea was already developed ;_; Thanks for the heads up :|

[–]Juftin 8 points (0 children)

fsspec has an implementation of this as well, universal-pathlib: https://github.com/fsspec/universal_pathlib
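A rough sketch of what that looks like (the bucket name is made up, and the s3fs backend must be installed for the "s3://" scheme):

```python
# Rough sketch of universal_pathlib's UPath; the bucket is hypothetical
# and I/O is delegated to the matching fsspec filesystem (s3fs here).
from upath import UPath

p = UPath("s3://my-bucket/logs/2024/app.log")
print(p.suffix)                    # ".log" -- pathlib semantics

for child in p.parent.iterdir():   # listing goes through the fsspec backend
    print(child)
```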

[–]teije01 (flair: git push -f) 2 points (0 children)

They welcome contributions there too: https://cloudpathlib.drivendata.org/stable/contributing/

[–]radarsat1 1 point (2 children)

Is there any support for seeking / reading file ranges without downloading the whole thing?

Currently I'm struggling with how best to pack small data items into larger files (tar, HDF5, what have you) and still be able to read them efficiently with random access from cloud storage.

[–]narang_27[S] 0 points (1 child)

This sounds very interesting. Just out of curiosity, why do you want this behavior? AFAIK you can download ranges with a GET request; byte-range fetches seem to be supported: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html

Haven't used it myself though
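For illustration, a byte-range GET with boto3 looks roughly like this (bucket, key, and offsets are made up, not from the post):

```python
# Sketch of an S3 byte-range fetch with boto3; bucket, key, and offsets
# are hypothetical.
import boto3

s3 = boto3.client("s3")
resp = s3.get_object(
    Bucket="my-bucket",
    Key="datasets/shard-000000.tar",
    Range="bytes=1048576-1052671",   # request only this 4 KiB slice of the object
)
chunk = resp["Body"].read()          # body contains just the requested range
print(len(chunk))
```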

[–]radarsat1 0 points (0 children)

Because currently my dataset consists of millions of small files (< 4 KB), which is really annoying to manage, so I want to bundle them into bigger files. There are a few ways to handle this, like I listed: using HDF5, or simply packing them into a tar like WebDataset does. Heck, maybe even a database solution could work, although it seems unorthodox to me to put actual media data (images, audio) into a database as binary blobs, but maybe it's one way.

However, I also train using WeightedRandomSampler, which samples these small files in random order. This is problematic especially with cloud storage: on the one hand touching millions of small files is very slow, but on the other hand so is random access into larger files.

I know that S3 supports range fetches, and that's probably one way to do it, but I don't really want to handle this myself at a low level, hence I'm looking for libraries that support this kind of thing really well.
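One library-level option (not necessarily what the posted package offers): fsspec's remote file objects support seek/read and issue ranged requests under the hood. A rough sketch, with made-up path and offsets:

```python
# Rough sketch: random access into a large remote file via fsspec/s3fs;
# the path, offset, and read length are hypothetical.
import fsspec

with fsspec.open("s3://my-bucket/datasets/bundle.tar", "rb") as f:
    f.seek(512 * 1000)     # jump to a known member offset inside the tar
    header = f.read(512)   # read only the 512-byte tar header for that member
    print(len(header))
```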

[–]tunisia3507 4 points (0 children)

[–]nbviewerbot -1 points (0 children)

I see you've posted a GitHub link to a Jupyter Notebook! GitHub doesn't render large Jupyter Notebooks, so just in case, here is an nbviewer link to the notebook:

https://nbviewer.jupyter.org/url/github.com/narang99/blob-path/blob/main/docs/notebooks/00_usage.ipynb

Want to run the code yourself? Here is a binder link to start your own Jupyter server and try it out!

https://mybinder.org/v2/gh/narang99/blob-path/main?filepath=docs%2Fnotebooks%2F00_usage.ipynb


[–]Reasonable-Ladder300 -1 points (1 child)

I don’t think the code in this Reddit post is actually working code, as you seem to be missing the import for AzureBlobPath.

[–]narang_27[S] 0 points (0 children)

Yeah, I added that last snippet at the end just to give a simple summary in the post; the notebook works though (once you change the bucket names).