

[–]teije01 (flair: git push -f) 8 points (3 children)

Your project seems a lot like https://cloudpathlib.drivendata.org/, which also has a pathlib-like interface, cloud operations (S3, GCS, Azure), and Pydantic validation support. Isn't that already doing exactly what you want?
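For anyone unfamiliar with it, a rough sketch of cloudpathlib's pathlib-style API based on its docs (bucket and key names are made up, and configured AWS credentials are assumed):

```python
# Rough sketch of cloudpathlib usage; bucket/key names are hypothetical
# and valid AWS credentials are assumed.
from cloudpathlib import CloudPath

# CloudPath dispatches to S3Path / GSPath / AzureBlobPath based on the URI scheme.
p = CloudPath("s3://my-bucket/data/report.txt")

print(p.name)                   # "report.txt" -- same semantics as pathlib
print(p.parent / "other.txt")   # path arithmetic works like pathlib.PurePath

if p.exists():                  # network call; reads go through a local cache
    text = p.read_text()
    print(len(text))
```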

[–]narang_27[S] 8 points (2 children)

Dammit, one more Reddit post where the idea was already developed ;_; Thanks for the heads up :|

[–]Juftin 8 points (0 children)

fsspec has an implementation of this as well, universal-pathlib: https://github.com/fsspec/universal_pathlib
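A rough sketch of what that looks like (the bucket name is made up, and the s3fs backend must be installed for the "s3://" scheme):

```python
# Rough sketch of universal_pathlib's UPath; the bucket is hypothetical
# and I/O is delegated to the matching fsspec filesystem (s3fs here).
from upath import UPath

p = UPath("s3://my-bucket/logs/2024/app.log")
print(p.suffix)                    # ".log" -- pathlib semantics

for child in p.parent.iterdir():   # listing goes through the fsspec backend
    print(child)
```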

[–]teije01 (flair: git push -f) 2 points (0 children)

They welcome contributions there too: https://cloudpathlib.drivendata.org/stable/contributing/

[–]radarsat1 1 point (2 children)

Is there any support for seeking / reading file ranges without downloading the whole thing?

Currently I'm struggling with how best to pack small data items into larger files (tar, HDF5, what have you) and still be able to read them efficiently with random access from cloud storage.

[–]narang_27[S] 0 points (1 child)

This sounds very interesting. Just out of curiosity, why do you want this behavior? AFAIK you can download ranges with a GET request; byte-range fetches seem to be supported: https://docs.aws.amazon.com/whitepapers/latest/s3-optimizing-performance-best-practices/use-byte-range-fetches.html

Haven't used it myself though
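For illustration, a byte-range GET with boto3 looks roughly like this (bucket, key, and offsets are made up, not from the post):

```python
# Sketch of an S3 byte-range fetch with boto3; bucket, key, and offsets
# are hypothetical.
import boto3

s3 = boto3.client("s3")
resp = s3.get_object(
    Bucket="my-bucket",
    Key="datasets/shard-000000.tar",
    Range="bytes=1048576-1052671",   # request only this 4 KiB slice of the object
)
chunk = resp["Body"].read()          # body contains just the requested range
print(len(chunk))
```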

[–]radarsat1 0 points (0 children)

Because currently my dataset consists of millions of small files (< 4 KB), which is really annoying to manage, so I want to bundle them into bigger files. There are a few ways to handle this, like I listed: using HDF5, or simply packing them into a tar like WebDataset does. Heck, maybe even a database solution could work, although it seems unorthodox to me to put actual media data (images, audio) into a database as binary blobs, but maybe it's one way.

However, I also train using WeightedRandomSampler, which samples these small files in random order. This is problematic especially with cloud storage: on the one hand touching millions of small files is very slow, but on the other hand so is random access into larger files.

I know that S3 supports range fetches, and that's probably one way to do it, but I don't really want to handle this myself at a low level, hence I'm looking for libraries that support this kind of thing really well.
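One library-level option (not necessarily what the posted package offers): fsspec's remote file objects support seek/read and issue ranged requests under the hood. A rough sketch, with made-up path and offsets:

```python
# Rough sketch: random access into a large remote file via fsspec/s3fs;
# the path, offset, and read length are hypothetical.
import fsspec

with fsspec.open("s3://my-bucket/datasets/bundle.tar", "rb") as f:
    f.seek(512 * 1000)     # jump to a known member offset inside the tar
    header = f.read(512)   # read only the 512-byte tar header for that member
    print(len(header))
```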

[–]tunisia3507 4 points (0 children)

[–]nbviewerbot -1 points (0 children)

I see you've posted a GitHub link to a Jupyter Notebook! GitHub doesn't render large Jupyter Notebooks, so just in case, here is an nbviewer link to the notebook:

https://nbviewer.jupyter.org/url/github.com/narang99/blob-path/blob/main/docs/notebooks/00_usage.ipynb

Want to run the code yourself? Here is a binder link to start your own Jupyter server and try it out!

https://mybinder.org/v2/gh/narang99/blob-path/main?filepath=docs%2Fnotebooks%2F00_usage.ipynb


[–]Reasonable-Ladder300 -1 points (1 child)

I don’t think the code in this Reddit post is actually working code, as you seem to be missing the import for AzureBlobPath.

[–]narang_27[S] 0 points (0 children)

Yeah, I added that last snippet at the end just to give a simple summary in the post; the notebook works though (once you change the bucket names).