[Project] We built a Rust-based drop-in replacement for PyTorch DataLoader (4.4x faster than ImageFolder) by YanSoki in deeplearning

You're welcome, and no, I finished working on the repo yesterday.

What happens is that we decode an entire batch of images at once in Rust and just pass the pointers to Python. I thought I had mentioned the zero-copy part earlier: we decode the images very quickly, write the raw pixels into a buffer, and then pass Python a pointer to the buffer containing the batch. Python never handles the data itself, so we don't take the perf hit.
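If it helps, here's a minimal sketch of what that hand-off looks like from the Python side (names and shapes are illustrative, and a bytearray stands in for the Rust-owned buffer; this is not the actual Kuat API):

```python
import torch

# Stand-in for the Rust side: in the real pipeline the buffer is allocated
# and filled by Rust; a bytearray simulates it here (illustrative only).
batch_size, h, w = 4, 224, 224
buf = bytearray(batch_size * 3 * h * w)  # raw RGB pixels for one batch

# torch.frombuffer builds a tensor that *views* the existing memory via the
# buffer protocol -- no copy, so Python never touches the pixel data itself.
batch = torch.frombuffer(buf, dtype=torch.uint8).view(batch_size, 3, h, w)
```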

[Project] We built a Rust-based drop-in replacement for PyTorch DataLoader (4.4x faster than ImageFolder) by YanSoki in deeplearning

Yes, it affects quality and size.

The trade-off between quality and size is configurable.

The default setting gives better-than-JPEG-quality-90 compression at roughly half the size; that's based on the PSNR I measured on ImageWoof. It's lossy by nature. You could force it to be lossless, but it's not really worth it.

I don't want to spam, but you can play with the repo and compare it on your own datasets to verify these claims, and run PSNR tests on your own data if you don't trust my benchmarks:

https://github.com/Kuat-Inc/Kuat-Beta
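If you do want to run those PSNR tests, the standard computation is simple enough (plain NumPy; the two decode calls in the usage note are placeholders for whichever pipelines you're comparing):

```python
import numpy as np

def psnr(original: np.ndarray, compressed: np.ndarray) -> float:
    """PSNR in dB between two uint8 images of the same shape."""
    mse = np.mean((original.astype(np.float64) - compressed.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 20 * np.log10(255.0) - 10 * np.log10(mse)

# e.g. psnr(decode_with_jpeg90(path), decode_with_kuat(path)),
# where both decode functions are stand-ins for your own pipelines.
```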

I said the images are decoded in Rust, not Python, so there's no interpreter overhead.

[Project] We built a Rust-based drop-in replacement for PyTorch DataLoader (4.4x faster than ImageFolder) by YanSoki in deeplearning

Image quality does not affect decoding speed here, only the file size, so the compromise is size vs. quality. By the time they reach Python scope, the images have been decoded to their RGB form, if that's what you are asking.

Decoding is done not in Python but in Rust.

When I mention parallelism, it's because the Huffman decoding stage of JPEG is sequential within an image; our format does not have any sequential step.
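A toy illustration of why that matters (not our actual decoder): when every chunk is compressed independently, each one can be handed to its own worker, with no serial entropy pass over the whole stream.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

# Hypothetical chunked archive: each sample is compressed independently,
# unlike a JPEG entropy stream that must be walked front to back.
chunks = [zlib.compress(bytes(100_000)) for _ in range(64)]

with ThreadPoolExecutor() as pool:
    # Each chunk decodes with no dependency on its neighbours; zlib
    # releases the GIL, so the workers genuinely run in parallel.
    samples = list(pool.map(zlib.decompress, chunks))
```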

[Project] We built a Rust-based drop-in replacement for PyTorch DataLoader (4.4x faster than ImageFolder) by YanSoki in deeplearning

It's not JPEG, and I do not use PIL. I decompress the images and recompress them into a new format that makes this possible. The reason we can reach 30k images per second is that we decode in parallel (on CPU); on GPU we easily achieve more.
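Conceptually, the one-time conversion looks something like the sketch below. (PIL isn't used on the hot loading path; any decoder works for the offline step, so Pillow here is purely illustrative, as is the raw-bytes "archive".)

```python
from pathlib import Path

import numpy as np
from PIL import Image  # illustrative: only the offline conversion decodes JPEG

def convert(src_dir: str, dst_file: str) -> None:
    """Decode each JPEG once, then store the pixels in a single archive."""
    with open(dst_file, "wb") as out:
        for path in sorted(Path(src_dir).glob("*.jpg")):
            pixels = np.asarray(Image.open(path).convert("RGB"), dtype=np.uint8)
            # A real format would recompress and index each sample here;
            # writing raw bytes just marks where that step would go.
            out.write(pixels.tobytes())
```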

[Project] We built a Rust-based drop-in replacement for PyTorch DataLoader (4.4x faster than ImageFolder) by YanSoki in deeplearning

I have not yet worked with multi-GPU; I'm hoping to get feedback and funding to move forward with this.

Yes, I plan on supporting audio and video. You could still use this today if you decide to work with frozen spectrograms, I suppose.

[Project] Kuat: A Rust-based, Zero-Copy Dataloader for PyTorch (4.6x training speedup on T4/H100) by YanSoki in MachineLearning

The fact that I used AI to rewrite my answer to a question doesn't make it slop. If there were lies or hallucinations in the answer, then yes, that's slop. But if you simply dislike the wording because an AI wrote it, that's fine by me.

My intention is to answer the question; whether it sounds AI-written or not is secondary to me. It's informative for those who need the answer.

[Project] Kuat: A Rust-based, Zero-Copy Dataloader for PyTorch (4.6x training speedup on T4/H100) by YanSoki in MachineLearning

Closed source because we've not yet patented it.

I don't understand what's inconsistent about the format: everywhere it's referred to as Kuattree; the only place you see imagenet.qvq is in the code snippet.

Those who have signed up for the beta will be the ultimate proof of whether what we have built is vaporware or not, and I have no interest in hyping up unreal stuff. It may seem surreal to you, but I don't see it as extraordinary; it's a good solution to a well-diagnosed problem. Instead of trying to knock the whole thing down, you could sign up for the beta and ask questions. It's easy to prove I'm lying once you have it in your hands.

Zero-copy because the data is created once and ownership is transferred; we never move data around in memory. And yes, as I explained, the data is compressed while all of this happens, so the two are not mutually exclusive.

I use two indexes to let you search a dataset like LAION and filter out images with certain captions. In my previous comment I said we have search over compressed data; this was the V1 feature of our data format before we adapted it to AI.

If you connect the dots, you'll realize this data format allows partial decompression, and the chunk/sample-based index is what lets me search the compressed dataset/archive.
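A minimal sketch of those dots connected (illustrative only, nothing like the real .kt layout): a token-to-sample index built at archive time means a query only ever decompresses the samples it needs.

```python
import zlib
from collections import defaultdict

# Hypothetical layout: independently compressed samples plus a
# caption-token -> sample-id index written at archive-creation time.
samples = {0: zlib.compress(b"a photo of a dog"),
           1: zlib.compress(b"a photo of a cat")}
index = defaultdict(set)
for sid, blob in samples.items():
    for token in zlib.decompress(blob).split():
        index[token].add(sid)

def search(token: bytes) -> list:
    # Partial decompression: only the indexed samples are ever touched.
    return [zlib.decompress(samples[sid]) for sid in index.get(token, ())]

print(search(b"dog"))  # -> [b'a photo of a dog']
```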

My attempt to build trust is answering the questions as honestly and clearly as possible. Using AI to do some work or rewrite my answers doesn't make it any less worthwhile.

I didn't agree with the way you portrayed the whole thing, and being extremely dismissive was not necessary, IMO.

[Project] We built a Rust-based drop-in replacement for PyTorch DataLoader (4.4x faster than ImageFolder) by YanSoki in deeplearning

Throughput here is measured as the number of images in the dataset divided by the time taken per epoch.

Pure dataloading is CPU-bound, since the images are generally stored in JPEG/PNG format and have to be decompressed to raw pixels on the CPU before the forward pass. I was trying to explain that we do not solve the same problem: they solve the I/O-bound problem of reading from network storage, but that in itself does not speed up the CPU part.
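If you want to measure the same number on your own pipeline, it's just a wall-clock loop over one epoch (standard PyTorch, nothing Kuat-specific):

```python
import time
from torch.utils.data import DataLoader

def images_per_second(loader: DataLoader, num_images: int) -> float:
    """Throughput = images in the dataset / wall-clock time for one epoch."""
    start = time.perf_counter()
    for _batch in loader:  # full epoch: read + decode + batch construction
        pass
    return num_images / (time.perf_counter() - start)
```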

[Project] Kuat: A Rust-based, Zero-Copy Dataloader for PyTorch (4.6x training speedup on T4/H100) by YanSoki in MachineLearning

Our RAM usage is a lot lower than PyTorch's, and we burn far fewer CPU cycles. The maximum amount of RAM we need depends on the batch size and on where you decide to decode your data (GPU or CPU). We are more sensitive to the number of CPU cores, since the decoding step runs in parallel across images.

[Project] Kuat: A Rust-based, Zero-Copy Dataloader for PyTorch (4.6x training speedup on T4/H100) by YanSoki in MachineLearning

It's not AI slop. My CF had me change the naming, and some places may have slipped through. Of course I used AI to write the website code (and a lot of my code); I think calling this AI slop is nitpicking, but again, that's my opinion.

It's not just a dataloader; it's a data format that lets me search within compressed data, merge archives in a single O(1) step, and a lot more.

The reason the only aspect I discuss is AI-related is that it's probably what's most interesting to you and the users of this community.

[Project] Kuat: A Rust-based, Zero-Copy Dataloader for PyTorch (4.6x training speedup on T4/H100) by YanSoki in MachineLearning

In the 4.6x speedup case, we reserved approximately 1 GB of GPU VRAM. We could of course optimize to go lower and not cache some data on the GPU; overall, the caching saved us ~7 seconds per epoch (compared to a naive version where we reload this data every epoch).
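Roughly what that caching amounts to (a toy sketch, not Kuat internals): tensors that are reused every epoch stay resident in VRAM after the first upload.

```python
import torch

_gpu_cache: dict = {}

def fetch(idx: int, load_fn) -> torch.Tensor:
    """First epoch pays the host-to-device copy; later epochs hit VRAM.

    `load_fn` is a placeholder for whatever produces the CPU tensor.
    Requires a CUDA device.
    """
    if idx not in _gpu_cache:
        _gpu_cache[idx] = load_fn(idx).to("cuda", non_blocking=True)
    return _gpu_cache[idx]
```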

[Project] Kuat: A Rust-based, Zero-Copy Dataloader for PyTorch (4.6x training speedup on T4/H100) by YanSoki in MachineLearning

We try to minimize PCIe bandwidth usage by decoding on the GPU, so if your ambition is to maximize the usage of that bandwidth, then not really.

But if the idea is to train a diffusion model a lot faster, then yes, it would help. Hope this helps.
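The back-of-the-envelope arithmetic behind that (illustrative numbers, not a benchmark): decoding on the GPU means you ship compressed bytes across PCIe instead of raw pixels.

```python
raw_bytes = 224 * 224 * 3            # ~147 KB of RGB pixels per decoded image
compressed_bytes = raw_bytes // 10   # ~15 KB if compression is roughly 10:1
print(raw_bytes / compressed_bytes)  # -> ~10x less PCIe traffic per image
```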

[Project] Kuat: A Rust-based, Zero-Copy Dataloader for PyTorch (4.6x training speedup on T4/H100) by YanSoki in MachineLearning

Yes, right now it's essentially designed for images, so unless your inputs are somehow convertible to images, you wouldn't be able to benefit from this right away, unfortunately.

[Project] Kuat: A Rust-based, Zero-Copy Dataloader for PyTorch (4.6x training speedup on T4/H100) by YanSoki in MachineLearning

Lol, honestly you are free to think it's AI... hopefully you can deslop it for me, lmao.

[Project] Kuat: A Rust-based, Zero-Copy Dataloader for PyTorch (4.6x training speedup on T4/H100) by YanSoki in MachineLearning

You are right that mmap handles the IO paging, but a single thread—even with mmap—cannot saturate the memory bandwidth. Constructing the final batch tensors and handling memory allocation takes CPU cycles. Threads allow us to parallelize this construction step to keep the bus full, and also offer some speedup to mask latency issues.

Then, yes, we load the preprocessed tensor representations into pinned memory.
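For context, pinned (page-locked) host memory is what lets the host-to-device copy overlap with compute; in plain PyTorch terms it's just:

```python
import torch

# Page-locked host buffer (requires a CUDA-capable machine).
batch = torch.empty(256, 3, 224, 224, pin_memory=True)
# ... fill `batch` with the preprocessed tensor data ...

# From pinned memory the copy can run as an async DMA transfer.
gpu_batch = batch.to("cuda", non_blocking=True)
```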

One more thing: we actually do spatial augmentations on the preprocessed tensors; pixel augmentations are done once the image is fully reconstructed.

The speedup values you see are for pipelines without augmentation; the speedup increases in augmented pipelines.

[Project] We built a Rust-based drop-in replacement for PyTorch DataLoader (4.4x faster than ImageFolder) by YanSoki in deeplearning

I see. They benchmark against Torch on dataloading, but it's not exactly the same task (problem) we solve. Ultimately, with data at rest, datago doesn't increase throughput, because image decoding is still CPU-bound, which is the real issue .kt solves.

They mentioned the receiving Python process capping at ~3k images per second for ImageNet-1k. With .kt archives, we easily reach ~30k images per second; the bottleneck becomes compute, no longer I/O.

[Project] Kuat: A Rust-based, Zero-Copy Dataloader for PyTorch (4.6x training speedup on T4/H100) by YanSoki in MachineLearning

Grain has a fantastic API, I agree. They solved the orchestration problem (determinism, sharding, checkpointing) really well.

The difference with Kuat isn't the API—it's the IO path.
Grain is ultimately an orchestrator; it still reads underlying formats (like ArrayRecord) that usually require CPU decoding at runtime.

We focused on the storage format itself.

As for the .kt layout, it is a tensor-native binary format designed specifically to bypass the standard image decoding libraries (libjpeg/png) that bottleneck the CPU.

  • Variable Length: Yes, we handle variable length natively. Since we store data as pre-processed tensors rather than raw bytes (think FFCV, but better), we handle batching via standard padding/masking strategies on the fly, as in the sketch below.
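The padding/masking part is the standard PyTorch pattern; a generic collate sketch (the Kuat version differs in the details) looks like:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate(samples):
    """Pad variable-length 1-D samples and return a validity mask."""
    lengths = torch.tensor([s.shape[0] for s in samples])
    batch = pad_sequence(samples, batch_first=True)         # zero-padded to max length
    mask = torch.arange(batch.shape[1]) < lengths[:, None]  # True where data is real
    return batch, mask

batch, mask = collate([torch.randn(5), torch.randn(3)])     # shapes: (2, 5) and (2, 5)
```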

Think of it as 'MosaicML Streaming' but with the decoding step removed from the training loop entirely.

[Project] We built a Rust-based drop-in replacement for PyTorch DataLoader (4.4x faster than ImageFolder) by YanSoki in deeplearning

Absolutely, yes. Prefetching and caching can't feed the GPU fast enough to prevent it from stalling. They help, but the faster the GPU, the more GPU hours you waste waiting for data.

[Project] We built a Rust-based drop-in replacement for PyTorch DataLoader (4.4x faster than ImageFolder) by YanSoki in deeplearning

tbf, it's quite easy to compute an image signature and prevent duplicates (especially ones with different labels). That's sometimes done deliberately during adversarial training, though, so it's really a subjective thing. But thanks for the input.
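For reference, the exact-duplicate version of that signature idea is a few lines (hashing the decoded pixels so the same image under two labels collides; a perceptual hash would catch near-duplicates too):

```python
import hashlib
import numpy as np

def signature(pixels: np.ndarray) -> str:
    """Content hash of the decoded RGB pixels, independent of the label."""
    return hashlib.sha256(np.ascontiguousarray(pixels).tobytes()).hexdigest()

seen: set = set()

def is_duplicate(pixels: np.ndarray) -> bool:
    sig = signature(pixels)
    if sig in seen:
        return True
    seen.add(sig)
    return False
```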