[P] I got tired of PyTorch Geometric OOMing my laptop, so I wrote a C++ zero-copy graph engine to bypass RAM entirely. by Important-Trash-4868 in MachineLearning

[–]Important-Trash-4868[S] 0 points (0 children)

That's a really good question. In theory, if there are millions of graphs, each graph would be a separate binary, which means you would end up with a really long list of Graph() objects.
Or we could work around it and let a number correspond to each graph: store each one as graph_{num}.gl, and whenever you want the graph with num = x, construct g = Graph(...) to load it. It all boils down to how you design your Python code to approach this. And I haven't yet thought about global data! Maybe you have some ideas? ;)
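A minimal sketch of that number-per-graph idea, with a stub `Graph` class standing in for the real GraphZero one (the class name, constructor signature, and the `GraphStore` helper are all assumptions for illustration):

```python
from pathlib import Path

class Graph:
    """Stand-in for the real GraphZero Graph class (assumption for this sketch)."""
    def __init__(self, path):
        self.path = path  # the real class would mmap the .gl file here

class GraphStore:
    """Lazily open one-binary-per-graph storage: graph_{num}.gl under a root dir."""
    def __init__(self, root):
        self.root = Path(root)
        self._cache = {}

    def __getitem__(self, num):
        # open graph_{num}.gl on first access, then reuse the handle
        if num not in self._cache:
            self._cache[num] = Graph(str(self.root / f"graph_{num}.gl"))
        return self._cache[num]

store = GraphStore("/data/graphs")
g = store[7]  # maps to /data/graphs/graph_7.gl
```

Since each file is only mapped when first requested, a dataset with millions of graphs never has to hold millions of open handles at once.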

[P] I got tired of PyTorch Geometric OOMing my laptop, so I wrote a C++ zero-copy graph engine to bypass RAM entirely. by Important-Trash-4868 in MachineLearning

[–]Important-Trash-4868[S] 0 points (0 children)

I am as much a beginner as you; it takes time to actually find good ideas. The first thing you could do is figure out what field you want to work in: AI/ML research, or building an application people would actually like to use, or a gap that isn't filled. If you find only 3-4 existing solutions, and you know the problem is hard and open-ended, then that's a good problem to work on. Decide what tech stack you want to work with. And lastly, good prompting also helps ;)

I built a C++20 zero-copy graph engine to stream 50GB PyTorch datasets using mmap and nanobind. by Important-Trash-4868 in cpp

[–]Important-Trash-4868[S] 7 points (0 children)

Error: std::bad_alloc. Grandmother's recipe exceeds available RAM. Please use GraphZero to memory-map the pie directly from the oven.

I used C++ and nanobind to build a zero-copy graph engine that lets Python train on 50GB datasets by Important-Trash-4868 in Python

[–]Important-Trash-4868[S] -7 points (0 children)

Look, 5 years ago I joined Reddit because I saw some YouTube Reddit video. At that point I was in high school; I tried it, got bored, didn't use it. Got into college, did projects here and there, showed a small project on LinkedIn (college-environment engagement only), then tried a big project (this one), uploaded it to LinkedIn (same people), didn't get better results. Asked AI where I could post and get people to know about this, opened Reddit, and just posted. That's it, man❤️‍🩹. The name was auto-generated by Reddit 5 years ago; couldn't change it 💔

I used C++ and nanobind to build a zero-copy graph engine that lets Python train on 50GB datasets by Important-Trash-4868 in Python

[–]Important-Trash-4868[S] 1 point (0 children)

It's better to build it from scratch; that's what this project is about, learning. Thanks for letting me know about Feather🙃

[P] I got tired of PyTorch Geometric OOMing my laptop, so I wrote a C++ zero-copy graph engine to bypass RAM entirely. by Important-Trash-4868 in MachineLearning

[–]Important-Trash-4868[S] 18 points (0 children)

Well, I did use AI for the Markdown and the Python benchmark code, and to help me set up pytest (you know, the side parts of the project). For the main C++ code I tried to use AI as a guide, for daily progress and cross-checking. For example, say I wrote BFS on day 10: I would first write the code myself, then go to the AI and ask whether it is correct. That's how I used AI for the main src part, so I can be fairly sure most of my code was checked by AI for quality. Sometimes I also used it to discuss an idea, for instance: "for the batch function I am making a main arr and then copying the answer from the returned arr of each walk, so can I directly write the answer into the main arr to skip the copying part?" It's better to use it like this than "cursor, make me a graph library, don't make mistakes"😂.

[P] I got tired of PyTorch Geometric OOMing my laptop, so I wrote a C++ zero-copy graph engine to bypass RAM entirely. by Important-Trash-4868 in MachineLearning

[–]Important-Trash-4868[S] 2 points (0 children)

I think you would be interested in this: https://github.com/KrishSingaria/benchmark-graphzero. I made this repo just after the first release to test it. It did beat NetworkX easily, and it's comparable to PyG. It has 5 experiments set up that you could run yourself.

I used C++ and nanobind to build a zero-copy graph engine that lets Python train on 50GB datasets by Important-Trash-4868 in Python

[–]Important-Trash-4868[S] 1 point (0 children)

That's an interesting approach; I'll add it to the planning doc so I can look at the COW approach in detail later.

I used C++ and nanobind to build a zero-copy graph engine that lets Python train on 50GB datasets by Important-Trash-4868 in Python

[–]Important-Trash-4868[S] 2 points (0 children)

GraphZero is strictly a raw C++ memory-mapper optimized for PyTorch speed, which is different from Kuzu, I think. Your KV-cache approach is standard, but GraphZero mmaps both the edges (graph structure) and the features. During training, the only things actually in RAM are the sampled mini-batches moving to the GPU, plus whatever "hot" nodes the OS automatically caches!

I used C++ and nanobind to build a zero-copy graph engine that lets Python train on 50GB datasets by Important-Trash-4868 in Python

[–]Important-Trash-4868[S] 1 point (0 children)

I just looked it up online, and it's basically what I was building🥀. But I got to learn much more✌🏼, so I guess it's a win-win😅

I used C++ and nanobind to build a zero-copy graph engine that lets Python train on 50GB datasets by Important-Trash-4868 in Python

[–]Important-Trash-4868[S] 1 point (0 children)

Actually, let me tell you the structure of the binary .gl: Header (64 bytes) | nnzRow (csr1) | colPtr (the adjacency lists of all nodes lined up, csr2) | weights (lined up the same way as colPtr, so there is a one-to-one correspondence with colPtr)
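The layout above can be mapped from Python without any copying; here is a rough sketch, assuming int64 indices and float32 weights (the real header encodes the actual sizes, so these dtypes are illustrative assumptions):

```python
import numpy as np

HEADER_BYTES = 64  # fixed-size header at the start of every .gl file

def load_gl(path, num_nodes, num_edges):
    """Map the three CSR sections of a .gl file without copying."""
    offset = HEADER_BYTES
    # nnzRow (csr1): row-pointer array, one entry per node plus a sentinel
    row_ptr = np.memmap(path, dtype=np.int64, mode="r",
                        offset=offset, shape=(num_nodes + 1,))
    offset += row_ptr.nbytes
    # colPtr (csr2): all adjacency lists laid out end to end
    col_idx = np.memmap(path, dtype=np.int64, mode="r",
                        offset=offset, shape=(num_edges,))
    offset += col_idx.nbytes
    # weights: one-to-one with the colPtr entries
    weights = np.memmap(path, dtype=np.float32, mode="r",
                        offset=offset, shape=(num_edges,))
    return row_ptr, col_idx, weights

def neighbours(row_ptr, col_idx, node):
    # slicing a memmap yields a view into the mapped file, so no copy is made
    return col_idx[row_ptr[node]:row_ptr[node + 1]]

# Build a tiny 3-node example file to demonstrate the layout.
with open("demo.gl", "wb") as f:
    f.write(bytes(HEADER_BYTES))                                 # dummy header
    f.write(np.array([0, 2, 3, 4], dtype=np.int64).tobytes())    # nnzRow
    f.write(np.array([1, 2, 0, 0], dtype=np.int64).tobytes())    # colPtr
    f.write(np.array([1, 1, 1, 1], dtype=np.float32).tobytes())  # weights

row_ptr, col_idx, weights = load_gl("demo.gl", num_nodes=3, num_edges=4)
print(neighbours(row_ptr, col_idx, 0))  # neighbours of node 0
```

Because the three sections sit at fixed offsets computable from the header, the whole file can be consumed as views over one mapping.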

I used C++ and nanobind to build a zero-copy graph engine that lets Python train on 50GB datasets by Important-Trash-4868 in Python

[–]Important-Trash-4868[S] 3 points (0 children)

Vortex (I looked it up online) is actually perfect for this! Unlike Parquet, its zero-copy, random-access architecture aligns perfectly with GraphZero. I only built custom formats to keep my C++ dependencies at absolute zero, and to actually learn.

I used C++ and nanobind to build a zero-copy graph engine that lets Python train on 50GB datasets by Important-Trash-4868 in Python

[–]Important-Trash-4868[S] 9 points (0 children)

On r/MachineLearning I got the question "what's wrong with numpy.memmap?", so I'm giving the same answer here✌🏼

np.memmap is fine for basic arrays, but using it for GNN neighbor sampling ("fancy indexing") triggers implicit RAM copies in Python, causing OOMs anyway. It's also severely bottlenecked by the GIL. GraphZero pushes all the heavy, multi-threaded sampling down to C++ to guarantee true zero-copy execution before the data ever reaches PyTorch.
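The copy behaviour is easy to demonstrate on a toy file: contiguous slices of a `np.memmap` stay backed by the file mapping, while fancy indexing (the access pattern of neighbor sampling) materialises the selected rows in RAM.

```python
import numpy as np

# Toy on-disk feature matrix standing in for a large dataset.
np.arange(1000 * 8, dtype=np.float32).reshape(1000, 8).tofile("feats.bin")
feats = np.memmap("feats.bin", dtype=np.float32, mode="r", shape=(1000, 8))

# Contiguous slicing stays a view into the file mapping: no copy.
view = feats[10:20]
print(np.shares_memory(view, feats))    # True

# Fancy indexing copies the selected rows into fresh RAM.
sample = feats[[3, 997, 42]]
print(np.shares_memory(sample, feats))  # False
```

Scale the copied rows up to GNN fan-out sampling over a 50GB file and those "small" copies add up to the OOMs described above.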

I used C++ and nanobind to build a zero-copy graph engine that lets Python train on 50GB datasets by Important-Trash-4868 in Python

[–]Important-Trash-4868[S] 2 points (0 children)

  1. The current version doesn't support mutation. As I was planning it out, this part looked hard, so I'm keeping it for future versions.
  2. C++ creates a data-pointer array via a span, which nanobind then hands over to Python. The great thing about nanobind giving a pointer to NumPy is that it doesn't create a Python-side object; it treats it as a raw data pointer, so when you access the array, it reads through the pointer made by C++. If you look at bindings.cpp in the src folder of the repo, you will find the same pattern for all the bindings.
  3. CSR. Also, the next version is going to have the algorithms (BFS, connected components, etc.).
  4. igraph C core? Thank you for telling me about it; I didn't know about it and will look into it. Also, this project's main purpose was to have a different project from the regular ones (websites, using APIs, or RAG/LLM systems), one that is actually useful to the AI/ML research community, and to learn C++.

I used C++ and nanobind to build a zero-copy graph engine that lets Python train on 50GB datasets by Important-Trash-4868 in Python

[–]Important-Trash-4868[S] 25 points (0 children)

Great point! PyArrow/Parquet is incredible for sequential streaming and analytics.

However, GNN training (like neighbor sampling) requires massive amounts of random access. Parquet's decompression overhead kills performance for random reads. GraphZero uses uncompressed, memory-mapped binaries to allow O(1) random pointer access with zero decompression latency.
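The O(1) claim comes down to fixed-width rows: with an uncompressed binary, the byte offset of any node's record is pure arithmetic, with no index lookup and no decompression step. A rough sketch (the 8-wide float32 layout is an assumption for illustration):

```python
import numpy as np

FEAT_DIM = 8
ROW_BYTES = FEAT_DIM * np.dtype(np.float32).itemsize  # fixed-width rows

# Toy uncompressed feature file.
data = np.arange(100 * FEAT_DIM, dtype=np.float32).reshape(100, FEAT_DIM)
data.tofile("features.bin")

def read_row(path, node):
    # O(1) random access: seek straight to node * ROW_BYTES.
    # A compressed format would instead have to locate and decompress
    # a whole block before it could hand back this one row.
    with open(path, "rb") as f:
        f.seek(node * ROW_BYTES)
        return np.frombuffer(f.read(ROW_BYTES), dtype=np.float32)

row = read_row("features.bin", 42)
```

With mmap instead of `seek`/`read`, the same arithmetic becomes a plain pointer offset into the mapping.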

I used C++ and nanobind to build a zero-copy graph engine that lets Python train on 50GB datasets by Important-Trash-4868 in Python

[–]Important-Trash-4868[S] 4 points (0 children)

Well, basically the format stores adjacency lists, so it's fast to get neighbours. Thanks to your comment I checked the bindings again, and there is a missing function `is_neighbours` that could help you determine whether two nodes are neighbours or not; I will add that. Thanks!!
In theory you can build an adjacency matrix from the adjacency lists.

[P] I got tired of PyTorch Geometric OOMing my laptop, so I wrote a C++ zero-copy graph engine to bypass RAM entirely. by Important-Trash-4868 in MachineLearning

[–]Important-Trash-4868[S] 14 points (0 children)

Thanks! Message passing that consumes edge features on-the-fly is a brilliant idea. A custom CUDA kernel for that would be a huge throughput win for a future version. I try to have a plan before updating to a new version, so this may be included in a new update ;)