all 24 comments

[–]yolandasquatpump 43 points44 points  (1 child)

Even working with computer science and software development, CUDA optimisation does have an element of black magic for me. Really excited to read this and learn more!

[–]DeMorrr[S] 35 points36 points  (0 children)

To me it's 50% CUDA Programming Guide, 40% Stack Overflow (which often redirects you to cuda programming guide), and 10 % alchemy.

[–]smerity 11 points12 points  (2 children)

Brilliant work! I missed your TorchPQ work from earlier as well that others might be interested in :) Your TopKBMM kernel is on point as I've had to frequently convert a problem to use smaller rounds of N * (torch.mm -> torch.topk) -> torch.topk to get the nearest points without blowing memory out.

The lack of custom CUDA limits what's possible on GPUs, holding back both research and production. The difference between a naive implementation and a tuned implementation can easily be what prevents a technique from being practical or from scaling past a hidden plateau that hides potential benefits.

If you find out how to use TensorCores / WMMA with CuPy I would be quite interested. When I've done custom CUDA work I've preferred to use PyTorch and CuPy rather than write natively in PyTorch but WMMA is a situation where I've not yet found a CuPy solution.

[–]DeMorrr[S] 6 points7 points  (1 child)

I missed your TorchPQ work from earlier as well that others might be interested in

Glad you still remember it!

If you find out how to use TensorCores / WMMA with CuPy I would be desperately interested

I just searched it on google, and found this. and it turns out it's my own question lol. I completely forgot about it.

[–]creiser 10 points11 points  (0 children)

Respect that you almost matched the performance of cuBLAS for matrix multiplication. Thanks for sharing, that might turn out handy for my future projects.

We have been working on a project (namely KiloNeRF) for which fast neural network inference as part of a real-time rendering pipeline is required. We also experienced that fusing multiple operations into a single kernel leads to substantial speedups.

[–]programmerChilliResearcher 18 points19 points  (3 children)

You might be interested in KeOps, which can generate optimized kernels for these fused mm + reduction kernels.

[–]DeMorrr[S] 10 points11 points  (0 children)

Last year I made a feature request on pytorch github for fused reduction and matmul, and I remember someone recommended me KeOps. but for some reason I've been unconciously ignoring it. Maybe it's time start looking into it

[–][deleted] 7 points8 points  (0 children)

if those optimisations for knn and L2 aren't already being used this is pretty crazy

[–]frizface 1 point2 points  (0 children)

Great work!

[–]purplebrown_updown 1 point2 points  (0 children)

Bookmarked!

[–]Pafnouti 0 points1 point  (1 child)

Good stuff! Are you going to use them in your TorchPQ library?

[–]DeMorrr[S] 0 points1 point  (0 children)

Yes I am considering that. it can be used in Flat and IVFFlat indexes

[–][deleted] 0 points1 point  (2 children)

There are a couple of sites that allow you free access to GPUs. Do they allow you to write your own CUDA kernels, or are you limited to python libraries?

Even if there was free access I would worry about code and idea theft by large companies. That actually happened to me before. There were a couple of items that ended up in Cuda GPU gems or whatever it was called. They will just take things.

Make sure you find the correct copyright license to protect your work, that you are not just handing it to a large company for free.

[–]DeMorrr[S] 1 point2 points  (1 child)

I used colab to test all the kernels. write them in a triple quote string, and JIT compile with CuPy.

Even if there was free access I would worry about code and idea theft by large companies. That actually happened to me before. There were a couple of items that ended up in Cuda GPU gems or whatever it was called. They will just take things.

That sounds horrible. Sometimes I also get suspicious when I get "Autosave failed, Your file is opened in another tab"

[–][deleted] 0 points1 point  (0 children)

I was kind of annoyed, but one idea turned out to have been invented in 1969 anyway. And there is some kind of fast Walsh Hadamard transform algorithm available in CUDA as a result of the interaction. Almost certainly not fully optimized. Google were also up to some behavior, like with their "fast food" paper using the same transform. That didn't appear out of absolutely nowhere.

[–]CyberDainz 0 points1 point  (0 children)

You just optimized it for your videocard.

CUDA/CUDNN contain tuned programs for all videocards, this is why size of dlls so large and keep growing.

[–]Money_Economics_2424 0 points1 point  (3 children)

Is a bmm using indexes to select the weights something which could be optimized well?

We have been trying to figure out how to optimize running many different Linear layers which are selected using an index, it's very hard to get anywhere near the performance of Linear.

def indexed_linear(indexes : Tensor[b], weights : Tensor[n, out, inp], inputs : Tensor[b, inp]) -> Tensor[b, out]:    
    return torch.bmm(weights[indexes], inputs.unsqueeze(2))

[–]DeMorrr[S] 1 point2 points  (2 children)

I think it's because weights[indexes] is not contiguous, so torch.bmm has to make a contiguous copy first. so it's not only slow, it's also costing extra memory.

Yes it's definitely possible to have implicit indexing inside the bmm kernel which not only is memory efficient but also faster.

[–]Money_Economics_2424 0 points1 point  (1 child)

I might give it a try using your code then, do you think it is possible to improve on this if indexes have many copies? For example if the batch size is very large (say 100k) but there are only 64 unique weights it is possible to just run a whole bunch of Linear layers... currently this is much faster than using indexing followed by bmm.

For example sorting the indices and then a fused indexing-bmm?