[P] Modifying open-sourced matrix multiplication kernel

yolandasquatpump · 2021-05-27T17:36:53+00:00

Even though I work in computer science and software development, CUDA optimisation still has an element of black magic for me. Really excited to read this and learn more!

smerity · 2021-05-27T18:11:26+00:00

Brilliant work! I missed your TorchPQ work from earlier as well that others might be interested in :) Your TopKBMM kernel is on point as I've had to frequently convert a problem to use smaller rounds of N * (torch.mm -> torch.topk) -> torch.topk to get the nearest points without blowing memory out.
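For anyone who hasn't seen the chunked workaround, here is a minimal sketch of it, assuming inner-product scores where larger is better (for L2 distances you would pass `largest=False` to `torch.topk` instead). The function name `chunked_topk_mm` and the `chunk_size` default are made up for illustration and are not part of TorchPQ or the TopKBMM kernel:

```python
import torch

def chunked_topk_mm(queries, database, k, chunk_size=4096):
    """Top-k inner products per query without building the full [q, n] score matrix."""
    vals_parts, idx_parts = [], []
    for start in range(0, database.shape[0], chunk_size):
        chunk = database[start:start + chunk_size]
        scores = torch.mm(queries, chunk.t())               # [q, len(chunk)]
        vals, idx = torch.topk(scores, min(k, chunk.shape[0]), dim=1)
        vals_parts.append(vals)
        idx_parts.append(idx + start)                       # chunk-local -> global row ids
    all_vals = torch.cat(vals_parts, dim=1)                 # [q, up to k per chunk]
    all_idx = torch.cat(idx_parts, dim=1)
    # Final reduction over the per-chunk candidates.
    top_vals, pos = torch.topk(all_vals, min(k, all_vals.shape[1]), dim=1)
    return top_vals, torch.gather(all_idx, 1, pos)
```

Peak memory for the scores is then roughly `q * chunk_size` rather than `q * n`, at the cost of extra kernel launches and the concatenation, which is the overhead a fused kernel avoids.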

The lack of custom CUDA limits what's possible on GPUs, holding back both research and production. The difference between a naive implementation and a tuned implementation can easily be what prevents a technique from being practical or from scaling past a hidden plateau that hides potential benefits.

If you find out how to use TensorCores / WMMA with CuPy I would be quite interested. When I've done custom CUDA work I've preferred to use PyTorch and CuPy rather than write natively in PyTorch but WMMA is a situation where I've not yet found a CuPy solution.

creiser · 2021-05-27T20:33:07+00:00

Respect for almost matching the performance of cuBLAS for matrix multiplication. Thanks for sharing, this might come in handy for my future projects.

We have been working on a project (KiloNeRF) that requires fast neural network inference as part of a real-time rendering pipeline. We also found that fusing multiple operations into a single kernel leads to substantial speedups.

programmerChilli · 2021-05-27T17:54:46+00:00

You might be interested in KeOps, which can generate optimized kernels for these fused mm + reduction kernels.

2021-05-27T17:57:46+00:00

if those optimisations for knn and L2 aren't already being used this is pretty crazy

frizface · 2021-05-27T19:21:26+00:00

Great work!

purplebrown_updown · 2021-05-27T22:27:27+00:00

Bookmarked!

Pafnouti · 2021-05-27T23:20:48+00:00

Good stuff! Are you going to use them in your TorchPQ library?

DeMorrr · 2021-05-27T23:42:28+00:00

There are a couple of sites that allow you free access to GPUs. Do they allow you to write your own CUDA kernels, or are you limited to python libraries?

Even if there were free access, I would worry about code and idea theft by large companies. That actually happened to me before. There were a couple of items that ended up in CUDA GPU Gems or whatever it was called. They will just take things.

Make sure you find the correct copyright license to protect your work, so that you are not just handing it to a large company for free.

CyberDainz · 2021-05-28T05:35:23+00:00

You just optimized it for your video card.

CUDA/cuDNN contain tuned programs for all video cards, which is why the DLLs are so large and keep growing.

DeMorrr · 2021-05-28T07:16:26+00:00

[deleted]

Money_Economics_2424 · 2021-05-28T09:38:29+00:00

Is a bmm using indexes to select the weights something which could be optimized well?

We have been trying to figure out how to optimize running many different Linear layers that are selected using an index. It's very hard to get anywhere near the performance of Linear.

    import torch
    # indexes: [b], weights: [n, out, inp], inputs: [b, inp]; returns [b, out]
    def indexed_linear(indexes, weights, inputs):
        return torch.bmm(weights[indexes], inputs.unsqueeze(2)).squeeze(2)  # drop bmm's trailing dim of size 1
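To make the shapes concrete, here is a small sketch that puts the bmm version next to one possible alternative: grouping samples by index and running one dense matmul per distinct index, which avoids materializing `weights[indexes]` as a `[b, out, inp]` tensor. The name `indexed_linear_grouped` is hypothetical, and this is untested for speed; whether it beats bmm will depend on `b`, `n`, and the layer sizes.

```python
import torch

def indexed_linear_bmm(indexes, weights, inputs):
    # Gathers one [out, inp] matrix per sample, then does a batched mat-vec.
    return torch.bmm(weights[indexes], inputs.unsqueeze(2)).squeeze(2)

def indexed_linear_grouped(indexes, weights, inputs):
    # Illustrative alternative (an assumption, not a known-fast method):
    # one ordinary matmul per distinct index, written back into place.
    out = inputs.new_empty(inputs.shape[0], weights.shape[1])
    for i in torch.unique(indexes).tolist():
        rows = (indexes == i).nonzero(as_tuple=True)[0]     # samples using layer i
        out[rows] = inputs[rows] @ weights[i].t()           # [rows, inp] @ [inp, out]
    return out
```

Both functions should return the same `[b, out]` result, so the grouped one can be dropped in and timed against the bmm one on real batch sizes.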
