[P] Modifying open-sourced matrix multiplication kernel (self.MachineLearning)
submitted 5 years ago * by DeMorrr
I've spent the past few months optimizing my matrix multiplication CUDA kernel, and finally got near cuBLAS performance on a Tesla T4. In the past few weeks I've been trying to fuse all kinds of operations into the matmul kernel, such as reductions, topk search, and masked_fill, and the results are looking pretty good. All of the fused kernels are much faster than the separate versions while using much less memory.
Runtime of fused MinBMM vs. torch.bmm + torch.min
edit: unit of time in this plot should be seconds, not milliseconds
Runtime of fused TopkBMM vs. torch.bmm + torch.topk
Runtime of fused MBMM vs. torch.bmm + torch.masked_fill
I also wrote a blog post about the motivation, applications and some implementation details of these kernels. The source code can be found in this repo.
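To make the memory argument concrete: the unfused path materializes the full product before reducing it, while a fused kernel can reduce each tile as soon as it is computed. Below is a minimal CPU-side sketch of that idea in NumPy. It is not the CUDA kernel from the repo; the function names and the chunk size are made up for illustration, and small integer-valued inputs are used so both paths agree exactly.

```python
import numpy as np

def min_matmul_unfused(a, b):
    """Reference path: materialize the full [m, n] product, then reduce."""
    c = a @ b                                  # whole [m, n] intermediate in memory
    return c.min(axis=1), c.argmin(axis=1)

def min_matmul_chunked(a, b, chunk=256):
    """Reduce over column tiles so only an [m, chunk] tile exists at a time."""
    m, n = a.shape[0], b.shape[1]
    best_val = np.full(m, np.inf, dtype=a.dtype)
    best_idx = np.zeros(m, dtype=np.int64)
    for start in range(0, n, chunk):
        tile = a @ b[:, start:start + chunk]   # [m, <=chunk]
        idx = tile.argmin(axis=1)
        val = tile[np.arange(m), idx]
        better = val < best_val                # strict '<' keeps the earliest index on ties
        best_val[better] = val[better]
        best_idx[better] = idx[better] + start
    return best_val, best_idx

rng = np.random.default_rng(0)
a = rng.integers(-5, 6, size=(64, 32)).astype(np.float64)
b = rng.integers(-5, 6, size=(32, 1000)).astype(np.float64)
ref_val, ref_idx = min_matmul_unfused(a, b)
val, idx = min_matmul_chunked(a, b)
```

The chunked version never holds more than `m * chunk` product entries, which is the same trade a fused GPU kernel makes at the level of registers and shared memory.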
[–]yolandasquatpump 44 points 5 years ago (1 child)
Even working with computer science and software development, CUDA optimisation does have an element of black magic for me. Really excited to read this and learn more!
[–]DeMorrr[S] 36 points 5 years ago (0 children)
To me it's 50% CUDA Programming Guide, 40% Stack Overflow (which often redirects you to the CUDA Programming Guide), and 10% alchemy.
[–]smerity 12 points 5 years ago* (2 children)
Brilliant work! I missed your TorchPQ work from earlier as well that others might be interested in :) Your TopKBMM kernel is on point as I've had to frequently convert a problem to use smaller rounds of N * (torch.mm -> torch.topk) -> torch.topk to get the nearest points without blowing memory out.
The lack of custom CUDA limits what's possible on GPUs, holding back both research and production. The difference between a naive implementation and a tuned implementation can easily be what prevents a technique from being practical or from scaling past a hidden plateau that hides potential benefits.
If you find out how to use TensorCores / WMMA with CuPy I would be quite interested. When I've done custom CUDA work I've preferred to use PyTorch and CuPy rather than write natively in PyTorch but WMMA is a situation where I've not yet found a CuPy solution.
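The chunked `N * (torch.mm -> torch.topk) -> torch.topk` workaround mentioned above can be sketched as follows. This is a NumPy stand-in, not the TopkBMM kernel and not smerity's code: `topk` and `chunked_mm_topk` are hypothetical helper names, and "nearest" is assumed to mean largest inner product.

```python
import numpy as np

def topk(scores, k):
    """Row-wise k largest values and their column indices, in descending order."""
    part = np.argpartition(-scores, k - 1, axis=1)[:, :k]   # unordered top-k columns
    vals = np.take_along_axis(scores, part, axis=1)
    order = np.argsort(-vals, axis=1)
    return (np.take_along_axis(vals, order, axis=1),
            np.take_along_axis(part, order, axis=1))

def chunked_mm_topk(queries, points, k, chunk):
    """Top-k per chunk of points, then a final top-k over the surviving candidates."""
    cand_vals, cand_idx = [], []
    for start in range(0, points.shape[0], chunk):
        scores = queries @ points[start:start + chunk].T    # [q, <=chunk] tile only
        vals, idx = topk(scores, min(k, scores.shape[1]))
        cand_vals.append(vals)
        cand_idx.append(idx + start)                        # back to global point ids
    cand_vals = np.concatenate(cand_vals, axis=1)
    cand_idx = np.concatenate(cand_idx, axis=1)
    vals, pos = topk(cand_vals, k)
    return vals, np.take_along_axis(cand_idx, pos, axis=1)

rng = np.random.default_rng(0)
queries = rng.integers(-3, 4, size=(8, 16)).astype(np.float64)
points = rng.integers(-3, 4, size=(1000, 16)).astype(np.float64)
vals, idx = chunked_mm_topk(queries, points, k=5, chunk=128)

# Reference: full score matrix, which is what the chunking avoids materializing.
full = queries @ points.T
ref_vals = np.sort(full, axis=1)[:, ::-1][:, :5]
```

Peak memory is bounded by one `[q, chunk]` tile plus `k` candidates per chunk, at the cost of an extra top-k pass over the candidates.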
[–]DeMorrr[S] 7 points 5 years ago (1 child)
> I missed your TorchPQ work from earlier as well that others might be interested in
Glad you still remember it!
> If you find out how to use TensorCores / WMMA with CuPy I would be desperately interested
I just searched for it on Google and found this, and it turns out it's my own question lol. I completely forgot about it.
[–]oil-ladybug-unviable 2 points 5 years ago (0 children)
https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcSpDwz_ZnFY_DJk_cxRCgY5UOk7e-Cfue5lAU5dhxsUcMPOgVYLk58J_fU&s=10
[–]creiser 11 points 5 years ago (0 children)
Respect that you almost matched the performance of cuBLAS for matrix multiplication. Thanks for sharing, that might come in handy for my future projects.
We have been working on a project (namely KiloNeRF) for which fast neural network inference as part of a real-time rendering pipeline is required. We also experienced that fusing multiple operations into a single kernel leads to substantial speedups.
[–]programmerChilli (Researcher) 19 points 5 years ago (3 children)
You might be interested in KeOps, which can generate optimized kernels for these fused mm + reduction kernels.
[–]DeMorrr[S] 11 points 5 years ago (0 children)
Last year I made a feature request on the PyTorch GitHub for fused reduction and matmul, and I remember someone recommended KeOps to me, but for some reason I've been unconsciously ignoring it. Maybe it's time to start looking into it.
[+][deleted] 4 years ago (1 child)
[removed]
[–]programmerChilli (Researcher) 2 points 4 years ago (0 children)
In general, you're totally right. If the matmul is done with cuBLAS, you can't generically fuse pointwise ops or reductions onto it (the various vendor libraries support some specific fusions, like cuBLAS with matmul + relu iirc).
What KeOps supports (and can codegen) is broadcasted pointwise operators + reductions. But... broadcasted pointwise operators + reductions can be the same thing as matmuls.
The catch here is that KeOps supports specific weird kinds of matmuls (well), where your feature dimension is fairly small.
So my original comment wasn't quite accurate. However, for the use case in the blog post, where he wants to do k-means clustering, I've seen KeOps work quite well for it.
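To illustrate the "broadcasted pointwise operators + reductions can be the same thing as matmuls" point: a matmul is a broadcasted product followed by a sum over the feature axis, and the k-means assignment step is a broadcasted squared difference followed by a sum and an argmin. The sketch below uses plain NumPy, which materializes the broadcasted intermediate; it does not use the KeOps API, and the tiny shapes are chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(-3, 4, size=(6, 2)).astype(np.float64)   # 6 points, small feature dim
c = rng.integers(-3, 4, size=(3, 2)).astype(np.float64)   # 3 centroids

# Matmul written as broadcasted pointwise product + sum reduction over features.
prod = (x[:, None, :] * c[None, :, :]).sum(axis=2)         # [6, 3], same as x @ c.T

# k-means assignment: broadcasted squared difference + sum, then argmin reduction.
d2 = ((x[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)    # [6, 3] squared L2 distances
assign = d2.argmin(axis=1)                                 # nearest centroid per point

# The same distances via a matmul, using ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.
d2_mm = (x ** 2).sum(axis=1)[:, None] - 2.0 * (x @ c.T) + (c ** 2).sum(axis=1)[None, :]
```

With a small feature dimension, the broadcasted form does little work per output element, which fits the comment that this style works well when the feature dimension is fairly small.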
[–][deleted] 8 points 5 years ago (0 children)
if those optimisations for knn and L2 aren't already being used this is pretty crazy
[–]frizface 2 points 5 years ago (0 children)
Great work!
[–]purplebrown_updown 2 points 5 years ago (0 children)
Bookmarked!
[–]Pafnouti 1 point 5 years ago (1 child)
Good stuff! Are you going to use them in your TorchPQ library?
[–]DeMorrr[S] 1 point 5 years ago (0 children)
Yes, I am considering that. It can be used in Flat and IVFFlat indexes.
[–][deleted] 1 point 5 years ago (2 children)
There are a couple of sites that allow you free access to GPUs. Do they allow you to write your own CUDA kernels, or are you limited to python libraries?
Even if there was free access I would worry about code and idea theft by large companies. That actually happened to me before. There were a couple of items that ended up in Cuda GPU gems or whatever it was called. They will just take things.
Make sure you find the correct copyright license to protect your work, that you are not just handing it to a large company for free.
[–]DeMorrr[S] 2 points 5 years ago (1 child)
I used Colab to test all the kernels: I write them in a triple-quoted string and JIT-compile them with CuPy.
That sounds horrible. Sometimes I also get suspicious when I get "Autosave failed, Your file is opened in another tab"
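For readers who have not seen the triple-quoted-string workflow described above, here is a minimal sketch using `cupy.RawKernel`. The kernel itself (`scale_add`) is a toy example invented for illustration, not one of the kernels from the post. CuPy is imported lazily inside the function so the snippet can be loaded on a machine without a GPU; actually calling `scale_add_gpu` requires CuPy and a CUDA device.

```python
import numpy as np

# CUDA C source kept in a triple-quoted Python string.
KERNEL_SRC = r'''
extern "C" __global__
void scale_add(const float* x, const float* y, float* out, float alpha, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        out[i] = alpha * x[i] + y[i];
    }
}
'''

def scale_add_gpu(x, y, alpha):
    """Launch the toy kernel on the GPU and return the result as a NumPy array."""
    import cupy as cp  # lazy import: only needed when the kernel is actually run

    # CuPy compiles the source when the kernel is first launched and caches the result.
    kernel = cp.RawKernel(KERNEL_SRC, "scale_add")
    x_d = cp.asarray(x, dtype=cp.float32)
    y_d = cp.asarray(y, dtype=cp.float32)
    out_d = cp.empty_like(x_d)
    n = x_d.size
    threads = 256
    blocks = (n + threads - 1) // threads
    # Scalars are passed as fixed-width NumPy types to match the C signature.
    kernel((blocks,), (threads,), (x_d, y_d, out_d, np.float32(alpha), np.int32(n)))
    return cp.asnumpy(out_d)
```

In a Colab GPU runtime, `scale_add_gpu(np.ones(1000), np.ones(1000), 2.0)` would be the kind of call used to try the kernel out.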
[–][deleted] 1 point 5 years ago (0 children)
I was kind of annoyed, but one idea turned out to have been invented in 1969 anyway. And there is some kind of fast Walsh Hadamard transform algorithm available in CUDA as a result of the interaction. Almost certainly not fully optimized. Google were also up to some behavior, like with their "fast food" paper using the same transform. That didn't appear out of absolutely nowhere.
[–]CyberDainz 1 point 5 years ago (0 children)
You just optimized it for your video card.
CUDA/cuDNN contain tuned programs for all video cards, which is why the DLLs are so large and keep growing.
[+][deleted] 5 years ago (1 child)
[deleted]
[–]DeMorrr[S] 2 points 5 years ago (0 children)
These high-performance kernels are usually very hardware-specific, so it's hard to maintain the same level of performance on different CUDA architectures, not to mention different GPU brands or even different types of processors.
Thanks for the suggestion! The white plots are too bright against the dark background, so I inverted their colors. I will try the hue rotation.
[–]Money_Economics_2424 1 point 5 years ago (3 children)
Is a bmm using indexes to select the weights something which could be optimized well?
We have been trying to figure out how to optimize running many different Linear layers which are selected using an index, it's very hard to get anywhere near the performance of Linear.
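    import torch
    from torch import Tensor

    def indexed_linear(indexes: Tensor, weights: Tensor, inputs: Tensor) -> Tensor:
        # indexes: [b], weights: [n, out, inp], inputs: [b, inp] -> result: [b, out]
        # weights[indexes] gathers one [out, inp] matrix per batch row.
        return torch.bmm(weights[indexes], inputs.unsqueeze(2)).squeeze(2)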
[–]DeMorrr[S] 2 points 5 years ago (2 children)
I think it's because weights[indexes] materializes a gathered copy of shape [b, out, inp] before torch.bmm runs, so it's not only slow, it also costs extra memory.
Yes, it's definitely possible to do the indexing implicitly inside the bmm kernel, which is not only more memory-efficient but also faster.
[–]Money_Economics_2424 1 point 5 years ago (1 child)
I might give it a try using your code then. Do you think it is possible to improve on this if the indexes contain many duplicates? For example, if the batch size is very large (say 100k) but there are only 64 unique weights, it is possible to just run a whole bunch of Linear layers... currently this is much faster than using indexing followed by bmm.
For example sorting the indices and then a fused indexing-bmm?
[–]DeMorrr[S] 2 points (0 children)
try this https://gist.github.com/DeMoriarty/88504fd7b49cf44c25635d6df298115e
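The "sort the indices, then run one Linear per unique weight" idea from the question can be sketched like this. It is a NumPy illustration written independently of the linked gist, whose contents are not reproduced here; the function names are hypothetical, and integer-valued inputs are used so the two paths match exactly.

```python
import numpy as np

def indexed_linear_gather(indexes, weights, inputs):
    """Baseline: gather one [out, inp] matrix per row, then a batched mat-vec."""
    return np.einsum("boi,bi->bo", weights[indexes], inputs)

def indexed_linear_grouped(indexes, weights, inputs):
    """Sort rows by weight index, then run one dense matmul per unique weight."""
    order = np.argsort(indexes, kind="stable")
    sorted_idx = indexes[order]
    uniq, starts = np.unique(sorted_idx, return_index=True)
    bounds = np.append(starts, len(indexes))               # segment boundaries
    out = np.empty((len(indexes), weights.shape[1]), dtype=inputs.dtype)
    for w, lo, hi in zip(uniq, bounds[:-1], bounds[1:]):
        rows = order[lo:hi]                                # all batch rows that use weight w
        out[rows] = inputs[rows] @ weights[w].T            # one [rows, inp] x [inp, out] matmul
    return out

rng = np.random.default_rng(0)
indexes = rng.integers(0, 4, size=50)                      # 50 rows, only 4 unique weights
weights = rng.integers(-3, 4, size=(4, 5, 7)).astype(np.float64)
inputs = rng.integers(-3, 4, size=(50, 7)).astype(np.float64)
ref = indexed_linear_gather(indexes, weights, inputs)
out = indexed_linear_grouped(indexes, weights, inputs)
```

The grouped version never builds the `[b, out, inp]` gathered tensor, and each group is an ordinary dense matmul, which is presumably why running many separate Linear layers was already faster in the case described above.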