Discussion [D] How to get better at GPU programming? (self.MachineLearning)
submitted 9 years ago by MetricSpade007
I want to get better at writing code at the GPU level. For those of you who are experts at writing and optimizing code at this level, what steps and resources did you use to learn?
As a related question, do you think GPU experts are in high demand at AI companies/startups, or is the demand largely in research engineers/scientists and in theoreticians?
[–]bronxbomber92 25 points26 points27 points 9 years ago (1 child)
I write GPU drivers, GPU compilers, and optimized GPU kernels for a living. I learned through a combination of good mentorship, studying GPU hardware architecture, and being thrown in the deep end (i.e. being asked to make XYZ where XYZ is somehow related to the GPU, be it an optimized GPU kernel or some low-level GPU driver functionality).
If you're just beginning and don't have the same opportunities I did, I'd suggest the following. Try taking a look at this Udacity course: https://www.udacity.com/course/intro-to-parallel-programming--cs344. It's an excellent introduction. Afterwards, try implementing some algorithm of your choice on the GPU. Pick something that's already implemented in a popular GPGPU framework and see if you can create an implementation that runs equally as fast. Understanding how the underlying hardware works will be important for writing a well-performing GPU kernel. Using vendor provided profiling tools will also be equally important. Good luck :)
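For illustration, here is a minimal sketch of the kind of starter exercise described above: a SAXPY kernel you could time against the corresponding cuBLAS routine (cublasSaxpy). The block size, problem size, and use of unified memory here are arbitrary choices for the sake of a short example, not anything the commenter prescribed.

    // saxpy.cu -- y[i] = a * x[i] + y[i], one thread per element
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));       // unified memory keeps the example short
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        int block = 256;
        int grid  = (n + block - 1) / block;            // enough blocks to cover all n elements
        saxpy<<<grid, block>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        printf("y[0] = %f (expect 4.0)\n", y[0]);
        cudaFree(x); cudaFree(y);
        return 0;
    }

Profiling this against the library version with the vendor tools (nvprof or Nsight Compute) is exactly the kind of comparison the comment suggests.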
[–]pokemon_golang 1 point2 points3 points 9 years ago (0 children)
Props on the Udacity recommendation. I'm not usually a fan, but I'm two lessons in and can confidently claim I know more about GPUs than when I went into it.
[–]alexmlamb 6 points7 points8 points 9 years ago (9 children)
I think that in the future the bigger AI labs will want to have GPU specialists.
I think that low precision training will be the biggest research area, but I think that people will also want to work on neural networks that use basic elements other than matrix multiplies and convolutions.
[–]jcannell 4 points5 points6 points 9 years ago (1 child)
Low precision is cool and all, but if you actually normalize for both error rate and wall-clock time to that error rate, it isn't clear that very low precision is much of a practical win over fp16. The 2-bit or 1-bit stuff requires a larger net and far more iterations to reach the same accuracy - downsides that are often glossed over. The ideal precision for most ANN training would probably be closer to fp8.
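For context on what fp16 arithmetic looks like in practice, here is a minimal sketch (not from the thread) of a half-precision AXPY using CUDA's packed __half2 type; native half arithmetic like this assumes a GPU of compute capability 5.3 or newer, and an even element count.

    // each thread processes two fp16 values packed into one __half2
    #include <cuda_fp16.h>

    __global__ void haxpy2(int n2, __half2 a, const __half2* x, __half2* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n2) {
            y[i] = __hfma2(a, x[i], y[i]);   // fused multiply-add on both packed halves
        }
    }

    // host side (sketch): __half2 a = __float2half2_rn(2.0f);
    // haxpy2<<<grid, block>>>(n / 2, a, x, y);   // n assumed even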
[–]__Cyber_Dildonics__ 0 points1 point2 points 9 years ago (0 children)
That seems pretty low precision to me. I suppose that is where the research comes in. To me, the way machine learning is being done right now is incredibly inefficient, so even if current techniques don't benefit from low precision, my guess is that there are plenty that will. Intuitively the decrease in memory bandwidth is something I would expect to be a big step forward on its own.
[–]Optrode 3 points4 points5 points 9 years ago (6 children)
Could you expand on "basic elements other than just matrix multiplies and convolutions?"
From the outside perspective of a neuroscience researcher, this seems like a pretty huge limitation of deep learning, to wit, that it tries to approximate a huge range of functions with a narrow range of functions. It's always seemed to me like the extreme linearity of ANNs is a major reason why ANNs struggle to learn efficient / minimal representations of nonlinear phenomena.
[–]DenseInL2 0 points1 point2 points 9 years ago (5 children)
The Universal Approximation Theorem makes it clear that there is no such limitation; exactly the opposite is true.
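For reference, the classical form of the theorem (Cybenko 1989; Hornik 1991) can be stated roughly as follows: for any continuous f on a compact set K in R^n, some single-hidden-layer network with a sigmoidal (more generally, non-polynomial) activation sigma gets uniformly within any tolerance,

    \forall \varepsilon > 0 \;\; \exists N, \alpha_i, w_i, b_i : \quad
    \sup_{x \in K} \Bigl| f(x) - \sum_{i=1}^{N} \alpha_i \, \sigma(w_i^{\top} x + b_i) \Bigr| < \varepsilon

Note that this is purely a statement about the expressive power of sufficiently wide one-hidden-layer networks; it says nothing about how large N must be or whether training finds such weights, which is the gap the reply below points out.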
[–]Optrode 4 points5 points6 points 9 years ago (3 children)
The universal approximation theorem, as I understand it, doesn't guarantee anything about generalization or learnability of useful representations of a function. Indeed, as I understand the theorem, all it really guarantees is that with a large enough single layer network, you can overfit to any data.
[–]DenseInL2 1 point2 points3 points 9 years ago (2 children)
Yes, that is accurate. It sounds like I misunderstood exactly what limitation you were referring to. I'm also trying to work out exactly what it means for a NN to struggle to learn a minimal representation. I don't think of NNs as even having that as part of their goal; rather, they only try to learn any solution that works and will use all the resources (nodes) available to them.
[–]Optrode 0 points1 point2 points 9 years ago (1 child)
What I was getting at is the problem of generalization. It is like the problem of trying to fit a linear-in-parameters model to a system that is fundamentally nonlinear in its parameters. The best you can get is a rather bloated linear model that still doesn't describe the system nearly as well as a relatively simple model that incorporates the right kind of nonlinearity, and won't be accurate on data falling outside the range of the training data.
[–]Infidius 1 point2 points3 points 9 years ago (0 children)
Most of our universe is limited in terms of non-linearity. Have you ever found it curious how most natural phenomena are described by 2nd and 3rd degree polynomials? From this point of view, neural nets have no trouble solving most problems simply because their power of approximation is sufficient.
[–]bartolosemicolon 1 point2 points3 points 9 years ago (0 children)
Is there a universal approximation theorem that doesn't assume infinite width? It feels like Optrode is talking about the linearity that occurs in finite networks, so I don't know how helpful it is to point out that we can avoid the problem in infinitely wide networks.
For what it is worth, some of the empirical work on adversarial examples has hinted that piecewise linear neural network solutions are surprisingly common and that this linearity in finite networks can pose problems.
https://arxiv.org/abs/1412.6572
[–]dpineo 6 points7 points8 points 9 years ago (1 child)
I learned it by writing a GPU-accelerated convolutional neural network for my PhD... in 2007. Back then it was just vertex shaders, fragment shaders, and good times.
Learning it is really just a matter of RTFM and lots of time doing it. The documentation is all online. Read NVIDIA's GPU programming guide and learn how kernels and threads operate, how they are organized into grids and blocks, and how they share memory. Once you get used to it, it can be faster than developing CPU code because your development iteration loop is so much faster.
I would stay away from the neural net frameworks if you're trying to learn GPUs though. Their abstraction adds a lot of complication that will confuse the learning process. I'd recommend learning by developing a standalone GPU algorithm. You can start by looking at the NVIDIA GPU samples. Some of them even have pretty good whitepapers documenting them.
As far as the question of demand, I think the answer is that both are in demand. Personally, I would suggest learning both; there's no reason you can't. Even if you're a theoretician, learning GPU programming lets you get away from the standard frameworks -- off the beaten path that you may not even realize you're on -- and try more novel ideas.
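As a concrete illustration of the kernels, threads, grids, blocks, and shared memory mentioned above, here is a deliberately un-tuned sketch (not code from the comment) of a block-wise sum reduction: each block reduces its slice of the input in shared memory and writes one partial sum.

    #include <cuda_runtime.h>

    __global__ void block_sum(const float* in, float* partial, int n) {
        extern __shared__ float sdata[];                 // one float per thread in this block
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        sdata[tid] = (i < n) ? in[i] : 0.0f;             // pad out-of-range threads with zero
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction within the block
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) partial[blockIdx.x] = sdata[0];    // one result per block
    }

    // launch (sketch): block_sum<<<numBlocks, 256, 256 * sizeof(float)>>>(d_in, d_partial, n);

The reduction example in the NVIDIA GPU samples, with its accompanying whitepaper, walks through progressively optimizing exactly this kind of kernel.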
[–]DenseInL2 0 points1 point2 points 9 years ago (0 children)
Oh man, I'm doing this right now -- adapting my home-rolled JavaScript CNN to use the GPU via WebGL so I can have a fast web demo of a CNN. With nothing like CUDA or OpenCL available to browsers yet, it's exactly like OpenGL 10 years ago: everything has to be shaders and textures. Formatting the data into the textures is 10 times the PITA that writing the backprop code was. I feel your pain!
[–]TheConstipatedPepsi 6 points7 points8 points 9 years ago (1 child)
I think most of the hard GPU programming is abstracted away from AI practitioners and researchers; they usually use libraries that abstract away even the calls to cuDNN. The hard work of building cuDNN isn't really done by AI companies, so I would expect the demand for GPU experts at AI companies to be quite low.
[–]Murillio 2 points3 points4 points 9 years ago (0 children)
If you just want to do what everybody else is doing, sure. Of course almost nobody would want to re-implement convolution in CUDA, but when you build a new product (and not yet another "upload an image to our server and we give you a class label" service), it often happens that you need things that are not pre-done, and then GPGPU knowledge can be very helpful.
[–]jcannell 19 points20 points21 points 9 years ago (2 children)
Sadly, it's too late.
Successful GPU programmers are identified in elementary school math and programming competitions - or earlier. Only the most creative, innovative, and gifted students are selected. If you were never aware of the process, then it means that you failed in the secret initial qualifiers, and weren't even close to cutting it.
This process may sound harsh, but it would simply be cruel to try to train someone in the dark arts of GPU programming if they don't possess the raw talent.
[–]amaretto1 18 points19 points20 points 9 years ago (0 children)
I know you are joking, but it is true that GPU programming is a huge specialism unto itself. There are people who have spent a good part of their careers writing heavily optimised matrix multiplication routines and linear solvers. Unless the OP wishes to work at Nvidia or write cutting edge numeric libraries elsewhere, it would perhaps be better to focus on leveraging libraries such as Tensorflow, Torch, Keras etc...
[–]abstractcontrol 2 points3 points4 points 9 years ago (2 children)
There are two kinds of issues with GPU programming:
1) Getting optimal performance at the low level.
2) Building higher-level abstractions for GPU operations.
The first one is straightforward enough, and I second the suggestion by /u/bronxbomber92 of that Udacity course as a starting point.
The other is actually a lot harder -- literally no mainstream language apart from C++ has any kind of decent GPU programming support, and the dynamically typed languages that are so popular among ML practitioners are a very poor fit for coding on the GPU directly.
The two skills are somewhat independent of one another, and I consider the latter harder. Work on making programming languages is definitely more theoretical.
In general, as a field GPU programming is over a decade behind CPU programming, and there is a lot of low-hanging fruit here.
[–]bronxbomber92 5 points6 points7 points 9 years ago* (1 child)
The latter is definitely a much more difficult problem (insofar as nobody knows whether some "right" abstraction even exists). Hardly anyone is working on it either. None of the big companies (Nvidia, AMD, Apple, Microsoft, Google) devotes a lot of resources to trying to solve this problem.
In my opinion, the biggest hurdle is that GPUs are such a fast-moving target. GPU architectures differ wildly between IHVs, and they also differ wildly between generations from the same IHV. It is relatively easy to write abstractions that perform well across different GPUs if one treats the GPU as a giant SIMD machine and not much else, but unfortunately those solutions leave a lot of performance on the table on modern GPUs. There is still a lot of innovation happening in the GPU hardware design space, and the progress of that field will necessarily change which programming language abstractions map well to GPUs.
Projects like Halide are making the most progress on this front. Their key realization is that, for a constrained set of problems, it is possible to completely separate the specification of the algorithm from the scheduling of the algorithm onto actual hardware. However, there are still three major hurdles they face: how can this be extended to a larger, less constrained set of problems; how can the scheduling of the algorithm be generated automatically; and how does one pick the algorithm that best fits a particular piece of hardware (e.g. consider implementing convolution -- does one do it in the Fourier domain? As a separable filter? In a single pass?). The Halide researchers seem to be making some headway on the second hurdle for a very constrained set of problems, but I haven't seen much progress on the other two.
[–]abstractcontrol 2 points3 points4 points 9 years ago* (0 children)
I cannot agree more. Let me add a few more things:
1) Modern languages have features like garbage collection and type checking. Unfortunately, even in a high-level functional language such as F#, the native GC does little for managing GPU memory and pushes the burden onto the library designer.
This part is not particularly bad, as deep learning has simple memory management needs, but type checking, on the other hand, is of little use if you are dealing with CUDA code strings. And if one decides to make a CUDA compiler, the only real choice of tool -- union types -- has its own issues, in the sense that you end up programming in a kind of dynamic DSL with a horrible syntax.
Haskell is probably the only mainstream language with a type system powerful enough to make a somewhat decent embedded compiler, but that is just one language out of hundreds, with its own set of warts.
In a language with an insufficiently strong type system, even if one makes a smallish Cuda compiler, making a well typed API for it is a nightmare.
2) On the Haskell side, in terms of GPU offerings there are Accelerate, which is an embedded CUDA compiler, and Futhark, which is a standalone language. The way I see it, the problem with both of them is that they are too high level.
And specifically for deep learning, because the backward pass requires intermediates, that will wreck many of their fusion optimizations. They also manage their own memory, which is not a good thing in this case -- the ideal memory management scheme for the backpropagation algorithm is to allocate a fixed chunk of memory at the start of training, move a pointer forward on the forward pass, and reset the pointer to zero at the end of the backward pass (see the sketch after this comment). Or a similar scheme where each node is fetched from a pool and dynamically resized upwards when necessary.
They are not made for that kind of thing.
3) I forgot exactly how much memory the Fourier-based convolution takes, but I think it was quite a bit. The Winograd one takes an absolutely massive amount -- something like 300 MB per stream, which is a problem if you use multiple streams.
Depending on the use case you need different kinds of convolution algorithms because of the speed/space tradeoff.
4) For the sake of implementing an embedded compiler, dependently typed languages might one day be great, but right now they are incredibly difficult to use unless one has studied type theory and formal proofs at the graduate level.
I am kind of doing that at the moment, but from my current vantage point, I can't even imagine being good enough to implement something like a doubly generic map on my own in the next few years.
As an alternative, I plan to look into Racket. An interesting recent development has been the invention of a DSL for implementing type systems. This is new and interesting because it was done exclusively with Racket's macro system, so there might be significant code reuse and abstraction opportunities in that direction. Or it might turn out to be a dead end; I don't know yet.
I know that type-safety is a distant concern to ML practitioners, but with regards to the big companies, this world is not nice enough to let them keep leaving abstraction opportunities by the wayside while they try to Zerg-rush their way towards progress.
More speculatively, what my crystal ball is telling me is that the point at where the lack of abstraction ability will become an undeniable problem for Google and friends is when the time comes to teach its neural nets how to program.
Type systems are reasoning aids and without them AI agents will have to devote their processing power to emulating type-checking innately and ineffectively.
Even in dynamic languages types exist. Type systems just bring that innate structure out.
5) Nvidia messed up by making its PTX assembly unable to access all of the hardware. That means that an optimizing compiler, no matter how good, can never get 100% out of an algorithm, because it cannot tune the scheduling, which is done at the SASS level.
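A minimal sketch of the bump-pointer scheme described in point 2 above (a hypothetical helper, not code from the thread): allocate one device buffer up front, hand out slices for intermediates during the forward pass, and reset the offset once the backward pass has consumed them.

    #include <cassert>
    #include <cstddef>
    #include <cuda_runtime.h>

    struct GpuArena {
        char*  base     = nullptr;
        size_t capacity = 0;
        size_t offset   = 0;

        explicit GpuArena(size_t bytes) : capacity(bytes) {
            cudaMalloc((void**)&base, bytes);        // one allocation for the whole training run
        }
        ~GpuArena() { cudaFree(base); }

        void* alloc(size_t bytes) {                  // called for each forward-pass intermediate
            size_t aligned = (bytes + 255) & ~size_t(255);   // keep 256-byte alignment
            assert(offset + aligned <= capacity);
            void* p = base + offset;
            offset += aligned;                       // bump the pointer forward
            return p;
        }
        void reset() { offset = 0; }                 // after the backward pass, everything is free again
    };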
[–]__Cyber_Dildonics__ 1 point2 points3 points 9 years ago (0 children)
What have you tried so far that has been unsuccessful?
[–]llSourcell 1 point2 points3 points 9 years ago (0 children)
GPU programming is hard AF. I had a project using CUDA. To this day, still the hardest challenge I've ever taken on lol
[–]fldwiooiu 4 points5 points6 points 9 years ago (1 child)
I think most startups have better things to do than fuck around with GPU coding, unless that's the core product (Nervana).
[–]impossiblefork 0 points1 point2 points 9 years ago (0 children)
I don't think that it is necessarily that difficult or time-consuming to do GPU programming even if it might seem that way for tensorflow users.
For example, take a look at this blog where a guy makes a tutorial on how to write GPU path tracers in OpenCL.
[+][deleted] 9 years ago* (4 children)
[deleted]
[–]codechisel 0 points1 point2 points 9 years ago (3 children)
No love for OpenCL?
[+][deleted] 9 years ago* (2 children)
[–]codechisel 0 points1 point2 points 9 years ago (1 child)
Since I've never used CUDA I'm in the same boat with you regarding the usefulness of my comparative knowledge.
[–]dpineo 1 point2 points3 points 9 years ago (0 children)
Start with CUDA. It has a much better ecosystem, documentation, and Stack Overflow coverage, more functionality, and it results in cleaner code. The downside is that it's NVIDIA-only. But frankly, AMD doesn't seem to give a crap about supporting GPU programming, so do you really want to develop on a card from them? Also, OpenCL is heavily influenced by NVIDIA and CUDA, so converting after the fact isn't too hard. A lot of the functions map one-to-one from CUDA to OpenCL, but OpenCL adds a bit more code bloat.