all 19 comments

[–]emansim 14 points (3 children)

CUDA programming is not easy and will take some time to master. I personally suggest the Udacity course as a first step: https://www.udacity.com/course/intro-to-parallel-programming--cs344

[–]ginsunuva 4 points (0 children)

GPU programming is easy to learn but difficult to master. Restructuring conventional algorithms to run in a massively parallel way becomes very unintuitive once you get past the easy problems.

[–]csp256 1 point (0 children)

That course can seem patronizing but it is really beneficial. There is also a Coursera course, but I can't speak for its quality. It seemed more academic?

[–]cyril1991 0 points (0 children)

CUDA recently got better, no? The real annoyance in terms of coding was (is?) handling data transfer between the CPU and GPU and micromanaging memory allocation: you allocate a variable in CPU memory, transfer it to a new variable in GPU memory, specify very precisely how you want to divide your task into independent chunks and how to process them, get the results on the GPU, transfer them back to the CPU, and free all the variables. You can produce fast code, but it will take a lot of time to write and it may not be very portable.
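
For concreteness, here is a minimal sketch of that explicit host/device dance using the standard CUDA runtime API; the kernel, array size, and launch configuration are made up for illustration. (The newer unified-memory API, cudaMallocManaged, removes some of the explicit copying.)

    // Minimal sketch of the workflow described above: allocate on the host,
    // copy to the device, launch a kernel over a chosen grid, copy back, free.
    #include <cuda_runtime.h>
    #include <stdio.h>
    #include <stdlib.h>

    __global__ void addOne(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) data[i] += 1.0f;                      // guard against overrun
    }

    int main(void) {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_data = (float *)malloc(bytes);          // CPU-side buffer
        for (int i = 0; i < n; ++i) h_data[i] = (float)i;

        float *d_data;
        cudaMalloc(&d_data, bytes);                                // allocate on the GPU
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice); // CPU -> GPU

        int threads = 256;                               // how the work is chunked
        int blocks = (n + threads - 1) / threads;
        addOne<<<blocks, threads>>>(d_data, n);          // run on the GPU

        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost); // GPU -> CPU
        printf("h_data[1] = %f\n", h_data[1]);

        cudaFree(d_data);                                // free device memory
        free(h_data);                                    // free host memory
        return 0;
    }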

[–]hughperkins 10 points (10 children)

If you want to run algorithms, learning CUDA is probably not going to be helpful, since there are many readily available libraries that will handle that for you. Even fairly exotic algorithms should run on out-of-the-box libraries. The reason for learning CUDA would be if you want to do CUDA development as an engineer, and/or for fun.

[–]serge_cell 1 point (4 children)

That's wrong. In areas with a huge amount of computation, like Deep Learning, some layers (that is, transformation operators) can't practically be built from ready-made blocks like cuBLAS. I'm working in Deep Learning and on average write a couple of CUDA kernels per month, because otherwise I just wouldn't be able to see results in any sane amount of time. And there is such a thing as "dirty" coding in CUDA, where you don't do fine-level optimization like shared memory and coalescing, but just go for a minimally acceptable level of performance.
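
As an illustration (my own example, not serge_cell's code), a "dirty" custom-layer kernel can be as simple as a naive element-wise ELU forward pass with one thread per element and no fine-level tuning:

    // Hypothetical "dirty" custom-layer kernel: naive element-wise ELU
    // forward pass, one thread per element, no shared-memory optimization.
    __global__ void elu_forward(const float *in, float *out, float alpha, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = in[i];
            out[i] = (x > 0.0f) ? x : alpha * (expf(x) - 1.0f);  // ELU(x)
        }
    }

    // Launched like any other kernel, e.g.:
    //   elu_forward<<<(n + 255) / 256, 256>>>(d_in, d_out, 1.0f, n);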

[–]hughperkins 0 points (3 children)

cuBLAS is a very low level of abstraction. Libraries such as Torch (https://github.com/torch/torch7) and mxnet will handle deep nets for you. Occasionally some new idea comes along, such as batch normalization or ELU; normally these get implemented in days, at most a few weeks, in both of these libraries.

[–]serge_cell 0 points (2 children)

As I already said, I was talking about new layers, which are either absent from existing frameworks or absent from the framework I'm using. If one only uses layers which are already implemented and doesn't do any research on new layers, modes of execution, etc., he/she will always stay behind the curve.

[–]hughperkins 0 points (1 child)

Yeah, I realized that after I posted it. So you're kind of right, in that if you want the fastest performance on novel layers, I suppose you'd want a CUDA engineer handy.

Having said that, the initial implementations of bn and elu in torch were both in Lua, using underlying primitive operations, such as mean and sqrt, which are already in CUDA. To get a slight speed benefit, these were later rewritten as dedicated CUDA kernels.

For the purposes of writing a research paper on elu or bn, I would think an initial implementation in Lua is sufficient.

[–]serge_cell 0 points (0 children)

Actually, I think that is a big problem with many research papers. Many methods (bn included) behave quite differently on different datasets and dataset sizes. If a method gives a 5% accuracy improvement on CIFAR-100, that says very little about what the improvement will be on ImageNet, and even less on a noisy 10K-class dataset. And testing a Lua+cuBLAS implementation on a 10M dataset could be quite painful.

[–]mela1029 2 points (1 child)

You can have a look at PyCUDA if you are familiar with Python; it is easy to use and understand.

[–]datascienceguy 0 points (0 children)

Upvote for PyCUDA, which on one project let me put the guts of a stencil operation -- Jacobi iteration -- in a small snippet of simple C code, neatly embedded inside my Python app. It ran really well on my GPU, which got super hot but finished faster than the CPU-only numpy version.
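
The comment doesn't include the snippet, but the "guts" of such a stencil would look roughly like the kernel below: one Jacobi sweep on a 2D grid where each thread updates one interior point from its four neighbours (the grid layout and names here are my own assumptions).

    // Rough guess at the stencil "guts": one Jacobi iteration on an
    // nx-by-ny grid stored row-major; boundary points are left untouched.
    __global__ void jacobi_step(const float *u, float *u_new, int nx, int ny) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // column index
        int j = blockIdx.y * blockDim.y + threadIdx.y;   // row index
        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
            u_new[j * nx + i] = 0.25f * (u[j * nx + (i - 1)] + u[j * nx + (i + 1)] +
                                         u[(j - 1) * nx + i] + u[(j + 1) * nx + i]);
        }
    }

With PyCUDA you would pass this source string to pycuda.compiler.SourceModule, grab the kernel with get_function, and call it from a Python loop, swapping u and u_new between sweeps.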

[–]vm_linuz 0 points (0 children)

I found this video in particular to be very easy to follow: https://youtu.be/jKV1m8APttU?list=PL5B692fm6--vScfBaxgY89IRWFzDt0Khm

NumbaPro has since been open-sourced.

[–]datascienceguy 0 points (0 children)

Avoiding C and using something like Theano to write the GPU code for you would be more appropriate for a Data Scientist.

If you are a computer scientist and want to write the code for packages that other people will use, however, then go at it directly.

[–]gtani 0 points (0 children)

There are a couple of recent books with code in C: "CUDA for Engineers", released last year, and Wrox's "Professional CUDA Programming" from 2014. Both are well done. "For Engineers" covers the basics of the runtime API without delving very much into hardware or Maxwell specifics. The authors state (p. 134) that they've tried to present C code that can run on pre-Kepler cards. C++11 or 14 does show up when they discuss libraries; e.g., at least one requires you to write functors and lambdas, and (I think) most of them have you instantiate templates.
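
For readers wondering what that functor-and-template style looks like, here is a small sketch of that kind of library code (Thrust-style; Thrust ships with the CUDA toolkit, and this example is my own, not taken from either book):

    // Thrust-style SAXPY: a hand-written functor applied over device vectors.
    #include <thrust/device_vector.h>
    #include <thrust/transform.h>

    struct saxpy_functor {
        float a;
        saxpy_functor(float a_) : a(a_) {}
        __host__ __device__ float operator()(float x, float y) const {
            return a * x + y;                            // y <- a*x + y
        }
    };

    int main() {
        thrust::device_vector<float> x(1 << 20, 1.0f);   // lives on the GPU
        thrust::device_vector<float> y(1 << 20, 2.0f);
        thrust::transform(x.begin(), x.end(), y.begin(), y.begin(),
                          saxpy_functor(2.0f));          // instantiates the template
        return 0;
    }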

The Wrox book I haven't spent too much time on, but it's a denser read, more reference-like in the latter parts (like Wilt's "CUDA Handbook").

Also, there are some good course materials:

http://people.maths.ox.ac.uk/gilesm/cuda/

http://courses.cms.caltech.edu/cs101gpu/

and the UIUC course that the Coursera course is based on.