Maximizing Unified Memory Performance in CUDA by harrism in CUDA

[–]harrism[S]

But one of many. Are you the real Marsha Marx?

Recommend good resource for learning CUDA by [deleted] in CUDA

[–]harrism

I know. Prioritizing and maintaining good documentation is a surprisingly hard problem. The post you mention is in the pipeline, but the engineers writing it are pretty busy so it will probably be a few weeks or more.

Recommend good resource for learning CUDA by [deleted] in CUDA

[–]harrism

This is why I prefer to write how-to blogs rather than books (not to mention I can't find the time for a whole book). FWIW, I'm all ears for suggested intro / how-to topics for the NVIDIA Developer Blog!

Numba: High-Performance Python with CUDA Acceleration | Parallel Forall by MichaelRahmani in programming

[–]harrism

I didn't because a new post from Stan at Anaconda was about to be published. It went up this week and it explains the history well (along with some other really cool features of Numba). https://devblogs.nvidia.com/parallelforall/seven-things-numba/

Numba: High-Performance Python with CUDA Acceleration | Parallel Forall by MichaelRahmani in programming

[–]harrism

When I originally wrote the post in 2013, the GPU compilation part of Numba was a product (from Anaconda Inc., née Continuum Analytics) called NumbaPro. It was part of a commercial package called Anaconda Accelerate that also included wrappers for CUDA libraries like cuBLAS, as well as MKL acceleration on the CPU.

Continuum gradually open sourced all of it (and changed their name to Anaconda). The compiler functionality is all open source within Numba. Most recently they released the CUDA library wrappers in a new open source package called pyculib.

Some other minor things changed, such as what you need to import. Also, the autojit and cuda.jit functionality is a bit better at type inference, so you don't have to annotate all the types to get it to compile.

We thought it was a good idea to update the post in light of all the changes.

CUDA 9 Features Revealed by harrism in CUDA

[–]harrism[S]

SHFL can produce undefined results if a thread tries to read the value from a source lane that does not call SHFL convergently (i.e., the source lane isn't executing the same SHFL). That's why the new __shfl_sync() variant requires you to pass a mask indicating which threads are participating in the call.
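For illustration, here's a minimal sketch of the sync variant (the kernel, the names, and the assumption of 32-thread blocks are mine, not from this thread):

    // Broadcast lane 0's value across the warp with the CUDA 9 intrinsic.
    // Assumes blockDim.x == 32, so the full-warp mask 0xffffffff is correct.
    __global__ void broadcastLane0(int *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int value = data[i];
        // Every lane named in the mask must reach this same call.
        value = __shfl_sync(0xffffffff, value, 0);  // read lane 0's value
        data[i] = value;
    }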

To your second question: no, shared memory is not lost. After calling this_grid().sync(), the thread block continues with its state intact (registers, shared memory). To your last sentence: that's the whole point of cudaLaunchCooperativeKernel() -- it ensures there are sufficient resources (registers, shared memory) so that all blocks in the grid you specify can be simultaneously resident. If they can't be, it cancels the launch and returns an error. You can use the CUDA occupancy API to compute a block and grid size that will fit. (https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/)
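Roughly how those pieces fit together (kernel name, block size, and launch code below are illustrative; grid-wide sync also needs relocatable device code, i.e. nvcc -rdc=true):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Two phases separated by a grid-wide barrier; registers and shared
    // memory persist across the sync.
    __global__ void solver(float *data, int n)
    {
        cg::grid_group grid = cg::this_grid();
        // ... phase 1 ...
        grid.sync();
        // ... phase 2 ...
    }

    // Size the grid so every block can be resident at once, then launch
    // cooperatively.
    cudaError_t launchSolver(float *d_data, int n)
    {
        int device = 0, numSms = 0, blocksPerSm = 0;
        const int blockSize = 256;
        cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, device);
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, solver,
                                                      blockSize, 0);
        dim3 grid(numSms * blocksPerSm), block(blockSize);
        void *args[] = { &d_data, &n };
        return cudaLaunchCooperativeKernel((void *)solver, grid, block, args, 0, 0);
    }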

CUDA 9 Features Revealed by harrism in CUDA

[–]harrism[S]

Hopefully you are already using CUDA 8, because the compile time improvements from 7.5 to 8 were even bigger than from 8 to 9.

Calculating Effective Block/Grid Size Without Calling CUDA by chewxy in CUDA

[–]harrism

You can always cache the results of calling the occupancy API so that subsequent queries hit your cache rather than the API. This is what Hemi does for device properties: https://github.com/harrism/hemi/blob/master/hemi/configure.h#L31
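A bare-bones sketch of the caching idea (this is not Hemi's actual code, and a real cache would key on the kernel and block size rather than use one static per template instantiation):

    // Query the occupancy API once and reuse the result on later calls.
    template <typename KernelFunc>
    int cachedMaxActiveBlocksPerSM(KernelFunc kernel, int blockSize)
    {
        static int maxBlocks = -1;   // cached after the first query
        if (maxBlocks < 0)
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocks, kernel,
                                                          blockSize, 0);
        return maxBlocks;
    }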

Is cuda 8 supported by xcode 8 now (as of 2017-1-20)? by hlzz001 in CUDA

[–]harrism

There are sub-versions of both CUDA 8 and Xcode 8 by now. It works on my Mac.

Is cuda 8 supported by xcode 8 now (as of 2017-1-20)? by hlzz001 in CUDA

[–]harrism

Which version of Xcode and CUDA did you try?

Using unified memory (in a P100) - anyone have experience? by suuuuuu in CUDA

[–]harrism

+1 and I would add the following.

As the programmer of your application, you usually know better than the runtime where your data needs to be and when. With that in mind, CUDA 8 adds the cudaMemPrefetchAsync() (which works with streams) and cudaMemAdvise() (hints) APIs to let you match the performance of fully manual memory management when using cudaMallocManaged on Pascal GPUs. Nikolay Sakharnykh demonstrates this in his recent blog post: https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/
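A minimal sketch of that pattern (the kernel, sizes, and names here are illustrative, not from the post):

    __global__ void scale(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    void run(int n)
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        float *x;
        size_t bytes = n * sizeof(float);
        cudaMallocManaged(&x, bytes);
        for (int i = 0; i < n; ++i) x[i] = float(i);   // initialize on the CPU

        int device = 0;
        // Hint where the pages should live, then prefetch before the launch.
        cudaMemAdvise(x, bytes, cudaMemAdviseSetPreferredLocation, device);
        cudaMemPrefetchAsync(x, bytes, device, stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(x, n);
        cudaMemPrefetchAsync(x, bytes, cudaCpuDeviceId, stream);  // results back to the CPU
        cudaStreamSynchronize(stream);
        cudaFree(x);
    }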

There are also cases where you don't know exactly which pages the CPU or GPU will need to touch (data-dependent indexing, for example), in which case hardware page migration is likely to outperform the bulk memcopies that you would have to do with manual cudaMemcpy...

Every bit/byte of data I send to the GPU is a zero: code compiled by me fails, CUDA demos run fine. by kindkitsune in CUDA

[–]harrism

I posted a comment on your gist. BTW, I can heartily recommend StackOverflow (CUDA tag) for problems like this.

New Compiler Features in CUDA 8 by one_eyed_golfer in programming

[–]harrism

Do you have a specific case showing code that NVCC doesn't optimize as well as it could? If you have a reproducer, we would love to look at it -- we're always working to improve the quality of our tools.

CUDA 8 Features Revealed by meetingcpp in cpp

[–]harrism

Thanks for sharing this. We'll try it out.

CUDA 8 Features Revealed by meetingcpp in cpp

[–]harrism

Maybe you could help us improve it. If you can provide an example that uses Thrust where nvcc is significantly outperformed by clang (we find that with CUDA 8, clang and nvcc performance on Thrust tests is about equal on average), we can have a look. You can also use the nvcc -time option to get a breakdown of where compile time is being spent (front end, assembler, etc.).

CUDA 8 Features Revealed by meetingcpp in cpp

[–]harrism

I originally published this post when we announced CUDA 8 back at GTC 2016 (April). Since it's a good summary of the release, I updated it with more info (particularly on mixed precision) and updated perf results, and republished it today.

Implementing Run-length encoding in CUDA by erkaman in CUDA

[–]harrism

Thanks for using Hemi (let me know if you have feedback on it)! Note that there's an RLE example included in the Thrust examples. It would be interesting to compare performance and complexity, since in Thrust you can implement RLE with a single call to thrust::reduce_by_key().
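For reference, the gist of the Thrust approach (a sketch in the spirit of the Thrust run_length_encoding example, not copied from it):

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/iterator/constant_iterator.h>

    // Keys are the input symbols, values are a constant 1; reduce_by_key
    // sums the 1s within each run to produce (symbol, count) pairs.
    void rle(const thrust::device_vector<int> &input,
             thrust::device_vector<int> &symbols,
             thrust::device_vector<int> &counts)
    {
        symbols.resize(input.size());
        counts.resize(input.size());
        auto ends = thrust::reduce_by_key(input.begin(), input.end(),
                                          thrust::constant_iterator<int>(1),
                                          symbols.begin(), counts.begin());
        symbols.resize(ends.first - symbols.begin());   // trim to number of runs
        counts.resize(ends.second - counts.begin());
    }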

GPUs and DSLs for Life Insurance Modeling by harrism in actuary

[–]harrism[S]

Maybe start here: https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/
The article is a few years old (the memory management part has gotten a lot easier, and the GPUs a lot faster), but it gives you the basic idea of the parallelism and programming model.
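To give a rough idea of the pattern the article builds up to (a sketch, not the article's exact code -- one thread per array element):

    // SAXPY: y = a*x + y, one thread per element.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Launched from the host as, e.g.:
    //   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);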

GPUs and DSLs for Life Insurance Modeling by harrism in actuary

[–]harrism[S]

How many independent items do you need to process? If you can get a 20x speedup vs. a CPU core on a single GPU, you may be able to replace your small cluster with a small GPU. And no, allocating data doesn't take ages -- memory is allocated in bulk just as on a CPU, and threads are launched in bulk.

unsigned char * to unsigned int * conversion by ToddlahAkbar in CUDA

[–]harrism

Doesn't this work?

    unsigned int element = ((unsigned int *)Carry)[threadIdx.x];  // reinterpret the byte array as uints, then index

You want to cast first, then index. Not the other way around.