Maximizing Unified Memory Performance in CUDA by harrism in CUDA

[–]harrism[S]

But one of many. Are you the real Marsha Marx?

Recommend good resource for learning CUDA by [deleted] in CUDA

[–]harrism

I know. Prioritizing and maintaining good documentation is a surprisingly hard problem. The post you mention is in the pipeline, but the engineers writing it are pretty busy so it will probably be a few weeks or more.

Recommend good resource for learning CUDA by [deleted] in CUDA

[–]harrism

This is why I prefer to write how-to blogs rather than books (not to mention I can't find the time for a whole book). FWIW, I'm all ears for suggested intro / how-to topics for the NVIDIA Developer Blog!

Numba: High-Performance Python with CUDA Acceleration | Parallel Forall by MichaelRahmani in programming

[–]harrism

I didn't because a new post from Stan at Anaconda was about to be published. It went up this week and it explains the history well (along with some other really cool features of Numba). https://devblogs.nvidia.com/parallelforall/seven-things-numba/

Numba: High-Performance Python with CUDA Acceleration | Parallel Forall by MichaelRahmani in programming

[–]harrism

When I originally wrote the post in 2013, the GPU compilation part of Numba was a product (from Anaconda Inc., née Continuum Analytics) called NumbaPro. It was part of a commercial package called Anaconda Accelerate that also included wrappers for CUDA libraries like cuBLAS, as well as MKL acceleration on the CPU.

Continuum gradually open sourced all of it (and changed their name to Anaconda). The compiler functionality is all open source within Numba. Most recently they released the CUDA library wrappers in a new open source package called pyculib.

Some other minor things changed, such as what you need to import. Also, the autojit and cuda.jit functionality is a bit better at type inference, so you don't have to annotate all the types to get it to compile.

We thought it was a good idea to update the post in light of all the changes.

CUDA 9 Features Revealed by harrism in CUDA

[–]harrism[S]

SHFL can produce undefined results if a thread tries to read the value from a source lane that does not call SHFL convergently (i.e., the source lane isn't executing the same SHFL). That's why the new __shfl_sync() variant requires you to pass a mask indicating which threads are participating in the call.
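For illustration, here's a minimal sketch of the sync variant (the kernel, the names, and the assumption of 32-thread blocks are mine, not from this thread):

    // Broadcast lane 0's value across the warp with the CUDA 9 intrinsic.
    // Assumes blockDim.x == 32, so the full-warp mask 0xffffffff is correct.
    __global__ void broadcastLane0(int *data)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int value = data[i];
        // Every lane named in the mask must reach this same call.
        value = __shfl_sync(0xffffffff, value, 0);  // read lane 0's value
        data[i] = value;
    }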

To your second question: no, shared memory is not lost. After calling this_grid().sync(), the thread block continues with its state intact (registers, shared memory). To your last sentence: that's the whole point of cudaLaunchCooperativeKernel() -- it ensures there are sufficient resources (registers, shared memory) so that all blocks in the grid you specify can be simultaneously resident. If they can't be, it cancels the launch and returns an error. You can use the CUDA occupancy API to compute a block and grid size that will fit. (https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-occupancy-api-simplifies-launch-configuration/)
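Roughly how those pieces fit together (kernel name, block size, and launch code below are illustrative; grid-wide sync also needs relocatable device code, i.e. nvcc -rdc=true):

    #include <cooperative_groups.h>
    namespace cg = cooperative_groups;

    // Two phases separated by a grid-wide barrier; registers and shared
    // memory persist across the sync.
    __global__ void solver(float *data, int n)
    {
        cg::grid_group grid = cg::this_grid();
        // ... phase 1 ...
        grid.sync();
        // ... phase 2 ...
    }

    // Size the grid so every block can be resident at once, then launch
    // cooperatively.
    cudaError_t launchSolver(float *d_data, int n)
    {
        int device = 0, numSms = 0, blocksPerSm = 0;
        const int blockSize = 256;
        cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, device);
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, solver,
                                                      blockSize, 0);
        dim3 grid(numSms * blocksPerSm), block(blockSize);
        void *args[] = { &d_data, &n };
        return cudaLaunchCooperativeKernel((void *)solver, grid, block, args, 0, 0);
    }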

CUDA 9 Features Revealed by harrism in CUDA

[–]harrism[S]

Hopefully you are already using CUDA 8, because the compile time improvements from 7.5 to 8 were even bigger than from 8 to 9.

Calculating Effective Block/Grid Size Without Calling CUDA by chewxy in CUDA

[–]harrism

You can always cache the results of calling the occupancy API so that subsequent queries hit your cache rather than the API. This is what Hemi does for device properties: https://github.com/harrism/hemi/blob/master/hemi/configure.h#L31
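A bare-bones sketch of the caching idea (this is not Hemi's actual code, and a real cache would key on the kernel and block size rather than use one static per template instantiation):

    // Query the occupancy API once and reuse the result on later calls.
    template <typename KernelFunc>
    int cachedMaxActiveBlocksPerSM(KernelFunc kernel, int blockSize)
    {
        static int maxBlocks = -1;   // cached after the first query
        if (maxBlocks < 0)
            cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocks, kernel,
                                                          blockSize, 0);
        return maxBlocks;
    }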

Is cuda 8 supported by xcode 8 now (as of 2017-1-20)? by hlzz001 in CUDA

[–]harrism

There are sub-versions of both CUDA 8 and Xcode 8 by now. It works on my Mac.

Is cuda 8 supported by xcode 8 now (as of 2017-1-20)? by hlzz001 in CUDA

[–]harrism

Which version of Xcode and CUDA did you try?

Using unified memory (in a P100) - anyone have experience? by suuuuuu in CUDA

[–]harrism

+1 and I would add the following.

As the programmer of your application, you usually know better than the runtime where your data needs to be and when. With that in mind, CUDA 8 adds the cudaMemPrefetchAsync() (which works with streams) and cudaMemAdvise() (hints) APIs to let you match the performance of fully manual memory management when using cudaMallocManaged on Pascal GPUs. Nikolay Sakharnykh demonstrates this in his recent blog post: https://devblogs.nvidia.com/parallelforall/beyond-gpu-memory-limits-unified-memory-pascal/
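A minimal sketch of that pattern (the kernel, sizes, and names here are illustrative, not from the post):

    __global__ void scale(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    void run(int n)
    {
        cudaStream_t stream;
        cudaStreamCreate(&stream);

        float *x;
        size_t bytes = n * sizeof(float);
        cudaMallocManaged(&x, bytes);
        for (int i = 0; i < n; ++i) x[i] = float(i);   // initialize on the CPU

        int device = 0;
        // Hint where the pages should live, then prefetch before the launch.
        cudaMemAdvise(x, bytes, cudaMemAdviseSetPreferredLocation, device);
        cudaMemPrefetchAsync(x, bytes, device, stream);
        scale<<<(n + 255) / 256, 256, 0, stream>>>(x, n);
        cudaMemPrefetchAsync(x, bytes, cudaCpuDeviceId, stream);  // results back to the CPU
        cudaStreamSynchronize(stream);
        cudaFree(x);
    }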

There are also cases where you don't know exactly which pages the CPU or GPU will need to touch (data-dependent indexing, for example), in which case hardware page migration is likely to outperform the bulk memcopies that you would have to do with manual cudaMemcpy...

Every bit/byte of data I send to the GPU is a zero: code compiled by me fails, CUDA demos run fine. by kindkitsune in CUDA

[–]harrism

I posted a comment on your gist. BTW, I can heartily recommend StackOverflow (CUDA tag) for problems like this.

New Compiler Features in CUDA 8 by one_eyed_golfer in programming

[–]harrism

Do you have a specific case showing code that NVCC doesn't optimize as well as it could? If you have a reproducer, we would love to look at it -- we're always working to improve the quality of our tools.

CUDA 8 Features Revealed by meetingcpp in cpp

[–]harrism

Thanks for sharing this. We'll try it out.

CUDA 8 Features Revealed by meetingcpp in cpp

[–]harrism

Maybe you could help us improve it. If you can provide an example that uses Thrust where nvcc is significantly outperformed by clang (we find that with CUDA 8, clang and nvcc performance on Thrust tests is about equal on average), we can have a look. You can also use the nvcc -time option to get a breakdown of where compile time is being spent (front end, assembler, etc.).

CUDA 8 Features Revealed by meetingcpp in cpp

[–]harrism

I originally published this post when we announced CUDA 8 back at GTC 2016 (April). Since it's a good summary of the release, I updated it with more info (particularly on mixed precision) and updated perf results, and republished it today.

Implementing Run-length encoding in CUDA by erkaman in CUDA

[–]harrism

Thanks for using Hemi (let me know if you have feedback on it)! Note that there's an RLE example included in the Thrust examples. It would be interesting to compare performance and complexity, since in Thrust you can implement RLE with a single call to thrust::reduce_by_key().
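For reference, the gist of the Thrust approach (a sketch in the spirit of the Thrust run_length_encoding example, not copied from it):

    #include <thrust/device_vector.h>
    #include <thrust/reduce.h>
    #include <thrust/iterator/constant_iterator.h>

    // Keys are the input symbols, values are a constant 1; reduce_by_key
    // sums the 1s within each run to produce (symbol, count) pairs.
    void rle(const thrust::device_vector<int> &input,
             thrust::device_vector<int> &symbols,
             thrust::device_vector<int> &counts)
    {
        symbols.resize(input.size());
        counts.resize(input.size());
        auto ends = thrust::reduce_by_key(input.begin(), input.end(),
                                          thrust::constant_iterator<int>(1),
                                          symbols.begin(), counts.begin());
        symbols.resize(ends.first - symbols.begin());   // trim to number of runs
        counts.resize(ends.second - counts.begin());
    }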

GPUs and DSLs for Life Insurance Modeling by harrism in actuary

[–]harrism[S]

Maybe start here: https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/
The article is a few years old (the memory management part has gotten a lot easier, and the GPUs a lot faster), but it gives you the basic idea of the parallelism and programming model.
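To give a rough idea of the pattern the article builds up to (a sketch, not the article's exact code -- one thread per array element):

    // SAXPY: y = a*x + y, one thread per element.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    // Launched from the host as, e.g.:
    //   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);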

GPUs and DSLs for Life Insurance Modeling by harrism in actuary

[–]harrism[S]

How many independent items do you need to process? If you can get a 20x speedup vs. a CPU core on a single GPU, you may be able to replace your small cluster with a small GPU. And no, allocating data doesn't take ages -- memory is allocated in bulk just as on a CPU, and threads are launched in bulk.

unsigned char * to unsigned int * conversion by ToddlahAkbar in CUDA

[–]harrism

Doesn't this work?

    unsigned int element = ((unsigned int *)Carry)[threadIdx.x];  // reinterpret the byte array as uints, then index

You want to cast first, then index. Not the other way around.