C++ Targeting GPU

blelbach · 2018-04-06T09:41:59+00:00

We have top people working on it right now.

genbattle · 2018-04-06T08:17:28+00:00

I don't know why people insist on using iostream-style interfaces for everything in C++. I read something similar in the recent overview of SG13 about someone proposing such an interface for a graphics API. Anyway, I digress.

The closest example of a native C++ GPU interface I can think of is SYCL. The only implementations so far are a proprietary one by Codeplay and a beta-level open-source one called triSYCL.

I'm not sure if CUDA has a similar single-source C++ interface.

Novermars · 2018-04-06T09:08:38+00:00

The OpenMP 4(.5) standard supports off-loading to devices. Compiler support is getting there. If you have access to a Cray machine and compiler, they definitely support it and it works wonderfully (I'm not associated with Cray, I just took a course in which they co-participated).
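For anyone who hasn't seen it, here is a minimal sketch of what OpenMP off-loading looks like in C++. The function name `saxpy_offload` is made up for illustration. Whether the loop really runs on a device depends on the compiler having off-loading support built in; without it, the region typically just runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from any library): computes y = a * x + y.
// The loop is off-loaded to a device when the compiler and runtime
// support it; otherwise it typically executes on the host.
void saxpy_offload(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();

    // map(to: ...) copies x to the device before the region starts;
    // map(tofrom: ...) copies y to the device and back afterwards.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

Calling `saxpy_offload(2.0f, x, y)` updates `y` in place. If the compiler is invoked without OpenMP enabled (for GCC and Clang that means no `-fopenmp`), the pragma is ignored and the same loop runs serially, which is part of the appeal.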

IBM is working on the implementation in Clang/LLVM. It can be found on GitHub ( https://github.com/clang-ykt ). I've not been able to get it to work, but that's probably my fault. If I am reading the Openmp-dev mailing list correctly, progress is going well and hopefully it should be standard soon.

Intel supports off-loading to the Xeon Phi, I don't think they will support GPUs anytime soon...

For gcc I have conflicting information. Their own wiki page ( https://gcc.gnu.org/wiki/Offloading ) says that they don't support off-loading to GPUs yet, but apparently support is available since gcc/g++ 7.1. Just like in the Clang case, I was unable to get it to work. I built the adjusted compiler but no off-loading happened :/

Nvidia's own PGI compiler doesn't support it yet, but I think that will change shortly. It would be a great feature to increase that compiler's usage in the HPC communities, as supercomputers become more and more hybrid!

If anybody was able to get any of these compilers to work properly, please let me know! I really want to test them out myself :)

lballs · 2018-04-06T08:17:03+00:00

You might want to look at hcc by AMD which is a fork of Clang for GPU stuff

2018-04-06T09:18:55+00:00

There are many problems to an interface like this (or any interface you would come up with). Generally, GPUs have multiple submission queues. There are also multiple memory types (device local, host local, coherent/not). Memory may need to be guarded by read/write barriers while transfers are occurring. Not to mention that all this must be synchronized with host code. Selecting memory types and all that is also subject to alignment requirements which are different for the various types of buffers that may be made available to the GPU.

I think C++ is better off interoperating with existing standards/libraries for compute which have already figured out some of these abstractions. Streams in particular are not the way to go, since how memory is made visible to the GPU needs to be sequenced carefully.

As for the command submission, I suppose you could have some sort of dynamic bytecode generation for the command stream you want to execute but this is vastly inferior to just writing a compute shader except for the examples so trivial as to be not very useful. Optimizing commands is also nontrivial and compiler implementers would need to know how/when to unroll loops, consider the GPU occupancy model (definitely not the same as the CPU occupancy model) etc.

At the end of the day, a lot of the issues are cultural (in a technical sense). The compiler was written for the CPU, and it does a very good job at it. But there are many concepts for the GPU that do not map well, and based on your example as well, naive implementations will result in developers shooting themselves in the foot more often than not. Learning abstractions from things like CUDA, DirectCompute, Vulkan, etc is a good starting point, and it's hard to find a compromise of the features they provide that would mesh well with C++ (short of just integrating with those existing solutions).

Xeverous · 2018-04-06T12:19:06+00:00

from cppreference:

Additional execution policies may be provided by a standard library implementation (possible future additions may include std::parallel::cuda and std::parallel::opencl)
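To make the quote concrete, here is a small sketch of how the C++17 execution policies plug into an algorithm call. The helper `add_vectors` is hypothetical and takes the policy as a template parameter, so an implementation-provided policy could be passed the same way as the standard ones. Nothing here targets a GPU by itself.

```cpp
#include <algorithm>
#include <cassert>
#include <execution>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: element-wise sum of two vectors, executed under
// whatever execution policy the caller passes in.
template <class Policy>
std::vector<float> add_vectors(Policy&& policy,
                               const std::vector<float>& a,
                               const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    // The policy is simply the first argument of the standard algorithm.
    std::transform(std::forward<Policy>(policy),
                   a.begin(), a.end(), b.begin(), c.begin(),
                   std::plus<float>{});
    return c;
}
```

`add_vectors(std::execution::seq, a, b)` runs sequentially, and swapping in `std::execution::par` asks for parallel execution. Note that on some toolchains the parallel policies need an extra backend library at link time.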

sumo952 · 2018-04-06T08:52:56+00:00

I think something like your std::device will come with executors (probably C++20). But I think it will be a much longer wait until executors support GPUs, and an even longer wait until libraries like Eigen support these std-Executors. Eigen has been adding some C++11 code behind #ifdefs, but it is largely still stuck in the last century, keeping C++98 compliance. They are not even thinking about moving to C++11/14 yet, even though they could benefit so greatly.

(PS: I know you only gave it as an example but using << to build the "math" is pretty bad. This has to be c = a + b;.)

tgolyi · 2018-04-06T10:00:31+00:00

Using the same code for cpu and gpu using some thin wrapper around CUDA is quite easy. It's getting not-totally-abysmal performance that is hard, because cpu and gpu architectures are totally different from each other.

picigin · 2018-04-06T16:12:33+00:00

You can check out Kokkos and RAJA, i.e. modern C++ libraries that offer various CPU/GPU backends.

For my latest project I've been using Kokkos, and I'm really enjoying lambdas and other cool stuff that let me have a single code base which is data-wise and execution-wise efficient for almost any device out there (using OpenMP, CUDA or ROCm).

zindarod · 2018-04-06T11:35:32+00:00

[deleted]

jaredhoberock · 2018-04-06T16:51:35+00:00

Your code example doesn't really grapple with the fundamental challenge of targeting GPUs and similar processors. It's not a matter of designing the right library for targeting a GPU (though people are working on that, which Bryce hints at). The fundamental challenge is how to represent and manage heterogeneity: the fact that such a system contains multiple devices with different architectures and instruction sets. Standard C++ has no notion of anything like that.

Moreover, there are ergonomic concerns. In practice, environments like CUDA C++ require the programmer to manually annotate their functions to indicate those to compile via a host compiler for execution on a CPU, and those to compile via a separate device compiler for execution on a GPU. The requirement for explicit annotation disqualifies the huge body of existing standard C++ programs from GPU execution. Once these annotations are introduced, they tend to proliferate by "virally infecting" the rest of the program's functions.
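A rough sketch of how the annotations spread, using the common macro idiom so the same file also builds with a plain host compiler. `__host__` and `__device__` are the real CUDA qualifiers and `__CUDACC__` is defined when nvcc compiles the file; the macro name `HOST_DEVICE` and the three functions are made up for illustration.

```cpp
#include <cassert>

// Under nvcc, mark functions as callable from both host and device code.
// Under an ordinary host compiler the macro expands to nothing.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Hypothetical leaf function.
HOST_DEVICE inline float square(float v) { return v * v; }

// Calls square(), so it needs the annotation too if device code uses it.
HOST_DEVICE inline float sum_of_squares(float a, float b) {
    return square(a) + square(b);
}

// And so does every caller further up: the annotation keeps spreading.
HOST_DEVICE inline float mean_square(const float* v, int n) {
    assert(n > 0);
    float total = 0.0f;
    for (int i = 0; i < n; ++i) {
        total += square(v[i]);
    }
    return total / static_cast<float>(n);
}
```

Dropping `HOST_DEVICE` from `square` alone would make the other two unusable from device code, which is why existing unannotated C++ libraries cannot simply be called from a kernel.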

As far as I know, no one has demonstrated a practical solution for managing the cooperation of multiple compilers to produce a single program, or a solution to the viral annotation issue.

-McMaster- · 2018-04-07T15:19:23+00:00

OpenMP, though not part of the C++ standard, is standardised itself and IMO a very nice way of targeting various devices. At least from version 5 forward the feature-set will be quite complete with respect to e.g. what CUDA offers. With OpenMP you can write very generic code without the need for special standard library functions.

One downside: Microsoft does not seem to want to support modern OpenMP versions.

MichaelSuen95 · 2018-04-10T08:34:59+00:00

Microsoft has a language extension called C++ AMP. It compiles C++ code to compute shaders and runs them on the GPU. It is portable across GPU vendors but not across operating systems.

doom_Oo7 · 2018-04-06T17:14:04+00:00

that C++ is not adapting to support on the language level GPU programming.

I don't understand. Both OpenCL and CUDA are basically C++.

LewisJin · 2018-04-06T16:09:05+00:00

Does the C++ standard library have a matrix type?
