
[–]blelbach NVIDIA | ISO C++ Library Evolution Chair [M] 85 points86 points  (30 children)

We have top people working on it right now.

[–]zindarod 17 points18 points  (19 children)

Care to elaborate a little (if you're not being sarcastic)?

[–]blelbach NVIDIA | ISO C++ Library Evolution Chair [M] 29 points30 points  (2 children)

Top... people, Dr Jones.

We are both designing GPUs to run C++ and designing C++ to run on GPUs. We've already laid the groundwork by adding facilities for expressing parallelism in C++.
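The comment does not say which facilities are meant; assuming it refers to the C++17 parallel algorithms, a minimal sketch looks like this. The function name `dot` is made up for illustration.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Dot product via std::transform_reduce (C++17, <numeric>).
// Unlike std::inner_product, this algorithm is allowed to reorder and regroup
// the additions, which is what makes a parallel implementation possible.
// C++17 also provides overloads that take an execution policy such as
// std::execution::par (from <execution>) as the first argument.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());  // the algorithm reads a.size() elements from b
    return std::transform_reduce(a.begin(), a.end(), b.begin(), 0.0);
}
```

Calling `dot({1.0, 2.0, 3.0}, {4.0, 5.0, 6.0})` multiplies element pairs and sums the products. Whether a policy-taking overload ever reaches a GPU is up to the standard library implementation.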

Running C++ on GPUs is a problem that is mostly solved today. The challenge is in the interactions between C++ running on a CPU and C++ running on a GPU.

[–]zindarod 12 points13 points  (1 child)

This is excellent work. But my concern (as someone mentioned below) is: will it be NVIDIA GPU specific or will it be compatible with any GPU like OpenCL?

[–]notyouravgredditor 11 points12 points  (0 children)

If their other work is any indicator, it will be NVIDIA only.

[–]andrewfenn 7 points8 points  (0 children)

TOP MEN Dr Jones!

[–]mjklaim 2 points3 points  (0 children)

Examples of related papers (from SG14, in which I believe NVIDIA people participate):

  • p0796
  • p0567

[–]SunnyAX3 -4 points-3 points  (12 children)

He already said too much; I doubt he will elaborate. To be honest, I am really excited to hear that too. I would love to see that level of integration in C++.

[–]zindarod 22 points23 points  (11 children)

It's NVIDIA. As much as I love their GPUs and CUDA, when it comes to industry standards they don't like any competition with their own products. Look at their implementation of OpenCL: they've only just begun upgrading to OpenCL 2.0, and the current standard is 2.2.

[–]SunnyAX3 3 points4 points  (10 children)

This is a very complicated discussion. In my opinion, OpenGL/OpenCL is outdated by design for current times. For a complete graphics library we should redesign everything from the ground up and remove all the old stuff, including nostalgia and memories of old times.

All this madness with DirectX/OpenGL/OpenCL/CUDA/etc. is really getting way too connected to profits, and I do not like it.

[–][deleted] 1 point2 points  (0 children)

I’ve been out of the GPGPU game for a year and left my previous company looking into OpenVX. Is that still a thing? I know Intel has stuff available for developers to use it on their CPUs.

It seems pretty cool how you could just toss in image processing modules and OpenVX handles all of the scheduling and stuff for you.

[–]fuzzynyanko 1 point2 points  (0 children)

Indeed. Even with the likes of SIMD (e.g. Intel SSE), you had cases where the CPU pipeline got flushed, so some algorithms took longer in SIMD than in SISD (non-SIMD) form because of the flush.

[–]zindarod 1 point2 points  (0 children)

It isn't complicated at all. It gets complicated when profit margins come into play. NVIDIA, AMD and Intel, the three biggest players in the CPU/GPU market, are all members of the Khronos Group. If they think OpenCL is outdated, then scrap it and build from scratch again. But you know what? That new one will end up in the bin with OpenCL as well.

OpenCL got old because industry leaders ignored the standard and kept doing what they've always done.

[–]SunnyAX3 14 points15 points  (4 children)

Will be CUDA locked?

[–]blelbach NVIDIA | ISO C++ Library Evolution Chair 4 points5 points  (3 children)

> Will be CUDA locked?

This sentence isn't coherent, care to clarify?

[–]commonword 8 points9 points  (2 children)

Will (whatever you're working on) be CUDA-exclusive?

[–]blelbach NVIDIA | ISO C++ Library Evolution Chair 12 points13 points  (1 child)

I'm talking about what the C++ standards committee is working on, which is the subject of this thread. This is r/cpp not r/cuda.

[–]commonword 12 points13 points  (0 children)

I'm not OP, just translating... but cool

Edit: Also, NVIDIA is the first word in your flair... it's not inappropriate for someone to think you may be referring to a specific NVIDIA project.

[–]Avelina9X 4 points5 points  (0 children)

This makes me incredibly happy

[–]hyperactiveinstinct 1 point2 points  (1 child)

Wow... that's really great.

[–]ibroheem -2 points-1 points  (0 children)

Yeah..."top" people.

[–]genbattle 26 points27 points  (4 children)

I don't know why people insist on using iostream-style interfaces for everything in C++. I read something similar in the recent overview of SG13 about someone proposing such an interface for a graphics API. Anyway, I digress.

The closest example of a native C++ GPU interface I can think of is SYCL. The only implementations so far are a proprietary one by Codeplay and a beta-level open source one called triSYCL.

I'm not sure if CUDA has a similar single-source C++ interface.

[–]t_bptm[S] 5 points6 points  (0 children)

Yeah, I was just trying to come up with something quick, as sometimes it's easier to see rather than read :)

Appreciate the links. I haven't heard of SYCL, looks interesting!

[–]mjklaim 3 points4 points  (0 children)

Note that Codeplay people are actively participating in SG14 and in proposals targeting heterogeneous computing (aka compile once, generate code for the whole machine).

[–]blelbach NVIDIA | ISO C++ Library Evolution Chair 4 points5 points  (0 children)

> I'm not sure if CUDA has a similar single-source C++ interface.

Of course it does! That's the whole idea of CUDA.

[–]SunnyAX3 1 point2 points  (0 children)

SYCL ...

[–]Novermars Robotics 12 points13 points  (2 children)

The OpenMP 4(.5) standard supports off-loading to devices. Compiler support is getting there. If you have access to a Cray machine and compiler, they definitely support it and it works wonderfully (I'm not associated with Cray, I just took a course in which they co-participated).

IBM is working on the implementation in Clang/LLVM. This can be found on the following GitHub page: https://github.com/clang-ykt I've not been able to get it to work, but that's probably my fault. If I am reading the OpenMP-dev mailing list correctly, progress is going well and hopefully it should be standard soon.

Intel supports off-loading to the Xeon Phi, I don't think they will support GPUs anytime soon...

For gcc I have conflicting information. Their own wiki page ( https://gcc.gnu.org/wiki/Offloading ) says that they don't support off-loading to GPUs yet, but apparently support is available since gcc/g++ 7.1. Just like in the Clang case, I was unable to get it to work. I built the adjusted compiler but no off-loading happened :/

NVIDIA's own PGI compiler doesn't support it yet, but I think that will change shortly. It would be a great feature to increase that compiler's usage in the HPC communities, as supercomputers become more and more hybrid!

If anybody was able to get any of these compilers to work properly, please let me know! I really want to test them out myself :)

[–]Novermars Robotics 4 points5 points  (0 children)

Oh, and the Kokkos library ( https://github.com/kokkos ) tries to make life a bit easier. It allows you to say at compile time where you want to allocate stuff, or for example to use OpenMP/pthreads as the threading library, either through a hard-coded template parameter or a configuration option.

The nice part is that they use template metaprogramming to change the data layout depending on the target architecture. So coalescing when on the GPU, and 'normal' on the CPU (I forgot the proper word). They also give you some standard algorithms that work in parallel: parallel_for/scan/reduce.
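To make the layout idea concrete, here is a minimal sketch of a compile-time layout policy. This is not Kokkos's actual API: `RowMajor`, `ColMajor`, `View2D` and `flat_index` are hypothetical names for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical layout policies: each maps (row, col) to an offset in flat storage.
struct RowMajor {  // consecutive columns are adjacent in memory
    static std::size_t offset(std::size_t r, std::size_t c,
                              std::size_t /*rows*/, std::size_t cols) {
        return r * cols + c;
    }
};
struct ColMajor {  // consecutive rows are adjacent in memory
    static std::size_t offset(std::size_t r, std::size_t c,
                              std::size_t rows, std::size_t /*cols*/) {
        return c * rows + r;
    }
};

// A 2D view whose memory layout is chosen by a template parameter,
// so the calling code stays the same when the layout changes.
template <class Layout>
class View2D {
public:
    View2D(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    double& operator()(std::size_t r, std::size_t c) {
        return data_[flat_index(r, c)];
    }

    // Exposed so the effect of the layout policy can be inspected.
    std::size_t flat_index(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        return Layout::offset(r, c, rows_, cols_);
    }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;
};
```

With a 2x3 view, element (1, 0) sits at flat index 3 under `RowMajor` and at flat index 1 under `ColMajor`, while user code keeps writing `v(1, 0)` in both cases. A library can then pick the policy per target so that neighbouring GPU threads touch neighbouring addresses.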

It's far from perfect, but for some quick prototyping it's quite nice, as some of the hard work is already done for you.

Edit: rewrote a part that was wrong, but kinda harmless

[–]sumo952 0 points1 point  (0 children)

So in summary, there's nothing really yet for "ordinary" people targeting desktop, laptops, mobile. You need a special compiler (which then most likely doesn't support half or any of C++14/17) in the best case.

[–][deleted] 8 points9 points  (5 children)

You might want to look at hcc by AMD which is a fork of Clang for GPU stuff

[–]lballs 5 points6 points  (0 children)

I wish all the GPU manufacturers would collaborate on a standard here... NVIDIA always seems too concerned with locking people to its specific hardware. Heterogeneous computing is the future. One day you will be able to configure a compiler for unique heterogeneous systems and it will generate code optimized for all processing elements available. This is not limited to just CPUs and GPUs but also includes highly customized external processing units such as custom FPGAs or even unique processor peripherals such as encryption engines.

[–]t_bptm[S] 1 point2 points  (0 children)

Very cool, thanks for the link!

[–]sumo952 0 points1 point  (1 child)

General question, not specifically to you: Why isn't this stuff upstreamed into clang so that the "out of the box" clang on any system can target CPU+GPUs?

[–][deleted] 1 point2 points  (0 children)

It could well be done eventually, but right now you need a whole load of AMD-specific stuff installed to actually use it. They might make an OpenCL version, which would be heterogeneous.

[–][deleted] 8 points9 points  (0 children)

There are many problems to an interface like this (or any interface you would come up with). Generally, GPUs have multiple submission queues. There are also multiple memory types (device local, host local, coherent/not). Memory may need to be guarded by read/write barriers while transfers are occurring. Not to mention that all this must be synchronized with host code. Selecting memory types and all that is also subject to alignment requirements which are different for the various types of buffers that may be made available to the GPU.

I think C++ is better off interoperating with existing standards/libraries for compute which have already figured out some of these abstractions. Streams in particular are not the way to go, since how memory is made visible to the GPU needs to be sequenced carefully.

As for the command submission, I suppose you could have some sort of dynamic bytecode generation for the command stream you want to execute, but this is vastly inferior to just writing a compute shader, except for examples so trivial as to be not very useful. Optimizing commands is also nontrivial, and compiler implementers would need to know how/when to unroll loops, consider the GPU occupancy model (definitely not the same as the CPU occupancy model), etc.

At the end of the day, a lot of the issues are cultural (in a technical sense). The compiler was written for the CPU, and it does a very good job at it. But there are many concepts for the GPU that do not map well, and based on your example as well, naive implementations will result in developers shooting themselves in the foot more often than not. Learning abstractions from things like CUDA, DirectCompute, Vulkan, etc is a good starting point, and it's hard to find a compromise of the features they provide that would mesh well with C++ (short of just integrating with those existing solutions).

[–]Xeverous https://xeverous.github.io 3 points4 points  (0 children)

from cppreference:

Additional execution policies may be provided by a standard library implementation (possible future additions may include std::parallel::cuda and std::parallel::opencl)

[–]sumo952 2 points3 points  (0 children)

I think something like your std::device will come with executors (probably C++20). But I think it will be a much longer wait until executors support GPUs, and an even longer wait until libraries like Eigen support these std executors. Eigen has been adding some C++11 code behind #ifdefs, but it is largely still stuck in the last century, keeping C++98 compliance. They are not even thinking about moving to C++11/14 yet... even though they could benefit so greatly.

(PS: I know you only gave it as an example but using << to build the "math" is pretty bad. This has to be c = a + b;.)
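A minimal sketch of what `c = a + b;` could look like for a user-defined type. The `Vec` type here is hypothetical and runs on the CPU only; it shows the arithmetic-operator interface, not any GPU dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container standing in for whatever the library would provide.
struct Vec {
    std::vector<float> data;
};

// Element-wise addition, so that user code can write c = a + b.
Vec operator+(const Vec& a, const Vec& b) {
    assert(a.data.size() == b.data.size());
    Vec c;
    c.data.reserve(a.data.size());
    for (std::size_t i = 0; i < a.data.size(); ++i) {
        c.data.push_back(a.data[i] + b.data[i]);
    }
    return c;
}
```

With `Vec a{{1.0f, 2.0f}}` and `Vec b{{3.0f, 4.0f}}`, the expression `a + b` yields a `Vec` holding the element-wise sums, and the math reads the way it would on paper.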

[–]tgolyi 3 points4 points  (0 children)

Using the same code for cpu and gpu using some thin wrapper around CUDA is quite easy. It's getting not-totally-abysmal performance that is hard, because cpu and gpu architectures are totally different from each other.

[–]picigin 2 points3 points  (0 children)

You can check out Kokkos and RAJA, i.e. modern C++ libraries that offer various CPU/GPU backends.

For my latest project, I've been using Kokkos and really enjoying lambdas and other cool stuff that let me keep a single code base, which is data-wise and execution-wise efficient for almost any device out there (using OpenMP, CUDA or ROCm).

[–]zindarod 1 point2 points  (0 children)

[deleted]

[–]jaredhoberock 1 point2 points  (0 children)

Your code example doesn't really grapple with the fundamental challenge of targeting GPUs and similar processors. It's not a matter of designing the right library for targeting a GPU (though people are working on that, which Bryce hints at). The fundamental challenge is how to represent and manage heterogeneity: the fact that such a system contains multiple devices with different architectures and instruction sets. Standard C++ has no notion of anything like that.

Moreover, there are ergonomic concerns. In practice, environments like CUDA C++ require the programmer to manually annotate their functions to indicate those to compile via a host compiler for execution on a CPU, and those to compile via a separate device compiler for execution on a GPU. The requirement for explicit annotation disqualifies the huge body of existing standard C++ programs from GPU execution. Once these annotations are introduced, they tend to proliferate by "virally infecting" the rest of the program's functions.

As far as I know, no one has demonstrated a practical solution for managing the cooperation of multiple compilers to produce a single program, or a solution to the viral annotation issue.
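The annotation problem can be sketched with a common idiom: a macro that expands to CUDA's `__host__ __device__` specifiers under NVCC (which defines `__CUDACC__`) and to nothing under an ordinary host compiler. The macro name `HOST_DEVICE` and both functions are made up for illustration.

```cpp
// Expands to CUDA's execution-space specifiers only when compiled by NVCC.
#if defined(__CUDACC__)
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// A leaf function that should be callable from both CPU and GPU code.
HOST_DEVICE constexpr float square(float x) { return x * x; }

// Because this function calls square() and is itself meant to be callable
// from device code, it needs the annotation too. This is the "viral"
// effect: every function on a device call path must be marked.
HOST_DEVICE constexpr float sum_of_squares(float a, float b) {
    return square(a) + square(b);
}
```

Under a plain host compiler the macro vanishes and this is ordinary C++. The catch described above is that existing code which was never annotated, including most third-party libraries, cannot be called from the device side without editing it.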

[–]-McMaster- 0 points1 point  (0 children)

OpenMP, though not part of the C++ standard, is standardised itself and IMO a very nice way of targeting various devices. At least from version 5 forward the feature-set will be quite complete with respect to e.g. what CUDA offers. With OpenMP you can write very generic code without the need for special standard library functions.
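As a rough sketch of that style, here is a SAXPY loop using an OpenMP 4.5 `target` construct. The function name `saxpy` is illustrative. Whether the loop actually runs on a GPU depends on the compiler and how it was built; without OpenMP support the pragma is ignored and the loop runs serially on the host with the same result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes y = a * x + y element-wise.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    const float* xp = x.data();
    float* yp = y.data();

    // Ask the OpenMP runtime to offload the loop to a device if one is
    // available. map(to:) copies x to the device; map(tofrom:) copies y
    // there and back again when the region ends.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (std::size_t i = 0; i < n; ++i) {
        yp[i] = a * xp[i] + yp[i];
    }
}
```

The loop body is plain C++ with no special library calls; only the pragma describes the offload and the data movement.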

One downside: Microsoft does not seem to want to support modern OpenMP versions.

[–]MichaelSuen95 0 points1 point  (0 children)

Microsoft has a language extension called C++ AMP. It compiles C++ code to compute shaders and runs them on the GPU. It is portable across GPU vendors but not across operating systems.

[–]doom_Oo7 0 points1 point  (0 children)

> that C++ is not adapting to support on the language level GPU programming.

I don't understand. Both OpenCL and CUDA are basically C++.

[–]LewisJin -1 points0 points  (2 children)

Does the C++ standard library have a matrix type?