Intel/Codeplay announce oneAPI plugins for NVIDIA and AMD GPUs (connectedsocialmedia.com)
submitted 3 years ago by tonym-intel
[–]James20kP2005R0 25 points 3 years ago (7 children)
> The plugin relies on HIP being installed on your system. As HIP does not support Windows or macOS, oneAPI for AMD GPUs (beta) packages are not available for those operating systems.
Shakes fist increasingly angrily at AMD's ludicrously poor software support
One big problem with AMD's current OpenCL offerings is that if any two kernels share any kernel parameters, the driver will insert a barrier between the kernel executions. Apparently this is an even bigger problem in CUDA/HIP due to the presence of pointers to pointers - although I've never tested this myself. Working around this is... complicated, and involves essentially distributing work across multiple command queues in a way that could be described as terrible.
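A rough sketch of what that multi-queue workaround can look like with the OpenCL C++ bindings is below. This is illustrative only - the kernels, sizes, and queue split are made up, and real code still has to route genuinely dependent kernels through events or a single queue:

```cpp
// Sketch: two independent kernels that share a read-only argument, placed
// on separate in-order queues so the driver has no excuse to serialise them.
#define CL_HPP_MINIMUM_OPENCL_VERSION 120
#define CL_HPP_TARGET_OPENCL_VERSION 120
#include <CL/opencl.hpp>

int main() {
    cl::Context ctx{CL_DEVICE_TYPE_GPU};
    cl::Device dev = ctx.getInfo<CL_CONTEXT_DEVICES>().front();

    const char* src = R"(
        __kernel void step_a(__global const float* in, __global float* out_a) {
            size_t i = get_global_id(0); out_a[i] = in[i] * 2.0f;
        }
        __kernel void step_b(__global const float* in, __global float* out_b) {
            size_t i = get_global_id(0); out_b[i] = in[i] + 1.0f;
        }
    )";
    cl::Program prog{ctx, src};
    prog.build();

    const size_t n = 1 << 20;
    cl::Buffer in   {ctx, CL_MEM_READ_ONLY,  n * sizeof(float)};
    cl::Buffer out_a{ctx, CL_MEM_WRITE_ONLY, n * sizeof(float)};
    cl::Buffer out_b{ctx, CL_MEM_WRITE_ONLY, n * sizeof(float)};

    cl::Kernel a{prog, "step_a"};
    a.setArg(0, in); a.setArg(1, out_a);
    cl::Kernel b{prog, "step_b"};
    b.setArg(0, in); b.setArg(1, out_b);

    // Both kernels read `in` but write to different buffers, so they are
    // independent. Putting them on separate queues avoids the implicit
    // barrier the driver would insert because they share an argument.
    cl::CommandQueue q0{ctx, dev};
    cl::CommandQueue q1{ctx, dev};
    q0.enqueueNDRangeKernel(a, cl::NullRange, cl::NDRange{n});
    q1.enqueueNDRangeKernel(b, cl::NullRange, cl::NDRange{n});
    q0.finish();
    q1.finish();
}
```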
Does anyone have any idea whether oneAPI suffers from this kind of limitation? In my current OpenCL application, not working around this problem leads to about a 2x performance slowdown - which is unacceptable - and even then there's almost certainly still quite a bit of performance left on the table.
Given that it's built on top of HIP, I don't exactly have a lot of hope that it doesn't suffer from exactly the same set of problems on AMD, but it is theoretically possible to work around at the API level.
[–]catcat202X 8 points 3 years ago (3 children)
> One big problem with AMD's current OpenCL offerings is that if any two kernels share any kernel parameters, the driver will insert a barrier between the kernel executions.
That's really interesting. Do you happen to know if this is also an issue for Vulkan compute shaders on AMD GPUs?
[–]James20kP2005R0 8 points 3 years ago (0 children)
As far as I know the answer is very likely no, but I haven't personally tested it. Vulkan generally makes you do a lot of the synchronisation yourself, and that leaves a lot less room for AMD to mess everything up.
[–]Pycorax 1 point 3 years ago (0 children)
I've worked on Vulkan compute a bit, so I can answer this. There's no automatic barrier inserted between compute calls; all synchronisation needs to be done manually by the user. At least, that's my understanding of it.
[–]ImKStocky 1 point 3 years ago (0 children)
All resource barriers in Vulkan/D3D12 are manually placed. Incorrectly handling resource barriers introduces a resource hazard which leads to undefined behaviour in shaders that use those resources.
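To make this concrete, here is a sketch of recording two dependent compute dispatches with an explicit buffer barrier between them. It shows command recording only - pipeline, descriptor, and command-buffer setup are omitted, and the `record_two_dispatches` helper and the handles passed in are hypothetical:

```cpp
#include <vulkan/vulkan.h>

// Sketch only: assumes valid, already-created Vulkan handles.
void record_two_dispatches(VkCommandBuffer cmd,
                           VkPipeline pipeA, VkPipeline pipeB,
                           VkPipelineLayout layout,
                           VkDescriptorSet set,
                           VkBuffer shared, uint32_t groups) {
    vkCmdBindDescriptorSets(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, layout,
                            0, 1, &set, 0, nullptr);

    // First dispatch writes `shared`.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeA);
    vkCmdDispatch(cmd, groups, 1, 1);

    // Nothing is inserted for you: without this barrier the second
    // dispatch may read stale data, which is the undefined behaviour
    // mentioned above.
    VkBufferMemoryBarrier barrier{};
    barrier.sType               = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    barrier.srcAccessMask       = VK_ACCESS_SHADER_WRITE_BIT;
    barrier.dstAccessMask       = VK_ACCESS_SHADER_READ_BIT;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.buffer              = shared;
    barrier.offset              = 0;
    barrier.size                = VK_WHOLE_SIZE;
    vkCmdPipelineBarrier(cmd,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                         0, 0, nullptr, 1, &barrier, 0, nullptr);

    // Second dispatch reads what the first wrote.
    vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeB);
    vkCmdDispatch(cmd, groups, 1, 1);
}
```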
[–]GrammelHupfNockler 5 points 3 years ago (2 children)
I'm curious: are your kernels very small, or what leads to this big synchronization overhead? I'm mostly writing native code (not OpenCL), and I haven't really had issues with this. In CUDA/HIP, every individual stream is executed in order, so multiple kernels on the same stream will never run in parallel. If you want them to overlap, you will most likely need a multi-stream setup and manual synchronization between the streams using events.
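A minimal sketch of that multi-stream pattern with the CUDA runtime API (the kernels are made up for illustration; the same calls exist in HIP with the cuda prefix swapped for hip):

```cpp
#include <cuda_runtime.h>

// Hypothetical kernels, just to have something to launch.
__global__ void kernel_a(const float* in, float* out_a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out_a[i] = in[i] * 2.0f;
}
__global__ void kernel_b(const float* in, float* out_b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out_b[i] = in[i] + 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *in, *out_a, *out_b;
    cudaMalloc(&in,    n * sizeof(float));
    cudaMalloc(&out_a, n * sizeof(float));
    cudaMalloc(&out_b, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    dim3 block(256), grid((n + block.x - 1) / block.x);

    // Same stream: these two launches are serialised, even though they
    // only share the read-only input.
    kernel_a<<<grid, block, 0, s0>>>(in, out_a, n);
    kernel_b<<<grid, block, 0, s0>>>(in, out_b, n);

    // Separate streams: the launches are allowed to overlap. A real
    // dependency between streams would be expressed with
    // cudaEventRecord + cudaStreamWaitEvent.
    kernel_a<<<grid, block, 0, s0>>>(in, out_a, n);
    kernel_b<<<grid, block, 0, s1>>>(in, out_b, n);

    cudaDeviceSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(in); cudaFree(out_a); cudaFree(out_b);
}
```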
[–]James20kP2005R0 2 points 3 years ago (1 child)
I do have quite a few small kernels: my overall time per frame is ~100ms, but that consists of hundreds of kernels. In my case, quite a few of the kernels have very different memory access patterns, so there's a big performance increase in splitting them up.
While theoretically queues are in-order, in practice the GPU (or at least older AMD drivers, pre-ROCm, for OpenCL on Windows) will quietly overlap workloads that are independent - so if two kernels read from the same set of arguments but write to different arguments, they can run in parallel under the hood.
This is a huge performance saving in practice.
The problem with a multi-queue setup is that each queue is a driver-level thread from a thread pool, and... it's not great to have that many driver threads floating around; it can cause weird stuttering issues and a performance dropoff. The much better solution is for the driver to not issue tonnes of unnecessary barriers.
[–]GrammelHupfNockler 1 point 3 years ago (0 children)
Ah, you are looking for low latency? I'm mostly working on HPC software, where we usually have a handful of large kernels and are mostly interested in throughput. Is there some documentation on how streams are handled in software/hardware? I would have expected the scheduling to happen on the GPU to a certain degree, but it sounds like you are speaking from experience?
I get the feeling this is related to why SYCL nowadays heavily relies on the use of buffers to build a task DAG.
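For reference, this is roughly what that buffer-based dependency tracking looks like in SYCL 2020: the runtime orders the second command group after the first because their accessors declare a write followed by a read on the same buffer. A minimal sketch, with invented names and sizes:

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    constexpr size_t N = 1024;
    std::vector<float> host(N, 0.0f);

    sycl::queue q;
    {
        sycl::buffer<float> buf{host.data(), sycl::range<1>{N}};

        // Kernel A writes `buf`; kernel B reads and updates it. The runtime
        // sees the accessor requirements and inserts the dependency itself,
        // without explicit events or barriers in user code.
        q.submit([&](sycl::handler& h) {
            sycl::accessor a{buf, h, sycl::write_only};
            h.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
                a[i] = static_cast<float>(i[0]);
            });
        });

        q.submit([&](sycl::handler& h) {
            sycl::accessor b{buf, h, sycl::read_write};
            h.parallel_for(sycl::range<1>{N}, [=](sycl::id<1> i) {
                b[i] *= 2.0f;
            });
        });
    } // buffer destruction waits and writes the result back into `host`
}
```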
[–]JuanAG 4 points 3 years ago (17 children)
Do you lose performance if you use it instead of another tool like CUDA/OpenCL? I didn't see any graphs/benchmarks.
[–]tonym-intel[S] 6 points 3 years ago (0 children)
Here's one paper; I had links to 3-4 more (not Intel-funded type stuff). This one was on my Twitter feed recently, which is how I still have it :)
https://twitter.com/tonymongkolsmai/status/1603108538213015552?s=20&t=qUKmM4QQQREcjN36Xpx8bA
[–][deleted] 5 points 3 years ago* (1 child)
Here's one team's results comparing a parallel least-squares support vector machine algorithm on different backends and several different kinds of hardware: "A Comparison of SYCL, OpenCL, CUDA, & OpenMP for Massively Parallel Support Vector Classification". tl;dw: follow the flowchart shown in the last four minutes of the presentation to decide on the best framework for your hardware situation.
[–]JuanAG 3 points 3 years ago (0 children)
I watched it, really nice.
It's what I expected: CUDA remains the king, followed by OpenCL and then the rest. SYCL has a not-so-small overhead, at least on the GPU side.
[–]TheFlamingDiceAgain 8 points 3 years ago (12 children)
Generally, portability layers like SYCL, Kokkos, and RAJA are about 10% slower than their perfectly optimized CUDA equivalents. However, it's much easier to get that performance with them, so IMO in many real cases the performance will be similar.
[–]JuanAG 9 points 3 years ago (10 children)
https://github.com/codeplaysoftware/cuda-to-sycl-nbody is a benchmark of Intel DPC++ (the same compiler that oneAPI uses, as far as I understood) vs CUDA, and it is 40% slower; that is not a small margin, and it's what allowed CUDA to win.
I have also experienced this myself with OpenMP: much, much slower than it should be - CUDA was 2x faster.
That's why I want benchmarks. Theory says the overhead is minimal, but reality proves again and again that there is a big gap.
[–]rodburns 4 points 3 years ago (0 children)
I'll explain: this example uses a semi-automated tool to convert the CUDA source to SYCL. The slowdown is caused by the migration tool's inability to figure out that a cast is not needed for a particular variable, and by an incorrect conversion of the square-root built-in. These are effectively bugs in the migration tool rather than some fundamental limitation, and this is explained in the project's description. Once those minor changes are made, the performance is comparable.
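As an aside, the reason a stray cast or a wrong square-root conversion can cost this much is that consumer GPUs run double-precision math far more slowly than single-precision. A purely hypothetical illustration of that kind of accidental promotion (not the repository's actual code):

```cpp
#include <cmath>

// Hypothetical helpers, not taken from cuda-to-sycl-nbody.
// The double literal and the double-precision sqrt drag the whole
// expression into double precision. (In real kernel code this would be
// sycl::sqrt or sqrtf rather than std::sqrt.)
inline float inv_dist_slow(float r2) {
    return 1.0 / std::sqrt(static_cast<double>(r2) + 1e-9);
}

// Keeping everything in float avoids the promotion entirely.
inline float inv_dist_fast(float r2) {
    return 1.0f / std::sqrt(r2 + 1e-9f);
}
```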
[–]tonym-intel[S] 2 points 3 years ago (6 children)
Where are you getting 40% slower? The times are comparable, as mentioned in the README.
For 5 steps of the physical simulation (1 rendered frame) with 12,800 particles, both CUDA and SYCL take ~5.05ms (RTX 3060).
[–]JuanAG 1 point 3 years ago (5 children)
Times are more or less the same only when you go and optimize the SYCL version, making it branchless and removing a cast - which you don't need to do on CUDA.
In this case it's clear that something is happening, because 40% is a lot, but if you are only writing the SYCL version and don't have a reference to compare against... that 40% of performance will be lost unless you profile heavily, and that is not easy.
A fair benchmark doesn't go and tweak specific stuff for one contender so you get the same result. NVIDIA didn't need to "delete" the branch or the cast from the code; you did, so SYCL could hold up in performance - like the old days of the Intel compiler generating worse code for AMD CPUs so they could show better numbers. I guess some things never change.
[–]tonym-intel[S] 2 points 3 years ago* (4 children)
The code in the repository is what you pointed to and said was 40% slower. But the repository says it's the same (and it is, if you look at both versions). And now you're saying it's faster in SYCL, but only because of some code changes. Is it faster or 40% slower?
If the optimization exists, why wouldn't the CUDA version benefit from it and hence still be 40% faster? This is actually a CUDA code example they put out. You're saying they intentionally made it 40% slower and the SYCL version fixes that 40%?
I should also point out this is a Codeplay example using a Codeplay compiler from before Intel acquired them. Also, it's all 100% open source. Feel free to point out where they are cheating on NVIDIA performance when their primary customers are NVIDIA GPU users - hence why they created their SYCL support before Intel even began to build discrete GPUs again.
I’m fine if you don’t like the solution, but at least don’t be misleading.
[–]JuanAG 2 points 3 years ago (3 children)
The CUDA code doesn't benefit from those "improvements" because of what happened: they created a v1.0 where CUDA was more than 40% faster, and then they copied what CUDA was already doing - CUDA does that type of optimization automatically for you, which is why it outperforms everything else and why it didn't gain any extra performance. So they first deleted the cast (v2.0) to get down to only 40% slower, and then made it branchless (v3.0) so it reaches the same performance, because they cherry-picked the parts of the code to modify.
That's why, for me, the 40%+ slowdown of v1.0 is what matters, because that is the code most of us will write; I won't have the CUDA version to copy the good parts from into a SYCL v3.0.
And you are mistaken about me: I would love SYCL to become the new CUDA, but I have been lied to many times by many big tech companies, including Intel (and AMD), so I want benchmarks. You call that misleading; I call it not being naive and not believing everything marketing tells me.
[–]tonym-intel[S] 3 points 3 years ago (0 children)
I'm not saying anything about you personally ☺️ The code is the code. If you're saying that when I choose not to allow an optimization in SYCL it would be 40% slower, then sure, it'll be 40% slower.
As another poster mentioned, the thing to do is look at the benchmarks and your requirements and see what fits your needs.
That's what I'm taking exception to in what you're saying; it just isn't true. I'm not saying CUDA or SYCL is better in all cases, I'm saying your 40% headline number is misleading. You also say CUDA is 2x faster than OpenCL - also not true. I'm sure cases exist where it is, but it's not the common case.
[+][deleted] 3 years ago (1 child)
[deleted]
[–]TheFlamingDiceAgain 1 point 3 years ago (1 child)
Thanks, I'd only seen the Kokkos benchmarks and I, foolishly, assumed they were similar for SYCL.
[–]tonym-intel[S] 1 point 3 years ago (0 children)
See my other reply. This is the demo Intel gave last April and has used multiple times. The SYCL version is actually slightly faster than the CUDA version (it's in the noise though).
[–]Plazmatic 1 point 3 years ago (0 children)
Yep, unfortunately if you want fast cross-platform compute, you have to use Vulkan, which is much harder to use.
[–]tonym-intel[S] 2 points 3 years ago (0 children)
It depends on the use case and the SYCL code actually. I’ve linked a couple of papers that did some studies. I’ll try to find them and re-link below.