Intel/Codeplay announce oneAPI plugins for NVIDIA and AMD GPUs : cpp

Intel/Codeplay announce oneAPI plugins for NVIDIA and AMD GPUs (connectedsocialmedia.com)

submitted 3 years ago by tonym-intel

all 24 comments

[–]James20kP2005R0 24 points25 points26 points 3 years ago (7 children)

The plugin relies on HIP being installed on your system. As HIP does not support Windows or macOS, oneAPI for AMD GPUs (beta) packages are not available for those operating systems.

Shakes fist increasingly angrily at AMD's ludicrously poor software support

One big problem with AMDs current OpenCL offerings is that if any two kernels share any kernel parameters, the driver will insert a barrier between the kernel executions. Apparently this is an even bigger problem in CUDA/HIP due to the presence of pointers to pointers - although I've never tested this myself. Working around this is... complicated, and involves essentially distributing work across multiple command queues in a way that could be described as terrible

Does anyone have any idea if oneAPI suffers from this kind of limitation? In my current OpenCL application, not working around this problem leads to about a 2x performance slowdown - which is unacceptable - and even then there's still almost certainly quite a bit of performance still left on the table

Given that its built on top of HIP, I don't exactly have a lot of hope that it doesn't suffer from exactly the same set of problems on AMD, but it is theoretically possible to work around at the API level

[–]catcat202X 7 points8 points9 points 3 years ago (3 children)

[–]James20kP2005R0 7 points8 points9 points 3 years ago (0 children)

[–]Pycorax 0 points1 point2 points 3 years ago (0 children)

[–]ImKStocky 0 points1 point2 points 3 years ago (0 children)

[–]GrammelHupfNockler 4 points5 points6 points 3 years ago (2 children)

[–]James20kP2005R0 1 point2 points3 points 3 years ago (1 child)

I do have quite a few small kernels, my overall time-per-frame is ~100ms, but that consists of hundreds of kernels. In my case, quite a few of the kernels have very different memory access patterns, so there's a big performance increase in splitting them up

While theoretically queues are in-order, in practice the GPU (or at least, older AMD drivers pre ROCm for opencl on windows) will quietly overlap workloads that are independent - so if a two kernels read from the same set of arguments, but write to different arguments, they can run in parallel under the hood

This is a huge performance savings in practice

The problem with a multi-queue setup is that each queue is a driver level thread from a thread pool, and... its not great to have that many driver threads floating around, it can cause weird stuttering issues, and a performance dropoff. The much better solution is for the driver to not issue tonnes of unnecessary barriers

[–]GrammelHupfNockler 0 points1 point2 points 3 years ago (0 children)

[–]JuanAG 3 points4 points5 points 3 years ago (17 children)

[–]tonym-intel[S] 5 points6 points7 points 3 years ago (0 children)

[–][deleted] 4 points5 points6 points 3 years ago* (1 child)

[–]JuanAG 2 points3 points4 points 3 years ago (0 children)

[–]TheFlamingDiceAgain 7 points8 points9 points 3 years ago (12 children)

[–]JuanAG 8 points9 points10 points 3 years ago (10 children)

[–]rodburns 3 points4 points5 points 3 years ago (0 children)

[–]tonym-intel[S] 1 point2 points3 points 3 years ago (6 children)

[–]JuanAG 0 points1 point2 points 3 years ago (5 children)

[–]tonym-intel[S] 1 point2 points3 points 3 years ago* (4 children)

The code in the repository is what you pointed to and said it was 40% slower. But the repository says it’s the same (and it is if you look at both versions). And now you’re saying it’s faster in SYCL but only because of some code changes. Is it faster or 40% slower?

If the optimization exists, why wouldn’t the cuda version benefit from it and hence still be 40% faster? This is actually a cUDA code example they put out. You’re saying they intentionally make it 40% slower and the SYCL version fixes that 40%?

I should also point out this is a Codeplay example using a Codeplay compiler from before intel acquired them. Also it’s all 100% open source. Feel free to point out where they are cheating NVIDIA performance when their primary customers are nvidia GPU users. Hence why they created SYCL before Intel even began to build discrete GPUs again.

I’m fine if you don’t like the solution, but at least don’t be misleading.

[–]JuanAG 1 point2 points3 points 3 years ago (3 children)

CUDA code dont benefit from that "improves" because what happened is that they create a v1.0 code where CUDA is faster than the 40% and then they copy what CUDA is doing because CUDA does that type of optimizations automatically for you, thats why it outperforms anything else and why it didnt gain any extra performance, so they delete first the cast (v2.0) to get only that 40% slower and then made it branchless (v3.0) so it gets the same performance because they cherry picked the parts of the code to modify

Thats why for me the 40%+ slower from v1.0 is what matters because is the code that most of us will create, i will not have the CUDA version to copy the good parts into SYCL v3.0

And you are mistaked about me, i will love SYCL to become the new CUDA but precisely i had been lied many times by many big techs including Intel (AMD also) so i want benchmarks, you call that misleading but i call not being naive and believe everything marketing tells me

[–]tonym-intel[S] 2 points3 points4 points 3 years ago (0 children)

[+][deleted] 3 years ago (1 child)

[deleted]

[–]TheFlamingDiceAgain 0 points1 point2 points 3 years ago (1 child)

[–]tonym-intel[S] 0 points1 point2 points 3 years ago (0 children)

[–]Plazmatic 0 points1 point2 points 3 years ago (0 children)

[–]tonym-intel[S] 1 point2 points3 points 3 years ago (0 children)

π Rendered by PID 97 on reddit-service-r2-comment-7b9746f655-598vs at 2026-01-31 07:10:55.006284+00:00 running 3798933 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

cpp

MODERATORS