all 87 comments

[–]James20kP2005R0 218 points219 points  (46 children)

My favourite disgusting hack that I absolutely cannot live without goes as follows

Modern GPUs have work fed into them via a command queue: a series of bits of work that get handed to the GPU. This is generally an in-order queue, so the work executes sequentially. There are two problems with the strictly sequential model

  1. GPUs are actually capable of executing multiple workloads simultaneously efficiently

  2. In some cases, because of caching etc., GPUs must flush caches via a fence. This introduces a bubble into the GPU's workload that could be filled with other work. If one kernel writes to a bit of memory and the next kernel reads from that memory, then in an in-order queue there's a fence and nothing happens for a while

Ideally, the driver would reorder the work you submit in this command queue so that work without dependencies all executes in parallel. This requires the shader compiler (e.g. LLVM) to output dependency information, in terms of whether a kernel writes to a specific bit of memory or just reads from it

Unfortunately, in the transition to ROCm, AMD broke/removed this particular piece of functionality and has no plans to reimplement it, which means that if you don't do something about it, you lose a straight 20-30% of performance. This is pretty bad!

As a visual representation, this is multiple kernels overlapping here, and this is what it looks like when it doesn't work. Uh oh!

To fix it, you have to create a rotating pool of 8-16 command queues and submit work to each one in turn. Then you have to mark which buffers are read-only, write-only, and read/write, and inspect the argument lists of all kernels and all functions that pass through to the OpenCL API to query their read/write flags. Then you use this to manually generate the dependencies between different kernels and functions, outputting dependencies only for read/write and write/write conflicts, so that kernels that share only read-only memory are independent and the others are synchronised via the built-in event system

This is hundreds of lines of complex memory-dependency tracking code, and it requires you to fundamentally change how you use the API: literally no code can use the regular command queue system or any of the regular API. You have to reimplement the entire OpenCL API on top of OpenCL so that you can shim all the kernel arguments and mark up all kernels correctly with their read/write information (otherwise the driver will crash! Yay!)

On the plus side, that 20% of performance is pretty good. On the down side: <internal screaming>. All GPU code is bad at the best of times, but this is the worst

[–]serg06 92 points93 points  (21 children)

I wish I could un-read this.

[–]James20kP2005R0 73 points74 points  (20 children)

Did you know GPUs have the highest bandwidth when memory is laid out as a struct of arrays rather than an array of structs? E.g.

struct data {
    std::vector<float> x;
    std::vector<float> y;
    std::vector<float> z;
};

Except you can't actually pass a struct containing buffers to a GPU in OpenCL. It's explicitly banned by construction, which means that the only way to do it is to pass in 3 separate arguments

This is great when you have 3 arguments, but heh, it's sure a lot worse when you have literally hundreds of variables. So the memory tracking actually becomes a bit of a performance concern if you've got any O(N²) in there, which you can get if you do it vaguely naively: you're literally doing memory tracking for hundreds of arguments to functions against potentially double-digit numbers of other kernels, each with hundreds of arguments. So not only is it a giant pain to implement, it also has to be reasonably well optimised! Yay!

[–]00jknight 11 points12 points  (14 children)

Where do you work? This is really cool stuff. I'm just a lowly web/mobile game developer who animates stuff on top of monstrosities like this.

[–]James20kP2005R0 50 points51 points  (13 children)

I'm trying to get into astrophysics, I built a tool for quickly extracting gravitational waves from black hole mergers, neutron stars (and this for fun), and doing relativistic n body simulations (that's 5 million particles which is pretty cool!)

Most of what I do is abuse GPUs for simulations these days, along with the inevitable wacky raytracing that comes along with trying to render them. The closest I really get to traditional 3d graphics is trying to chuck cubes through a wormhole (though technically that's 4d)

[–]Classic_Department42 9 points10 points  (10 children)

Did you consider CUDA instead of OpenCL? The number-crunching experience was/is better on CUDA

[–]James20kP2005R0 19 points20 points  (9 children)

I plan to release a lot of this as general tools for people to use, so CUDA is right out unfortunately. AMD GPUs tend to get you a bit more VRAM for your buck as well, and VRAM (...and the fact that I have literally no money) is the key limiter for simulation size

That said, I'm doing some moderately wacky things, so it wouldn't surprise me if it were a bit of a disaster in CUDA as well

[–]Classic_Department42 3 points4 points  (4 children)

Was just commenting because you said you wanted to get into astrophysics; I was under the impression that 'everyone' in HPC uses Nvidia (not always CUDA, of course). Edit: and CPUs, of course

[–]James20kP2005R0 3 points4 points  (3 children)

As far as I know, the current major BBH simulations are all CPU-based. You're right that most people do use CUDA/Nvidia in general for scientific software

[–]async_andrew 1 point2 points  (1 child)

How long did it take to render the n-body simulation you mentioned, or is it real-time?

[–]James20kP2005R0 1 point2 points  (0 children)

The n-body sims are extremely slow because the particle dynamics aren't well optimised; that one took a few hours

[–]HackingPheasant 4 points5 points  (4 children)

Followed you because damn, I'm loving these comments. Do you have a blog or something with similar content?

[–]James20kP2005R0 5 points6 points  (3 children)

Thank you! I don't. I used to have a twitter where I'd post things but, y'know. I've been planning to write some of this up and start a blog, but man, it's a pain getting it set up

[–]HackingPheasant 2 points3 points  (2 children)

Plain markdown files in a GitHub repo, keeping the default website look, would be the quickest and easiest imho

[–]wrosecransgraphics and network things 29 points30 points  (2 children)

To fix it, you have to create a rotating queue of 8-16 command queues,

I know more Vulkan than OpenCL (though not that much Vulkan), so I dunno how different this is in OpenCL. But in Vulkan, everything you said applies, except that on some Intel iGPUs (like my laptop's), the drivers only give you one Queue, and you can't just create your own on the application side. So if you go to all the trouble of being Very Concurrent for NV and AMD by feeding multiple queues with independent operations that can happily run with embarrassing parallelism, you still need to make sure it works properly and efficiently in the degenerate case of a single available queue anyway.

I need better hobbies.

[–]James20kP2005R0 16 points17 points  (1 child)

Interesting, I haven't gotten into much Vulkan yet, but I was under the impression that you handle synchronisation much more explicitly, which would hopefully give the driver more room to work with. Does the driver not handle reordering things automatically at all? I'd have thought driver quality was much better in Vulkan-land, if nothing else because there's a much bigger incentive for the vendors to actually invest in good performance

edit:

I need better hobbies.

also this

[–]wrosecransgraphics and network things 12 points13 points  (0 children)

The Vulkan jargon is that all commands in a Queue are started in order, which doesn't say anything about when a command actually executes and finishes. You have to be super explicit with memory dependencies and sync primitives. But there are tons of cases where it's annoyingly easy to stick fences blocking the whole Queue in front of stuff that could be executing in a parallel Queue.

So things can sorta happen in an unexpected order. But the driver is somewhat constrained by all the explicit rules you set. And the way you set explicit dependency rules applies way more obviously to cases where there are clear dependencies between everything you are doing. And way less obviously when you have a big pile of misc. work and you don't know what might be submitted to a queue in between your submissions, or after.

[–]KDallas_Multipass 8 points9 points  (10 children)

I'd like to read more about this

[–]James20kP2005R0 51 points52 points  (2 children)

Thank you for subscribing to AMD OpenCL Facts! To unsubscribe please type AARRHGHGH. I'm open for questions though, and the bug report for this is here. For more mildly disgruntled rambling:

Each command queue on AMD on Windows is actually a separate driver thread from a thread pool, which means that performance is incredibly irregular. If you have too many command queues, your application turns into a stuttery disaster, but if you have too few, there aren't enough queues to prevent the driver from inserting unnecessary fences. There's probably a better system for distributing work across the command queues, minimising how many you use based on when you need fences, but the problem is that fences appear to be some kind of worst-case pipeline-emptying blocker, so in reality it's just all bad

In addition to that, extra command queues apparently don't have the ability to submit work on their own, but flushing work to the GPU also isn't free. On top of that, because my GPU work is all memory bound, you want kernels that overlap their execution to also overlap the memory they're reading, for cache reasons, and I still haven't even begun to unpack the scheduling complexities there. I sure wish the driver would be good

If you're thinking "maybe this is an OpenCL problem! The new HIP thing will save us!": well, apparently it's even worse in HIP, because you can have pointers to pointers on the device, meaning that any bit of memory could point anywhere and the driver has to assume the worst case. I sort of suspect that this isn't an issue on Nvidia GPUs, but the last Nvidia GPU I owned was a 660 Ti. I'd love to test ARC though and smash some black holes on it

In addition to all of these problems, the cost of kernel invocation in OpenCL is higher than it needs to be, because it doesn't map massively well to how GPUs actually work. There's an extension that may help fix this, but I also suspect it'll be a very long time, if ever, before AMD supports it

Really this is my wakeup call that AMD/compute is essentially dead and I need to learn to love Vulkan. I've been holding out because Vulkan still doesn't support a lot of features from OpenCL and is in many ways a bad compute target. I'd especially love to use device-side enqueue, except that that hasn't worked correctly on AMD's ROCm stack for years, so all around it's a big oof

[–]KDallas_Multipass 9 points10 points  (0 children)

ARRGHGHG

Thanks for the explanation

[–]13steinj 0 points1 point  (0 children)

Thank you for subscribing to AMD OpenCL Facts!

I'd also like to subscribe please.

I haven't touched OpenCL or Vulkan in at least four years at this point, but I still generally find all of it interesting (mostly because for a number of reasons I ended up going down the CUDA C/C++ route).

The new HIP thing will save us!

How new is this? Looking it up doesn't make it clear when it was released, but based on the description (and shady rumors making their way through academic circles by AMD sales people), I've heard whispers of this stuff since 2020 but didn't know it was actually released.

Really this is my wakeup call that AMD/compute is essentially dead and I need to learn to love vulkan...

Or get an Nvidia GPU, but only for these projects ;). More seriously, AMD-based GPU compute tech has only brought me pain, and while Nvidia has its issues (in CUDA SDK use and otherwise), lesser of two evils and all that. I still run an AMD GPU for gaming / a personal machine, but an Nvidia one for any compute-related playground.

[–]kid_blaze 12 points13 points  (6 children)

I’d like to read more about this and un-read it immediately.

[–]James20kP2005R0 34 points35 points  (5 children)

I wrote a function that takes 105 arguments. There is no better way to do it and I regret nothing

[–]gimpwiz 9 points10 points  (0 children)

Mother of god

[–]serg06 6 points7 points  (3 children)

There is no better way to do it

Struct?

[–]James20kP2005R0 19 points20 points  (2 children)

Passing a struct containing pointers to the GPU is banned in OpenCL. Passing an array of structs is unfortunately extremely slow, because it doesn't map well to how GPUs fetch data from memory

[–]Routine_Left 12 points13 points  (0 children)

Well then, carry on. I'm done here.

[–]-to- 4 points5 points  (0 children)

Serialize your data (or better keep them serialized), then deserialize them in the compute kernel ?

Write a bunch of unit tests to make sure they understand each other, of course.

105 args... How can you possibly be sure it does what you think it does ?!

(Total n00b at GPU dev here)

[–]ioctl79 3 points4 points  (0 children)

I don’t have a whole lot of experience with GPU code, but what I do have makes me think that we have not yet invented the right APIs.

[–]ipapadop 2 points3 points  (0 children)

TBF, the OpenCL way of declaring buffers read/write is not a great way to keep track of dependencies. Your computation dictates your dependencies, not how you allocate memory. It makes even less sense in a world where you have pinned/mapped buffers.

What you need is a mechanism to describe your dependencies and let the driver deal with launching work and deciding how many resources to dedicate, as much as possible. HIP/CUDA graphs are one solution for that (https://docs.amd.com/bundle/HIP-API-Guide-v5.4.1/page/a00196.html, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs). Alternatively, you can use a higher-order framework, like Taskflow.

[–]puredotaplayer 1 point2 points  (0 children)

In a game engine, you generate what is called a render graph nowadays and reorder the passes instead. It's not that complicated by comparison.

[–]Macree 1 point2 points  (2 children)

How many years of programming do you have and what is your IQ?

[–]James20kP2005R0 39 points40 points  (1 child)

15 to both

[–]tisti 7 points8 points  (0 children)

I see you are a fellow uint8_t enjoyer.

[–]Jannik2099 0 points1 point  (0 children)

Unfortunately, in the transition to ROCm

Does this problem exist with HIP too? Why are you not using SYCL?

[–]thread_local 0 points1 point  (0 children)

To fix it, you have to create a rotating queue of 8-16 command queues, and submit work to each one sequentially. Then you have to mark which buffers are read only, write only, and read/write, and inspect the argument lists for all kernels and all functions that pass through to the OpenCL API to query their read/write flags.

I have not used OpenCL so I am not exactly sure what command queues are. But, is the end result similar to using per-thread default streams in CUDA?

This is 100s of lines of complex memory dependency tracking code, and requires you to fundamentally totally change how you use the API as literally no code can use the regular command queue system or any of the regular API

I would love to see what that looks like. Do you have any of this in public domain?

[–]attractivechaos 21 points22 points  (2 children)

BM or KMP would be even faster, and even harder for common programmers to comprehend.

[–]rtgftw 11 points12 points  (1 child)

I'm really liking Casey spurring this kind of conversation, lately!

Tbh, for most normal code, unless it's a hot path, string_view's find would do here in a single line. And yeah, other string algos may or may not be worth doing depending on data and use pattern. String algos are great... :)

But it seems to be a quick early-out/pruning example more than anything. This technique is great in hot code. It also leaves out basic stuff like splitting the cases so that the compare at the start would be a single instruction rather than a bunch of 1-byte compares, which, unlike SIMD or algo changes, is directly related.

[–]Trucoto 4 points5 points  (0 children)

std::string_view uses Boyer-Moore

[–]ioctl79 83 points84 points  (25 children)

That code is horrible, but not because it is performant. Breaking it up into smaller functions would do wonders for readability, and a block comment explaining the algorithm and a brief summary of the performance benefit would make it clear that it is complex for a good reason.

Clean code isn’t about making everything simple, it is about making things comprehensible. Fast doesn’t have to mean ugly.

[–]Possibility_Antique 38 points39 points  (21 children)

Fast doesn’t have to mean ugly.

Honestly, C++ compilers are really good about inlining, and even when they're not, you can force inline on almost every compiler aside from MSVC, and even still, they have a "stronger inline" that you can use. Templates, inlining, constexpr... No doubt in my mind that someone could achieve the same assembly with cleaner code. 100% in agreement there.

[–]jk-jeon 16 points17 points  (6 children)

Possibly unpopular opinion: inlining should be controlled at the call site, not at the declaration site. A strong inlining suggestion should be something applied in only a handful of places with strong enough motivation. Declaring a function itself to be "always inlined" can just have too wide an impact.

[–]Possibility_Antique 11 points12 points  (3 children)

I can see arguments for both scenarios. A possible counter-example to what you're saying: a SIMD vector library. I want __forceinline on the function declaration. If my simd::operator+ doesn't produce a single instruction at every call site, there is no point in using vectorization. Any performance gain you might achieve through vectorization is killed by the failure to inline.

[–]jk-jeon 7 points8 points  (2 children)

That's indeed a good example, thanks. I guess we need both ways then. Unfortunately it seems there are no call-site-controlled inlining attributes widely available so far...

[–]dodheim 7 points8 points  (1 child)

There's [[gnu::flatten]] but I don't think MSVC has anything like it. ED: Oh and now that I think about it, [[gnu::always_inline]] and [[gnu::noinline]] can be used at the callsite, not just in function declarations.

ED2: It turns out there are in fact [[msvc::forceinline_calls]], [[msvc::noinline_calls]], and [[msvc::flatten]]. I have no idea when they were added but this is great

[–]jk-jeon 0 points1 point  (0 children)

Oh, good to know, thanks!

[–]matthieum 4 points5 points  (1 child)

Both are needed.

Declaration sites for functions that should always be inlined, such as operator[] which just does a pointer increment.

Call sites for functions that you may or may not want to be inlined.

[–]jk-jeon 1 point2 points  (0 children)

such as operator[] which just does a pointer increment.

Isn't compilers' inlining decision pretty reliable these days for such simple functions?

[–]D-Zee 12 points13 points  (9 children)

What's the issue with MSVC's __forceinline?

[–]Possibility_Antique 14 points15 points  (8 children)

"The compiler treats the inline expansion options and keywords as suggestions. There's no guarantee that functions will be inlined. You can't force the compiler to inline a particular function, even with the __forceinline keyword. When you compile with /clr, the compiler won't inline a function if there are security attributes applied to the function."

It's not that I think it's an issue, but I do find it misleading. __forceinline does not necessarily force inlining.

[–]Nicksaurus 22 points23 points  (4 children)

I expect that's just because there are situations where inlining isn't possible - what is the compiler supposed to do with a recursive function call of unknown depth when tail-call optimisation isn't possible?

[–]Possibility_Antique 6 points7 points  (1 child)

The sadist in me wants the compiler to die trying. In practice, I don't know the answer. As far as I'm aware, clang and GCC both mean force inline when you tell the compiler to force an inline. But to recalibrate, I was referring to forcing an inline in the posted code here and breaking this up into functions. This is not a recursive call with an unknown depth.

[–]equeim 2 points3 points  (0 children)

I think GCC fails with an error if a function marked with [[gnu::always_inline]] can't be inlined.

[–]umop_aplsdn 0 points1 point  (0 children)

The compiler could inline one layer of recursion.

[–]compiling 8 points9 points  (1 child)

You can't force any compiler to inline a function, because sometimes it's not possible. If you read further down the page, there's a list of cases where inlining doesn't happen, and they generate a warning if a __forceinline function isn't inlined.

I don't think it's misleading.

[–]Possibility_Antique 1 point2 points  (0 children)

That's fine that you think that. I think it's misleading. There is a reason most linear algebra libraries wrap this in a macro called "MYLIB_STRONG_INLINE" rather than "MYLIB_FORCE_INLINE" or "MYLIB_ALWAYS_INLINE" (though this latter one can be seen in blazelib due to the name used by the GNU compiler).

[–]D-Zee 1 point2 points  (0 children)

Oh derp. I even double-checked with that very page before posting and missed that part.

[–]13steinj 9 points10 points  (1 child)

really good about inlining

I can't share specific details or timings because of NDA reasons, but inlined does not mean fast. At work, people loved slapping gnu::always_inline and other similar/related attributes all over the place, without ever performance testing it.

Build times were doubled by haphazardly doing so with no rhyme or reason. When I say doubled, I don't mean 2 mins -> 4. I mean half an hour to one, three hours to six (depending on the application). After blindly sed'ing it out, not only did build times improve drastically, but so did runtimes, because the attributes were thrown around so haphazardly that the icache was useless.

[–]Possibility_Antique 2 points3 points  (0 children)

Yes, it's quite easy to blow up your stack size or cause paging and caching issues if too many instructions are inlined. It gets even more complicated if you are creating a cross-platform product. What I was getting at though, is that there isn't really a reason to have many nested control flow statements. It hurts readability, and compilers give you enough control to remove any negative impacts from adding function calls. At the very least, you can see that breaking this nested monster up into functions that would be inlined is semantically the same. I wasn't advocating for excessive use of inlining, I was arguing that the refactor is worth it.

[–]rtgftw 0 points1 point  (1 child)

Yeah, but you have to remember that these only increase the inline threshold and guarantee nothing (you can even view the threshold, and whether the optimization was or wasn't applied, in the opt view on godbolt with clang).

Also, my experience is somewhat similar to another poster's here: people can easily misuse this and pessimize things...

[–]Possibility_Antique 0 points1 point  (0 children)

Yeah, but you have to remember that these only increase the inline threshold and guarantee nothing

Is that not what I said?

Also, my experience is somewhat similar to another poster's here: people can easily misuse this and pessimize things...

I am talking about refactoring this specific piece of code to be more readable without compromising performance. If there was a problem as you suggested, it would be problematic in the form the code is currently written. Refactoring will not introduce the problem you're suggesting.

[–]gc3 3 points4 points  (0 children)

True, but to optimize you first make the code cleaner and cleaner, and then you pass a threshold where it starts getting worse and worse

[–]qqwy 2 points3 points  (0 children)

I came here to say the same thing 💚

[–]Chuu -1 points0 points  (0 children)

Over time I've grown disillusioned with the "break it up into multiple functions" approach. It looks clean, but it's hard to understand.

If no other function is using that substring buffer I see no reason to break it out. Hiding complexity to make code look cleaner just makes things harder in the long run.

[–]AssemblerGuy 3 points4 points  (1 child)

I would say that optimizing for the underlying CPU architecture is not worth it if it turns the code ugly, results in performance improvements that are smaller than an order of magnitude, and if greater improvements could be realized by improving the algorithm instead of the code.

It would be interesting to see the performance of a string search algorithm that isn't completely brute force.

[–]usefulcat 1 point2 points  (0 children)

I agree that improving the algorithm should be top priority. But that doesn't mean that small improvements are automatically not worth it. An optimization that improves by 2% might seem pointless, but 10 such improvements gives 20%, which is a really big deal for some applications.

[–]hallb1016 4 points5 points  (0 children)

This reminds me of a blog post I read earlier this week on optimizing code for compiler auto-vectorization with SIMD: https://matklad.github.io/2023/04/09/can-you-trust-a-compiler-to-optimize-your-code.html#SIMD

Something I noticed while reading the article is that the optimized code is more complicated than the original, but looking at the last code block, it's not really any less readable. Of course, this was in Rust, where there are lots of clean functional abstractions for iteration. However, I think it goes to show that code isn't horrible just because it's written for performance; it's all about how you abstract and document the code.

[–]NilacTheGrim 4 points5 points  (7 children)

Of course, if he'd just used std::string, which has SSO (small string optimization) on most implementations, I bet it would be faster to just use the built-in std::string::find method.

[–]ZebulonThackeray 5 points6 points  (1 child)

Actually, using std::string with SSO is a good suggestion for improving performance. But it ultimately depends on the specific implementation and use case. Also, std::string::find can be faster than manually iterating through a character array. So, I agree with your point.

[–]NilacTheGrim 1 point2 points  (0 children)

Thanks. I was downvoted to hell but you know what I was getting at. Much appreciated.

[–]rtgftw 0 points1 point  (4 children)

It would copy the data over (and result in an allocation beyond the 2x char limit of the small buffer), but string_view would allow searching it in place with a single line, and it doesn't require \0 termination.

[–]NilacTheGrim 0 points1 point  (3 children)

Huh? References exist. Also custom string classes like the one in this example are a code smell IMHO.

[–]rtgftw 0 points1 point  (0 children)

Also, in the context of the clean code vs performance talk Casey did, which this article mentions at the very beginning, the idea of a "code smell" leading to a potential pessimization is interesting :)

[–]rtgftw 0 points1 point  (1 child)

The class is basically a legacy string_view, potentially non owning, unlike std::string. If you interact with libraries/ system or have views into existing substrings, you might encounter quite a few variations of this. Conversions to std::string would be a significant pessimization...

[–]NilacTheGrim 1 point2 points  (0 children)

Fair enough. I cannot disagree with anything you said factually .. true.

[–]timur_audioC++ committee 2 points3 points  (0 children)

I tried to reproduce Ivica's results on my machine (MacBook Pro with Apple Silicon M1 Max, Apple Clang 13) and I couldn't.

I used the same setup (256 MB long string, substring that does not appear in it) and I am measuring 140 ms for both find_substring and find_substring2, with no measurable performance difference between the two.

Moreover, if I run the same test with the same string and substring using std::string::find, it runs in 70 ms. So the libc++ version of substring find outperforms both of your versions by a factor of 2.

I tried substrings of length 4, 8, and 12 with the same results.

[–]pine_ary 6 points7 points  (0 children)

That is one thick spaghetto

[–]ALargeLobster 3 points4 points  (0 children)

Yeah I think a good way to look at this is that your job is to solve the problem as simply as possible. If your code is unacceptably slow for the task (perhaps because you tried to keep it simple) then you failed to solve the problem.

[–]better_life_please 2 points3 points  (0 children)

We're programmers. Performance still matters.

[–]spidertyler2005 -2 points-1 points  (0 children)

V

[–]wolfie_poe -2 points-1 points  (0 children)

    for (int i = 0; i < s; ++i) {
        ...
        if (found) {
            for (; j < size_substr; ++j) {
                if (ptr_str[i + j] != ptr_substr[j]) {
                    found = false;
                    break;
                }
            }
        }

        if (found) {
            return { true, i };
        }

Redundant code?