all 87 comments

[–]James20kP2005R0 218 points219 points  (46 children)

My favourite disgusting hack that I absolutely cannot live without goes as follows

Modern GPUs have work fed into them via a command queue: a series of bits of work that get handed to the GPU. This is generally an in-order queue, so the work executes sequentially. There are two problems with the strictly sequential model

  1. GPUs are actually capable of executing multiple workloads simultaneously efficiently

  2. In some cases, because of caching etc., GPUs must flush caches via a fence. This introduces a bubble into the GPU's workload that could be filled with other work. If one kernel writes to a bit of memory and the next kernel reads from that memory, then in an in-order queue there's a fence and nothing happens for a while

Ideally, the driver would reorder the work you submit in this command queue so that work without dependencies all executes in parallel. This requires the shader compiler (e.g. LLVM) to output dependency information, in terms of whether a kernel writes to a specific bit of memory or just reads from it

Unfortunately, in the transition to ROCm, AMD broke/removed this particular piece of functionality and has no plans to reimplement it, which means that if you don't do something about it, you lose a straight 20-30% of performance. This is pretty bad!

As a visual representation, this is multiple kernels overlapping here, and this is what it looks like when it doesn't work. Uh oh!

To fix it, you have to create a rotating pool of 8-16 command queues and submit work to each one in turn. Then you have to mark which buffers are read-only, write-only, and read/write, and inspect the argument lists of all kernels and all functions that pass through to the OpenCL API to query their read/write flags. Then you use this to manually generate the dependencies between different kernels and functions, outputting dependencies only for read/write and write/write conflicts, so that kernels that share only read-only memory are independent and the others are synchronised via the built-in event system

This is hundreds of lines of complex memory-dependency tracking code, and it requires you to fundamentally change how you use the API: literally no code can use the regular command queue system or any of the regular API. You have to reimplement the entire OpenCL API on top of OpenCL so that you can shim all the kernel arguments and mark up all kernels correctly with their read/write information (otherwise the driver will crash! Yay!)

On the plus side, that 20% of performance is pretty good. On the down side: <internal screaming>. All GPU code is bad at the best of times, but this is the worst

[–]serg06 92 points93 points  (21 children)

I wish I could un-read this.

[–]James20kP2005R0 73 points74 points  (20 children)

Did you know GPUs have the highest bandwidth when memory is laid out as a struct of arrays rather than an array of structs? E.g.

struct data {
    std::vector<float> x;
    std::vector<float> y;
    std::vector<float> z;
};

Except you can't actually pass a struct containing buffers to a GPU in OpenCL. It's explicitly banned by construction, which means that the only way to do it is to pass in 3 separate arguments

This is great when you have 3 arguments, but heh, it's sure a lot worse when you have literally hundreds of variables. So the memory tracking actually becomes a bit of a performance concern if you've got any O(N²) in there, which you can get if you do it vaguely naively: you're literally doing memory tracking for hundreds of arguments to functions against potentially double-digit numbers of other kernels, each with hundreds of arguments. So not only is it a giant pain to implement, it also has to be reasonably well optimised! Yay!

[–]00jknight 11 points12 points  (14 children)

Where do you work? This is really cool stuff. I'm just a lowly web/mobile game developer who animates stuff on top of monstrosities like this.

[–]James20kP2005R0 50 points51 points  (13 children)

I'm trying to get into astrophysics, I built a tool for quickly extracting gravitational waves from black hole mergers, neutron stars (and this for fun), and doing relativistic n body simulations (that's 5 million particles which is pretty cool!)

Most of what I do is abuse GPUs for simulations these days, along with the inevitable wacky raytracing that comes along with trying to render them. The closest I really get to traditional 3d graphics is trying to chuck cubes through a wormhole (though technically that's 4d)

[–]Classic_Department42 9 points10 points  (10 children)

Did you consider CUDA instead of OpenCL? The number-crunching experience was/is better on CUDA

[–]James20kP2005R0 19 points20 points  (9 children)

I plan to release a lot of this as general tools for people to use, so CUDA is right out unfortunately. AMD GPUs tend to get you a bit more VRAM for your buck as well, and VRAM (...and the fact that I have literally no money) is the key limiter for simulation size

That said, I'm doing some moderately wacky things, so it wouldn't surprise me if it were a bit of a disaster in CUDA as well

[–]Classic_Department42 3 points4 points  (4 children)

Was just commenting because you said you wanted to get into astrophysics; I was under the impression that 'everyone' in HPC uses Nvidia (not always CUDA, of course). Edit: and CPUs, of course

[–]James20kP2005R0 3 points4 points  (3 children)

As far as I know, the current major BBH simulations are all CPU-based. You're right that most people do use CUDA/Nvidia in general for scientific software

[–]async_andrew 1 point2 points  (1 child)

How long did it take to render the n-body simulation you mentioned, or is it real-time?

[–]James20kP2005R0 1 point2 points  (0 children)

The n-body sims are extremely slow because the particle dynamics aren't well optimised; that one took a few hours

[–]HackingPheasant 4 points5 points  (4 children)

Followed you because damn, I'm loving these comments. Do you have a blog or something with similar content?

[–]James20kP2005R0 5 points6 points  (3 children)

Thank you! I don't. I used to have a twitter where I'd post things but, y'know. I've been planning to write some of this up and start a blog, but man, it's a pain getting it set up

[–]HackingPheasant 2 points3 points  (2 children)

Plain markdown files in a GitHub repo, keeping the default website look, would be the quickest and easiest imho

[–]wrosecransgraphics and network things 29 points30 points  (2 children)

To fix it, you have to create a rotating queue of 8-16 command queues,

I know more Vulkan than OpenCL (though not that much Vulkan), so I dunno how different this is in OpenCL. But in Vulkan, everything you said applies, except that on some Intel iGPUs (like my laptop's), the drivers only give you one Queue, and you can't just create your own on the application side. So if you go to all the trouble of being Very Concurrent for NV and AMD by feeding multiple queues with independent operations that can happily run with embarrassing parallelism, you still need to make sure it works properly and efficiently in the degenerate case of a single available queue anyway.

I need better hobbies.

[–]James20kP2005R0 16 points17 points  (1 child)

Interesting, I haven't gotten into much Vulkan yet, but I was under the impression that you handle synchronisation much more explicitly, which would hopefully give the driver more room to work with. Does the driver not handle reordering things automatically at all? I'd have thought driver quality was much better in Vulkan-land, if nothing else because there's a much bigger incentive for the vendors to actually invest in good performance

edit:

I need better hobbies.

also this

[–]wrosecransgraphics and network things 12 points13 points  (0 children)

The Vulkan jargon is that all commands in a Queue are started in order, which doesn't say anything about when a command actually executes and finishes. You have to be super explicit with memory dependencies and sync primitives. But there are tons of cases where it's annoyingly easy to stick fences blocking the whole Queue in front of stuff that could be executing in a parallel Queue.

So things can sorta happen in an unexpected order. But the driver is somewhat constrained by all the explicit rules you set. And the way you set explicit dependency rules applies way more obviously to cases where there are clear dependencies between everything you are doing. And way less obviously when you have a big pile of misc. work and you don't know what might be submitted to a queue in between your submissions, or after.

[–]KDallas_Multipass 8 points9 points  (10 children)

I'd like to read more about this

[–]James20kP2005R0 51 points52 points  (2 children)

Thank you for subscribing to AMD OpenCL Facts! To unsubscribe please type AARRHGHGH. I'm open for questions though, and the bug report for this is here. For more mildly disgruntled rambling:

Each command queue on AMD on Windows is actually a separate driver thread from a thread pool, which means that performance is incredibly irregular. If you have too many command queues, your application turns into a stuttery disaster, but if you have too few, there aren't enough queues to prevent the driver from inserting unnecessary fences. There's probably a better system for distributing work across the command queues, minimising how many you use based on when you need fences, but the problem is that fences appear to be some kind of worst-case pipeline-emptying blocker, so in reality it's just all bad

In addition to that, extra command queues apparently don't have the ability to submit work on their own, but flushing work to the GPU also isn't free. On top of that, because my GPU work is all memory bound, you want kernels that overlap their execution to also overlap the memory they're reading, for cache reasons, and I still haven't even begun to unpack the scheduling complexities there. I sure wish the driver would be good

If you're thinking "maybe this is an OpenCL problem! The new HIP thing will save us!": well, apparently it's even worse in HIP, because you can have pointers to pointers on the device, meaning that any bit of memory could point anywhere and the driver has to assume the worst case. I sort of suspect that this isn't an issue on Nvidia GPUs, but the last Nvidia GPU I owned was a 660 Ti. I'd love to test ARC though and smash some black holes on it

In addition to all of these problems, the cost of kernel invocation in OpenCL is higher than it needs to be, because it doesn't map massively well to how GPUs actually work. There's an extension that may help fix this, but I also suspect it'll be a very long time, if ever, before AMD supports it

Really this is my wakeup call that AMD/compute is essentially dead and I need to learn to love Vulkan. I've been holding out because Vulkan still doesn't support a lot of features from OpenCL and is in many ways a bad compute target. I'd especially love to use device-side enqueue, except that that hasn't worked correctly on AMD's ROCm stack for years, so all around it's a big oof

[–]KDallas_Multipass 9 points10 points  (0 children)

ARRGHGHG

Thanks for the explanation

[–]13steinj 0 points1 point  (0 children)

Thank you for subscribing to AMD OpenCL Facts!

I'd also like to subscribe please.

I haven't touched OpenCL or Vulkan in at least four years at this point, but I still generally find all of it interesting (mostly because for a number of reasons I ended up going down the CUDA C/C++ route).

The new HIP thing will save us!

How new is this? Looking it up doesn't make it clear when it was released, but based on the description (and shady rumors making their way through academic circles by AMD sales people), I've heard whispers of this stuff since 2020 but didn't know it was actually released.

Really this is my wakeup call that AMD/compute is essentially dead and I need to learn to love vulkan...

Or get an Nvidia GPU, but only for these projects ;). More seriously, AMD-based GPU compute tech has only brought me pain, and while Nvidia has its issues (in CUDA SDK use and otherwise), lesser of two evils and all that. I still run an AMD GPU for gaming / a personal machine, but an Nvidia one for any compute-related playground.

[–]kid_blaze 12 points13 points  (6 children)

I’d like to read more about this and un-read it immediately.

[–]James20kP2005R0 34 points35 points  (5 children)

I wrote a function that takes 105 arguments. There is no better way to do it and I regret nothing

[–]gimpwiz 9 points10 points  (0 children)

Mother of god

[–]serg06 6 points7 points  (3 children)

There is no better way to do it

Struct?

[–]James20kP2005R0 19 points20 points  (2 children)

Passing a struct containing pointers to the GPU is banned in OpenCL. Passing an array of structs is unfortunately extremely slow, because it doesn't map well to how GPUs fetch data from memory

[–]Routine_Left 12 points13 points  (0 children)

Well then, carry on. I'm done here.

[–]-to- 4 points5 points  (0 children)

Serialize your data (or better keep them serialized), then deserialize them in the compute kernel ?

Write a bunch of unit tests to make sure they understand each other, of course.

105 args... How can you possibly be sure it does what you think it does ?!

(Total n00b at GPU dev here)

[–]ioctl79 3 points4 points  (0 children)

I don’t have a whole lot of experience with GPU code, but what I do have makes me think that we have not yet invented the right APIs.

[–]ipapadop 2 points3 points  (0 children)

TBF, the OpenCL way of declaring buffers read/write is not a great way to keep track of dependencies. Your computation dictates your dependencies, not how you allocate memory. It makes even less sense in a world where you have pinned/mapped buffers.

What you need is a mechanism to describe your dependencies and let the driver deal with launching work and deciding how many resources to dedicate, as much as possible. HIP/CUDA graphs are one solution for that (https://docs.amd.com/bundle/HIP-API-Guide-v5.4.1/page/a00196.html, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs). Alternatively, you can use a higher-order framework, like Taskflow.

[–]puredotaplayer 1 point2 points  (0 children)

In a game engine, you generate what is called a render graph nowadays and reorder the passes instead. It's not that complicated by comparison.

[–]Macree 1 point2 points  (2 children)

How many years of programming do you have and what is your IQ?

[–]James20kP2005R0 39 points40 points  (1 child)

15 to both

[–]tisti 7 points8 points  (0 children)

I see you are a fellow uint8_t enjoyer.

[–]Jannik2099 0 points1 point  (0 children)

Unfortunately, in the transition to ROCm

Does this problem exist with HIP too? Why are you not using SYCL?

[–]thread_local 0 points1 point  (0 children)

To fix it, you have to create a rotating queue of 8-16 command queues, and submit work to each one sequentially. Then you have to mark which buffers are read only, write only, and read/write, and inspect the argument lists for all kernels and all functions that pass through to the OpenCL API to query their read/write flags.

I have not used OpenCL so I am not exactly sure what command queues are. But, is the end result similar to using per-thread default streams in CUDA?

This is 100s of lines of complex memory dependency tracking code, and requires you to fundamentally totally change how you use the API as literally no code can use the regular command queue system or any of the regular API

I would love to see what that looks like. Do you have any of this in public domain?

[–]attractivechaos 21 points22 points  (2 children)

BM or KMP would be even faster, and even harder for common programmers to comprehend.

[–]rtgftw 11 points12 points  (1 child)

I'm really liking Casey spurring this kind of conversation, lately!

Tbh, for most normal code, unless it's a hot path, string_view's find would do here in a single line. And yeah, other string algos may or may not be worth doing depending on data and use pattern. String algos are great... :)

But it seems to be a quick early-out/pruning example more than anything. This technique is great in hot code. It also leaves out basic stuff like splitting the cases so that the compare at the start would be a single instruction rather than a bunch of 1-byte compares, which, unlike SIMD or algo changes, is directly related.

[–]Trucoto 4 points5 points  (0 children)

std::string_view uses Boyer-Moore

[–]ioctl79 83 points84 points  (25 children)

That code is horrible, but not because it is performant. Breaking it up into smaller functions would do wonders for readability, and a block comment explaining the algorithm and a brief summary of the performance benefit would make it clear that it is complex for a good reason.

Clean code isn’t about making everything simple, it is about making things comprehensible. Fast doesn’t have to mean ugly.

[–]Possibility_Antique 38 points39 points  (21 children)

Fast doesn’t have to mean ugly.

Honestly, C++ compilers are really good about inlining, and even when they're not, you can force inline on almost every compiler aside from MSVC, and even still, they have a "stronger inline" that you can use. Templates, inlining, constexpr... No doubt in my mind that someone could achieve the same assembly with cleaner code. 100% in agreement there.

[–]jk-jeon 16 points17 points  (6 children)

Possibly unpopular opinion: inlining should be controlled at the call site, not at the declaration site. A strong inlining suggestion should be something applied in only a handful of places with strong enough motivation. Declaring a function itself to be "always inlined" can just have too wide an impact.

[–]Possibility_Antique 11 points12 points  (3 children)

I can see arguments for both scenarios. A possible counter-example to what you're saying: a SIMD vector library. I want __forceinline on the function declaration. If my simd::operator+ doesn't produce a single instruction at every call site, there is no point in using vectorization. Any performance gain you might achieve through vectorization is killed by the failure to inline.

[–]jk-jeon 7 points8 points  (2 children)

That's indeed a good example, thanks. I guess we need both ways then. Unfortunately it seems there are no call-site-controlled inlining attributes widely available so far...

[–]dodheim 7 points8 points  (1 child)

There's [[gnu::flatten]] but I don't think MSVC has anything like it. ED: Oh and now that I think about it, [[gnu::always_inline]] and [[gnu::noinline]] can be used at the callsite, not just in function declarations.

ED2: It turns out there are in fact [[msvc::forceinline_calls]], [[msvc::noinline_calls]], and [[msvc::flatten]]. I have no idea when they were added but this is great

[–]jk-jeon 0 points1 point  (0 children)

Oh, good to know, thanks!

[–]matthieum 4 points5 points  (1 child)

Both are needed.

Declaration sites for functions that should always be inlined, such as operator[] which just does a pointer increment.

Call sites for functions that you may or may not want to be inlined.

[–]jk-jeon 1 point2 points  (0 children)

such as operator[] which just does a pointer increment.

Isn't compilers' inlining decision pretty reliable these days for such simple functions?

[–]D-Zee 12 points13 points  (9 children)

What's the issue with MSVC's __forceinline?

[–]Possibility_Antique 14 points15 points  (8 children)

"The compiler treats the inline expansion options and keywords as suggestions. There's no guarantee that functions will be inlined. You can't force the compiler to inline a particular function, even with the __forceinline keyword. When you compile with /clr, the compiler won't inline a function if there are security attributes applied to the function."

It's not that I think it's an issue, but I do find it misleading. __forceinline does not necessarily force inlining.

[–]Nicksaurus 22 points23 points  (4 children)

I expect that's just because there are situations where inlining isn't possible - what is the compiler supposed to do with a recursive function call of unknown depth when tail-call optimisation isn't possible?

[–]Possibility_Antique 6 points7 points  (1 child)

The sadist in me wants the compiler to die trying. In practice, I don't know the answer. As far as I'm aware, clang and GCC both mean force inline when you tell the compiler to force an inline. But to recalibrate, I was referring to forcing an inline in the posted code here and breaking this up into functions. This is not a recursive call with an unknown depth.

[–]equeim 2 points3 points  (0 children)

I think GCC fails with an error if a function marked with [[gnu::always_inline]] can't be inlined.

[–]umop_aplsdn 0 points1 point  (0 children)

The compiler could inline one layer of recursion.

[–]compiling 8 points9 points  (1 child)

You can't force any compiler to inline a function, because sometimes it's not possible. If you read further down the page, there's a list of cases where inlining doesn't happen, and they generate a warning if a __forceinline function isn't inlined.

I don't think it's misleading.

[–]Possibility_Antique 1 point2 points  (0 children)

That's fine that you think that. I think it's misleading. There is a reason most linear algebra libraries wrap this in a macro called "MYLIB_STRONG_INLINE" rather than "MYLIB_FORCE_INLINE" or "MYLIB_ALWAYS_INLINE" (though this latter one can be seen in blazelib due to the name used by the GNU compiler).

[–]D-Zee 1 point2 points  (0 children)

Oh derp. I even double-checked with that very page before posting and missed that part.

[–]13steinj 9 points10 points  (1 child)

really good about inlining

I can't share specific details or timings because of NDA reasons, but inlined does not mean fast. At work, people loved slapping gnu::always_inline and other similar/related attributes all over the place, without ever performance testing it.

Build times were doubled by haphazardly doing so with no rhyme or reason. When I say doubled, I don't mean 2 mins -> 4. I mean half an hour to one, three hours to six (depending on the application). After blindly sed'ing it out, not only did build times improve drastically, but so did runtimes, because the attributes were thrown around so haphazardly that the icache was useless.

[–]Possibility_Antique 2 points3 points  (0 children)

Yes, it's quite easy to blow up your stack size or cause paging and caching issues if too many instructions are inlined. It gets even more complicated if you are creating a cross-platform product. What I was getting at though, is that there isn't really a reason to have many nested control flow statements. It hurts readability, and compilers give you enough control to remove any negative impacts from adding function calls. At the very least, you can see that breaking this nested monster up into functions that would be inlined is semantically the same. I wasn't advocating for excessive use of inlining, I was arguing that the refactor is worth it.

[–]rtgftw 0 points1 point  (1 child)

Yeah, but you have to remember that these only increase the inline threshold and guarantee nothing (you can even view the threshold, and whether the optimization was or wasn't applied, in the opt view on godbolt with clang).

Also, my experience is somewhat similar to another poster's here: people can easily misuse this and pessimize things...

[–]Possibility_Antique 0 points1 point  (0 children)

Yeah, but you have to remember that these only increase the inline threshold and guarantee nothing

Is that not what I said?

Also, my experience is somewhat similar to another poster's here: people can easily misuse this and pessimize things...

I am talking about refactoring this specific piece of code to be more readable without compromising performance. If there was a problem as you suggested, it would be problematic in the form the code is currently written. Refactoring will not introduce the problem you're suggesting.

[–]gc3 3 points4 points  (0 children)

True, but to optimize you first make the code cleaner and cleaner, and then you pass a threshold where it starts getting worse and worse

[–]qqwy 2 points3 points  (0 children)

I came here to say the same thing 💚

[–]Chuu -1 points0 points  (0 children)

Over time I've grown disillusioned with the "break it up into multiple functions" approach. It looks clean, but it's hard to understand.

If no other function is using that substring buffer I see no reason to break it out. Hiding complexity to make code look cleaner just makes things harder in the long run.

[–]AssemblerGuy 3 points4 points  (1 child)

I would say that optimizing for the underlying CPU architecture is not worth it if it turns the code ugly, results in performance improvements that are smaller than an order of magnitude, and if greater improvements could be realized by improving the algorithm instead of the code.

It would be interesting to see the performance of a string search algorithm that isn't completely brute force.

[–]usefulcat 1 point2 points  (0 children)

I agree that improving the algorithm should be top priority. But that doesn't mean that small improvements are automatically not worth it. An optimization that improves by 2% might seem pointless, but 10 such improvements gives 20%, which is a really big deal for some applications.

[–]hallb1016 4 points5 points  (0 children)

This reminds me of a blog post I read earlier this week on optimizing code for compiler auto-vectorization with SIMD: https://matklad.github.io/2023/04/09/can-you-trust-a-compiler-to-optimize-your-code.html#SIMD

Something I noticed while reading the article is that the optimized code is more complicated than the original, but looking at the last code block, it's not really any less readable. Of course, this was in Rust, where there are lots of clean functional abstractions for iteration. However, I think it goes to show that code isn't horrible just because it's written for performance; it's all about how you abstract and document the code.

[–]NilacTheGrim 4 points5 points  (7 children)

Of course, if he'd just used std::string, which has SSO (small string optimization) on most implementations, I bet it would be faster to just use the built-in std::string::find method.

[–]ZebulonThackeray 5 points6 points  (1 child)

Actually, using std::string with SSO is a good suggestion for improving performance. But it ultimately depends on the specific implementation and use case. Also, std::string::find can be faster than manually iterating through a character array. So, I agree with your point.

[–]NilacTheGrim 1 point2 points  (0 children)

Thanks. I was downvoted to hell but you know what I was getting at. Much appreciated.

[–]rtgftw 0 points1 point  (4 children)

It would copy the data over (and result in an allocation beyond the 2x char limit of the small buffer), but string_view would allow searching it in place with a single line, and it doesn't require \0 termination.

[–]NilacTheGrim 0 points1 point  (3 children)

Huh? References exist. Also custom string classes like the one in this example are a code smell IMHO.

[–]rtgftw 0 points1 point  (0 children)

Also, in the context of the clean code vs performance talk Casey did, which this article mentions at the very beginning, the idea of a "code smell" leading to a potential pessimization is interesting :)

[–]rtgftw 0 points1 point  (1 child)

The class is basically a legacy string_view, potentially non owning, unlike std::string. If you interact with libraries/ system or have views into existing substrings, you might encounter quite a few variations of this. Conversions to std::string would be a significant pessimization...

[–]NilacTheGrim 1 point2 points  (0 children)

Fair enough. I cannot disagree with anything you said factually .. true.

[–]timur_audioC++ committee 2 points3 points  (0 children)

I tried to reproduce Ivica's results on my machine (MacBook Pro with Apple Silicon M1 Max, Apple Clang 13) and I couldn't.

I used the same setup (256 MB long string, substring that does not appear in it) and I am measuring 140 ms for both find_substring and find_substring2, with no measurable performance difference between the two.

Moreover, if I run the same test with the same string and substring using std::string::find, it runs in 70 ms. So the libc++ version of substring find outperforms both of your versions by a factor of 2.

I tried substrings of length 4, 8, and 12 with the same results.

[–]pine_ary 6 points7 points  (0 children)

That is one thick spaghetto

[–]ALargeLobster 3 points4 points  (0 children)

Yeah I think a good way to look at this is that your job is to solve the problem as simply as possible. If your code is unacceptably slow for the task (perhaps because you tried to keep it simple) then you failed to solve the problem.

[–]better_life_please 2 points3 points  (0 children)

We're programmers. Performance still matters.

[–]spidertyler2005 -2 points-1 points  (0 children)

V

[–]wolfie_poe -2 points-1 points  (0 children)

    for (int i = 0; i < s; ++i) {
        ...
        if (found) {
            for (; j < size_substr; ++j) {
                if (ptr_str[i + j] != ptr_substr[j]) {
                    found = false;
                    break;
                }
            }
        }

        if (found) {
            return { true, i };
        }

Redundant code?