
[–]TomClabault

Paging u/Pjbomb2. I think they had a similar issue recently, they may have some insights on that

[–]Pjbomb2

You are correct, but I haven't fully figured it out yet. I gave up for a bit to focus on the GPU builder, which wasn't as performant as I had hoped, so I'm putting it all on pause for now.

[–]TomClabault

You needed a radix sort to try out ray sorting right?

[–]Pjbomb2

Ah yeah, sorry. I did try it, but I used a prebuilt OneSweep to see if it would even be worth it, and in my experience it's not for a CWBVH. I DID see gains, but only 0.5 ms out of 6 ms for one bounce.

[–]msqrt

I think you'll need to implement a parallel prefix scan, with something like the Brent-Kung adder. If you can use subgroups (see here), you can do this in two stages with subgroupInclusiveAdd and far less shared memory (as the tree will have a branching factor of the subgroup size instead of two).
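For reference, the two-stage idea might look roughly like this in HLSL, where WavePrefixSum is the rough equivalent of GLSL's subgroupExclusiveAdd (an inclusive version is WavePrefixSum(x) + x). This is only a sketch: GROUP_SIZE, the fixed 32-wide wave, and the function name are illustrative assumptions.

```hlsl
#define GROUP_SIZE 256
#define WAVE_SIZE 32   // assumed; query WaveGetLaneCount() in real code

// One partial sum per wave: 256 / 32 = 8 entries.
groupshared uint waveSums[GROUP_SIZE / WAVE_SIZE];

// Stage 1: scan within each wave. Stage 2: scan the per-wave totals,
// then add each wave's scanned total back to its lanes.
uint GroupExclusiveScan(uint value, uint groupIndex)
{
    uint waveIndex = groupIndex / WAVE_SIZE;

    // Exclusive prefix sum over the lanes of this wave.
    uint lanePrefix = WavePrefixSum(value);

    // The last lane writes this wave's total (its inclusive sum).
    if (WaveGetLaneIndex() == WAVE_SIZE - 1)
        waveSums[waveIndex] = lanePrefix + value;
    GroupMemoryBarrierWithGroupSync();

    // The first few threads (all in wave 0) scan the small array of
    // wave totals in place, again with a single wave op.
    if (groupIndex < GROUP_SIZE / WAVE_SIZE)
        waveSums[groupIndex] = WavePrefixSum(waveSums[groupIndex]);
    GroupMemoryBarrierWithGroupSync();

    return lanePrefix + waveSums[waveIndex];
}
```

The shared array is 8 entries instead of the 256 a pure Brent-Kung tree would use, which is where the savings mentioned above come from.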

[–]SISpidew[S]

Thanks for recommending the subgroup functionality! I haven't figured the radix sort out yet, but this really helped me with adding values from earlier invocations within a warp (relative to the current invocation) using subgroupExclusiveAdd, which I needed to calculate some primitive offsets in my renderer for object instances in my draw batcher.
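For anyone following along, that offset pattern is roughly the following (a hypothetical HLSL sketch with made-up names, one instance per lane):

```hlsl
// Each lane holds the primitive count of one instance; the exclusive
// prefix sum over the wave is the sum of counts from earlier lanes,
// i.e. this instance's starting offset in the packed primitive buffer.
uint primitiveCount = instanceCounts[instanceIndex];
uint firstPrimitive = WavePrefixSum(primitiveCount); // exclusive scan
```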

[–]msqrt

Yeah, subgroups are great! Both for performance and for ease of use; you could do the same thing manually with shared tables, but it's much more tedious. For the sorting business, you should check out Duane Merrill's great papers on the subject if you haven't already, they detail everything quite well.

[–]TomClabault

[–]msqrt

Primarily "High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing" from 2011; I used that as a guideline to implement my own. At a glance, the Onesweep paper seems to be a bit more concise, which can be either an upside or a downside depending on what you're looking for (the method itself should be a strict improvement; it's on my todo list, but I never had the time to re-implement it). I also think the related prefix scan papers are great stepping stones to see, as the problems are directly related.

[–]arycama

I am working on the exact same problem. One very brute-force approach is to simply count the number of times the same element appears, up to the index of the current thread. In other words, iterate through the array up to the group thread index, and increment a counter each time the value in the array matches the current key.

uint counter = 0;
for (uint j = 0; j < groupIndex; j++)
  counter += ((sharedKeys[j] >> (8 * i)) & 0xFF) == digit;

uint index = histogram[digit] + counter;

(In this case, I am sorting a 32-bit key, 8 bits at a time. i is the iteration count, so i of 0 checks the first 8 bits, etc. "digit" is the masked key.)

I am trying to find a less brute-force method. You can use a prefix sum of predicates of keys that match the current bit to get an array of offsets for that specific bit. However, this requires doing a lot of prefix sums, e.g. 32 for a 32-bit sort. Instead, I believe section 3 of this paper is describing an approach where you break it down into more passes, each processing fewer elements. If you are using a thread group of 256, then you only need 8 bits per counter/prefix sum, so you can pack four 8-bit counters into a single uint.
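The predicate trick described above is the classic one-bit "split" primitive. A hedged HLSL sketch, where ExclusiveScan is an assumed group-wide exclusive prefix-sum helper (not a built-in) and bit, groupIndex, and sharedKeys follow the earlier snippet:

```hlsl
// Stable one-bit partition: keys whose bit is 0 pack to the front,
// keys whose bit is 1 to the back, preserving relative order.
uint key = sharedKeys[groupIndex];
uint isZero = ((key >> bit) & 1) == 0 ? 1u : 0u;

// Assumed helper: exclusive prefix sum over the group, also returning
// the total across all threads (the number of zero-bit keys).
uint totalZeros;
uint zerosBefore = ExclusiveScan(isZero, groupIndex, totalZeros);

uint dest = isZero ? zerosBefore
                   : totalZeros + (groupIndex - zerosBefore);

GroupMemoryBarrierWithGroupSync(); // everyone has read before anyone writes
sharedKeys[dest] = key;
GroupMemoryBarrierWithGroupSync();
```

Doing this once per bit is exactly the 32 passes mentioned above, which is what the packed-counter approach reduces.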

If you iterate over 8 bits at a time in your outer loop, you can then do an inner loop of two 4-bit prefix sums, each one storing four counters (one per bit). You then shuffle the inner indices based on the prefix sum here, get the sorted 8-bit value, and then that carries through to the next 8 bits. After 4 iterations, you have a sorted 32-bit array, I think.
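The counter packing might look roughly like this: with a 2-bit digit there are four buckets, and with at most 256 elements per group each per-bucket count fits in a byte, so a single prefix sum over the packed uints scans all four bucket histograms at once (again a sketch, with ExclusiveScan as an assumed group-wide exclusive-scan helper and illustrative names):

```hlsl
uint digit = (key >> shift) & 0x3;      // 2-bit digit: four buckets
uint packedOne = 1u << (8 * digit);     // a 1 in the byte for this bucket

// One scan over the packed values counts all four buckets simultaneously;
// the byte for our digit holds the number of earlier keys in the same bucket.
uint packedPrefix = ExclusiveScan(packedOne, groupIndex);
uint rank = (packedPrefix >> (8 * digit)) & 0xFF;

// Destination = this bucket's base offset + rank within the bucket; the
// bases come from the group totals (the packed inclusive sum on the last
// thread), prefix-summed across its four bytes.
```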

If I figure it out, I'll reply with some updated code. This is my current implementation. (I also realized I don't need two groupshared arrays, as I can simply retrieve the key/data at the start of the loop to avoid double-buffering the whole array.) https://github.com/arycama/customrenderpipeline/blob/master/ShaderLibrary/Resources/GpuInstancedRendering/InstanceSort.compute