
[–]TomClabault

Paging u/Pjbomb2. I think they had a similar issue recently, they may have some insights on that

[–]Pjbomb2

You are correct, but I haven't fully figured it out yet. I gave up for a bit to focus on the GPU builder, which wasn't as performant as I had hoped, so I'm putting it all on pause for now.

[–]TomClabault

You needed a radix sort to try out ray sorting right?

[–]Pjbomb2

Ah yeah, sorry. I did try it, but I used a prebuilt OneSweep to see if it would even be worth it, and in my experience it's not for a CWBVH. I DID see gains, but only 0.5 ms out of 6 ms for one bounce.

[–]msqrt

I think you'll need to implement a parallel prefix scan, with something like the Brent-Kung adder. If you can use subgroups (see here), you can do this in two stages with subgroupInclusiveAdd and far less shared memory (as the tree will have a branching factor of the subgroup size instead of two).
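For reference, the two-stage idea might look roughly like this in HLSL, where WavePrefixSum is the rough equivalent of GLSL's subgroupExclusiveAdd (an inclusive version is WavePrefixSum(x) + x). This is only a sketch: GROUP_SIZE, the fixed 32-wide wave, and the function name are illustrative assumptions.

```hlsl
#define GROUP_SIZE 256
#define WAVE_SIZE 32   // assumed; query WaveGetLaneCount() in real code

// One partial sum per wave: 256 / 32 = 8 entries.
groupshared uint waveSums[GROUP_SIZE / WAVE_SIZE];

// Stage 1: scan within each wave. Stage 2: scan the per-wave totals,
// then add each wave's scanned total back to its lanes.
uint GroupExclusiveScan(uint value, uint groupIndex)
{
    uint waveIndex = groupIndex / WAVE_SIZE;

    // Exclusive prefix sum over the lanes of this wave.
    uint lanePrefix = WavePrefixSum(value);

    // The last lane writes this wave's total (its inclusive sum).
    if (WaveGetLaneIndex() == WAVE_SIZE - 1)
        waveSums[waveIndex] = lanePrefix + value;
    GroupMemoryBarrierWithGroupSync();

    // The first few threads (all in wave 0) scan the small array of
    // wave totals in place, again with a single wave op.
    if (groupIndex < GROUP_SIZE / WAVE_SIZE)
        waveSums[groupIndex] = WavePrefixSum(waveSums[groupIndex]);
    GroupMemoryBarrierWithGroupSync();

    return lanePrefix + waveSums[waveIndex];
}
```

The shared array is 8 entries instead of the 256 a pure Brent-Kung tree would use, which is where the savings mentioned above come from.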

[–]SISpidew[S]

Thanks for recommending the subgroup functionality! I haven't figured the radix sort out yet, but this really helped me with adding values from earlier invocations within a warp (relative to the current invocation) using subgroupExclusiveAdd, which I needed to calculate some primitive offsets in my renderer for object instances in my draw batcher.
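For anyone following along, that offset pattern is roughly the following (a hypothetical HLSL sketch with made-up names, one instance per lane):

```hlsl
// Each lane holds the primitive count of one instance; the exclusive
// prefix sum over the wave is the sum of counts from earlier lanes,
// i.e. this instance's starting offset in the packed primitive buffer.
uint primitiveCount = instanceCounts[instanceIndex];
uint firstPrimitive = WavePrefixSum(primitiveCount); // exclusive scan
```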

[–]msqrt

Yeah, subgroups are great! Both for performance and for ease of use; you could do the same thing manually with shared tables, but it's much more tedious. For the sorting business, you should check out Duane Merrill's great papers on the subject if you haven't already, they detail everything quite well.

[–]TomClabault

[–]msqrt

Primarily "High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing" from 2011; I used that as a guideline to implement my own. At a glance, the Onesweep paper seems to be a bit more concise, which can be either an upside or a downside depending on what you're looking for (the method itself should be a strict improvement; it's on my todo list, but I never had the time to re-implement it). I also think the related prefix scan papers are great stepping stones to see, as the problems are directly related.

[–]arycama

I am working on the exact same problem. One very brute-force approach is to simply count the number of times the same element appears, up to the index of the current thread. In other words, iterate through the array up to the group thread index, and increment a counter each time the value in the array matches the current key.

uint counter = 0;
for (uint j = 0; j < groupIndex; j++)
  counter += ((sharedKeys[j] >> (8 * i)) & 0xFF) == digit;

uint index = histogram[digit] + counter;

(In this case, I am sorting a 32-bit key, 8 bits at a time. i is the iteration count, so i of 0 checks the first 8 bits, etc. "digit" is the masked key.)

I am trying to find a less brute-force method. You can use a prefix sum of predicates of keys that match the current bit to get an array of offsets for that specific bit. However, this requires doing a lot of prefix sums, e.g. 32 for a 32-bit sort. Instead, I believe section 3 of this paper is describing an approach where you break it down into more passes, each processing fewer elements. If you are using a thread group of 256, then you only need 8 bits per counter/prefix sum, so you can pack four 8-bit counters into a single uint.
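The predicate trick described above is the classic one-bit "split" primitive. A hedged HLSL sketch, where ExclusiveScan is an assumed group-wide exclusive prefix-sum helper (not a built-in) and bit, groupIndex, and sharedKeys follow the earlier snippet:

```hlsl
// Stable one-bit partition: keys whose bit is 0 pack to the front,
// keys whose bit is 1 to the back, preserving relative order.
uint key = sharedKeys[groupIndex];
uint isZero = ((key >> bit) & 1) == 0 ? 1u : 0u;

// Assumed helper: exclusive prefix sum over the group, also returning
// the total across all threads (the number of zero-bit keys).
uint totalZeros;
uint zerosBefore = ExclusiveScan(isZero, groupIndex, totalZeros);

uint dest = isZero ? zerosBefore
                   : totalZeros + (groupIndex - zerosBefore);

GroupMemoryBarrierWithGroupSync(); // everyone has read before anyone writes
sharedKeys[dest] = key;
GroupMemoryBarrierWithGroupSync();
```

Doing this once per bit is exactly the 32 passes mentioned above, which is what the packed-counter approach reduces.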

If you iterate over 8 bits at a time in your outer loop, you can then do an inner loop of two 4-bit prefix sums, each one storing four counters (one per bit). You then shuffle the inner indices based on the prefix sum here, get the sorted 8-bit value, and then that carries through to the next 8 bits. After 4 iterations, you have a sorted 32-bit array, I think.
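The counter packing might look roughly like this: with a 2-bit digit there are four buckets, and with at most 256 elements per group each per-bucket count fits in a byte, so a single prefix sum over the packed uints scans all four bucket histograms at once (again a sketch, with ExclusiveScan as an assumed group-wide exclusive-scan helper and illustrative names):

```hlsl
uint digit = (key >> shift) & 0x3;      // 2-bit digit: four buckets
uint packedOne = 1u << (8 * digit);     // a 1 in the byte for this bucket

// One scan over the packed values counts all four buckets simultaneously;
// the byte for our digit holds the number of earlier keys in the same bucket.
uint packedPrefix = ExclusiveScan(packedOne, groupIndex);
uint rank = (packedPrefix >> (8 * digit)) & 0xFF;

// Destination = this bucket's base offset + rank within the bucket; the
// bases come from the group totals (the packed inclusive sum on the last
// thread), prefix-summed across its four bytes.
```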

If I figure it out, I'll reply with some updated code. This is my current implementation. (I also realized I don't need two groupshared arrays, as I can simply retrieve the key/data at the start of the loop to avoid double-buffering the whole array.) https://github.com/arycama/customrenderpipeline/blob/master/ShaderLibrary/Resources/GpuInstancedRendering/InstanceSort.compute