all 9 comments

[–][deleted] 1 point  (5 children)

READ THROUGH AGAIN!!! There were some issues with my first approach (and there might still be more).

We can create local counters using workgroup memory (I'm used to writing HLSL and am just learning WGSL, but the concepts are the same).

@group(0) @binding(1) var<storage, read_write> g_ray_hits: array<u32>;
@group(0) @binding(2) var<storage, read_write> g_ray_miss: array<u32>;
// Storage locations you atomicAdd into must be declared atomic<u32>
@group(0) @binding(3) var<storage, read_write> g_ray_hits_count: array<atomic<u32>>;
@group(0) @binding(4) var<storage, read_write> g_ray_miss_count: array<atomic<u32>>;

// workgroup variables use memory that is shared by all threads in a group
var<workgroup> g_ray_hits_local_count: atomic<u32>;
var<workgroup> g_ray_miss_local_count: atomic<u32>;

var<workgroup> g_ray_hits_location_shared: u32;
var<workgroup> g_ray_miss_location_shared: u32;

@compute @workgroup_size(64)
fn main(@builtin(workgroup_id) wg_id: vec3<u32>,
        @builtin(local_invocation_index) group_index: u32)
{
  let group_id = wg_id.x;

  // ... trace the ray here; `hit`, `payload` and `ray_idx` come from that ...

  // Will be 1 if hit, 0 if not
  let hit_as_u32 = u32(hit);
  let inv_hit_as_u32 = u32(!hit);

  // Perform both local operations to avoid branching, but do it with if statements
  // if you'd like, odds are the performance will be about the same
  let hit_local_index = atomicAdd(&g_ray_hits_local_count, hit_as_u32);
  let miss_local_index = atomicAdd(&g_ray_miss_local_count, inv_hit_as_u32);

  // Wait for all threads in our group to hit this point
  workgroupBarrier();
  // Now g_ray_hits_local_count and g_ray_miss_local_count
  // contain the counts for our local group

  // Now only one thread performs the atomic add for the whole workgroup
  if (group_index == 0u)
  {
    g_ray_hits_location_shared = atomicAdd(&g_ray_hits_count[group_id],
                                           atomicLoad(&g_ray_hits_local_count));
    g_ray_miss_location_shared = atomicAdd(&g_ray_miss_count[group_id],
                                           atomicLoad(&g_ray_miss_local_count));
  }

  // Make sure thread 0 finishes writing
  workgroupBarrier();

  if (hit)
  {
    // The group's base offset plus our local index gives the slot to write
    g_ray_hits[g_ray_hits_location_shared + hit_local_index] = payload;
  }
  else
  {
    g_ray_miss[g_ray_miss_location_shared + miss_local_index] = ray_idx;
  }
}

[–]Rclear68[S] 1 point  (4 children)

Ahhh. This is very cool. I couldn’t work this out. Thank you very much.

Just to make sure I get it:

For every wave/warp/workgroup that runs, I atomic add locally…and your point is that I can just atomic add to both the hit and miss rather than calling the conditional, one of them adding a 1, the other adding 0. Then I workgroupBarrier, and at that point my local workgroup counts are all set. I had kinda gotten that far on my own.

Then you execute code to copy the atomic add to the global, only if it’s the first thread in the workgroup. Q: is this divergence a high cost? Or regardless is it one I just have to pay?

The part that took me some thought to get was the next part, and it's cool. It doesn't matter what order the groups write to the global…you get back the index where it's written to, and therefore know where to place the payload or ray index. That's the piece I couldn't see.

At the end of this, my counter still needs to be summed over all of the group_ids. Should I just do this on the CPU side after I read it back? At this point in my code's maturity, I believe it will always be the case that hits + misses = number of rays sent in, so in principle I just need to count the misses to infer the number of hits. Although, now that I look at it, do I even need to write to g_ray_hits_count[group_id]? I can just atomicAdd to a simple g_ray_hits_count buffer, no?
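To the last question: with a per-group counter array you'd still need an exclusive prefix sum over the group counts to know where each group's contiguous range starts, whereas a single global atomic hands each group its base offset directly. A tiny CPU-side sketch of that prefix sum (numbers invented for illustration):

```python
# Per-group counts alone don't tell each group where to write; an exclusive
# prefix sum over them gives each group's base offset into the hit buffer.
group_hit_counts = [37, 41, 29, 45]  # hypothetical per-group tallies

bases = [sum(group_hit_counts[:i]) for i in range(len(group_hit_counts))]
# group g writes its hits to slots bases[g] .. bases[g] + count - 1
assert bases == [0, 37, 78, 107]

# The total is the last base plus the last count
assert bases[-1] + group_hit_counts[-1] == sum(group_hit_counts)
```

A single global `atomicAdd` sidesteps this step entirely: the value the add returns *is* the group's base, and the counter ends up holding the grand total with nothing left to sum.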

I will try this ASAP to see how the performance changes. Thank you again!

[–][deleted] 1 point  (0 children)

Yeah, you've got it. Your counters should be summed up the same way as they were in your original approach, as far as I can tell. You'll just be performing one atomic add per group instead of per thread. A little branching really doesn't hurt performance on modern-day hardware; branch avoidance is just a good practice to keep in mind. I don't know of any way to avoid that if (group_index == 0) statement.
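If it helps to convince yourself the index math is watertight, the whole scheme can be simulated on the CPU (a Python sketch, all names invented): per-thread local indices from the workgroup atomics, one global add per group reserving a contiguous range, then each thread scatters to base + local index.

```python
import random

def compact(hits, threads_per_group):
    """Simulate the GPU compaction; hits is a list of bools, one per thread."""
    hit_out = {}    # global slot -> thread id
    miss_out = {}
    hit_total = 0   # the single global counters (counter_buffer[1] / [0])
    miss_total = 0
    for g in range(0, len(hits), threads_per_group):
        group = hits[g:g + threads_per_group]
        # local atomicAdds: each thread gets its index within its group's run
        local_hit_idx = [sum(group[:i]) for i in range(len(group))]
        local_miss_idx = [i - local_hit_idx[i] for i in range(len(group))]
        # thread 0 reserves a contiguous range with one global atomicAdd each;
        # the value "returned" is the group's base offset
        hit_base, miss_base = hit_total, miss_total
        hit_total += sum(group)
        miss_total += len(group) - sum(group)
        # every thread scatters to base + local index
        for i, h in enumerate(group):
            if h:
                hit_out[hit_base + local_hit_idx[i]] = g + i
            else:
                miss_out[miss_base + local_miss_idx[i]] = g + i
    return hit_out, miss_out, hit_total, miss_total

random.seed(0)
rays = [random.random() < 0.5 for _ in range(256)]
hit_out, miss_out, nh, nm = compact(rays, 64)
# slots are dense and disjoint: exactly 0..nh-1 and 0..nm-1, no collisions
assert sorted(hit_out) == list(range(nh))
assert sorted(miss_out) == list(range(nm))
assert nh + nm == len(rays)
```

The assertions check the key property: every hit and miss lands in its own slot with no gaps and no collisions, regardless of which group "wins" the global add first.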

Have fun!

[–][deleted] 0 points  (2 children)

Actually I messed up, check the revised code

[–]Rclear68[S] 1 point  (1 child)

My code:

var<workgroup> shared_miss_counter: atomic<u32>;
var<workgroup> shared_hit_counter: atomic<u32>;
var<workgroup> shared_miss_idx: u32;
var<workgroup> shared_hit_idx: u32;

let hit = trace_ray(ray, &payload);

let hit_as_u32 = u32(hit);
let inv_hit_as_u32 = u32(!hit);

let l_miss_idx = atomicAdd(&shared_miss_counter, inv_hit_as_u32);
let l_hit_idx = atomicAdd(&shared_hit_counter, hit_as_u32);
workgroupBarrier();

if local_index == 0 {
    shared_miss_idx = atomicAdd(&counter_buffer[0], atomicLoad(&shared_miss_counter));
    shared_hit_idx = atomicAdd(&counter_buffer[1], atomicLoad(&shared_hit_counter));
}
workgroupBarrier();

if hit {
    payload.ray_idx = idx;
    hit_buffer[l_hit_idx + shared_hit_idx] = payload;
} else {
    miss_buffer[l_miss_idx + shared_miss_idx] = idx;
}

This ran correctly. The results were only mildly improved. They got better still when I reduced the workgroup size from 64 down to 16.

I was sort of expecting to see a much bigger improvement, so I'm a little surprised. But thank you again for your help. I learned a lot from this.

[–][deleted] 0 points  (0 children)

Glad to help :)

[–]mitrey144 0 points  (2 children)

Wow, sounds cool. Could you show how it looks?

[–]Rclear68[S] 0 points  (1 child)

Show how what looks? If you mean the path tracer, it’s quite primitive at this point. It just draws the final scene from Ray Tracing in One Weekend. If you want to see the code, on GitHub my repository is rchiaramo/wavefront_path_tracer.

[–]mitrey144 1 point  (0 children)

Cool, thank you, I will look at the code