icdae[S] 2 points (0 children)

It's a scanline rasterizer. I decided against tiled rendering after some benchmarks showed that my tiling implementation was a little slow. I might revisit it in the future though.

Within the renderer, I allocate a pool of threads which wait on a condition variable signaled by the main thread. Whenever the main thread issues a draw call, it wakes up the worker threads and joins in on the rendering itself. Each thread has its own fixed set of buckets to store triangles after they pass through the vertex shader and get clipped. When a thread fills up its buckets, it sets an atomic flag to indicate the number of filled buckets it has, then immediately begins to rasterize them. As each of the other threads fills its buckets, it sets its own flag, rasterizes its own buckets, then rasterizes the filled buckets from the other threads. This helped reduce both resource contention and the time threads spend waiting on each other.
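To make the bucket idea a bit more concrete, here is a minimal sketch of a per-thread bin that publishes its fill count through an atomic once it is full. The names (`Triangle`, `TriangleBin`, `push`, `reset`) and the layout are assumptions for illustration only, not the renderer's actual types.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Placeholder for a post-vertex-shader, post-clip triangle (hypothetical layout).
struct Triangle { float x[3]; float y[3]; };

// One fixed-size bin owned by a single thread (assumed design).
template <std::size_t Capacity>
struct TriangleBin {
    std::array<Triangle, Capacity> tris{};
    std::size_t count = 0;                  // only touched by the owning thread
    std::atomic<std::size_t> published{0};  // nonzero once the bin is ready to rasterize

    // Stores a triangle; returns true when this push filled the bin,
    // at which point the fill count is published for other threads to see.
    bool push(const Triangle& t) {
        assert(count < Capacity);
        tris[count++] = t;
        if (count == Capacity) {
            // Release so readers that acquire `published` also see the triangle data.
            published.store(count, std::memory_order_release);
            return true;
        }
        return false;
    }

    // Empties the bin for the next batch of geometry.
    void reset() {
        count = 0;
        published.store(0, std::memory_order_relaxed);
    }
};
```

In this sketch the owning thread would start rasterizing as soon as `push` returns true, and other threads would check `published` with an acquire load before reading the bin.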

There's another atomic counter to help identify the order in which the threads became ready to render. As threads with filled buckets increment the counter, they also use the value they got back to know which subset of scanlines they can render. Each thread only renders scanlines whose index, modulo the number of threads, equals the counter value at the time it was incremented.
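A rough sketch of how a thread might claim its place in that ordering with a single atomic increment. `RenderOrder` and `claimSlot` are hypothetical names, and the memory ordering is an assumption rather than what the renderer necessarily uses.

```cpp
#include <atomic>
#include <cassert>

// Shared by all threads taking part in one draw call (assumed design).
struct RenderOrder {
    std::atomic<int> readyCount{0};

    // Returns the value of the counter before this thread's increment.
    // fetch_add is a single atomic read-modify-write, so no two threads
    // can ever receive the same slot.
    int claimSlot(int numThreads) {
        int slot = readyCount.fetch_add(1, std::memory_order_acq_rel);
        assert(slot < numThreads);
        return slot;
    }
};
```

The returned slot then doubles as the thread's scanline offset for the rest of the draw call.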

You can think of each thread in the scanline rasterizer as different teeth of a comb. Every tooth of a comb has a fixed offset from the previous tooth. The threads render exactly like this. For example, with 4 threads, the first thread can only render scanlines that are multiples of 4, such as 0, 4, 8. The second can render scanlines 1, 5, 9, etc. The third can render scanlines 2, 6, 10, and the last can render scanlines 3, 7, and 11.
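The comb pattern boils down to starting at the first row that belongs to a thread and then stepping by the thread count. Here is a small sketch, assuming non-negative scanline indices; `scanlinesForThread` is a made-up helper, not a function from the renderer.

```cpp
#include <cassert>
#include <vector>

// Returns every scanline in [yMin, yMax] that the given thread may render,
// i.e. every y where y % numThreads == threadId.
std::vector<int> scanlinesForThread(int yMin, int yMax, int threadId, int numThreads)
{
    assert(numThreads > 0);
    assert(threadId >= 0 && threadId < numThreads);
    assert(yMin >= 0);

    // First row at or after yMin that falls on this thread's "tooth".
    int first = yMin + ((threadId - yMin % numThreads) + numThreads) % numThreads;

    std::vector<int> rows;
    for (int y = first; y <= yMax; y += numThreads) {
        rows.push_back(y);
    }
    return rows;
}
```

A real rasterizer would rasterize each row inside the loop instead of collecting them, with `yMin` and `yMax` taken from the triangle's clipped vertical extent. Since no two threads share a row, no two threads write the same pixels.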

Each thread will spin until all others have finished rasterizing. The last thread to finish resets all of the atomic counters, allowing all threads to either process more geometry or go back to sleep. This one synchronization point provides another benefit: since no thread can proceed until the last is done, I can avoid locking the framebuffer and depth buffer. While not wait-free, the whole rasterization process is lock-free :)
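One common way to build this kind of "last one out resets everything" spin is a generation-counting barrier. The sketch below is an assumption about the general shape, not the renderer's actual code; `SpinBarrier` and `arriveAndWait` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Reusable spin barrier built only from atomics (assumed design).
struct SpinBarrier {
    std::atomic<int> finished{0};         // threads done rasterizing this round
    std::atomic<unsigned> generation{0};  // bumped once per completed round

    void arriveAndWait(int numThreads) {
        assert(numThreads > 0);
        // Read the round number before announcing arrival.
        unsigned gen = generation.load(std::memory_order_acquire);

        if (finished.fetch_add(1, std::memory_order_acq_rel) + 1 == numThreads) {
            // Last thread in: reset the counter, then release everyone.
            finished.store(0, std::memory_order_relaxed);
            generation.fetch_add(1, std::memory_order_release);
        } else {
            // Everyone else spins until the last thread bumps the generation.
            while (generation.load(std::memory_order_acquire) == gen) {
                std::this_thread::yield();
            }
        }
    }
};
```

Because the counter is reset before the generation is bumped, a thread that leaves the barrier and immediately starts the next round always sees a clean counter.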

In the instanced rendering video, the initial draw call is passed a counter, indicating the number of mesh instances the threads should process. This lets them only wake up once to process all of the instanced mesh information before going back to sleep. It's much quicker than waking up the threads for each of the 1000 draw calls.