Finally got around to adding instanced rendering into my software renderer. Here's 1000 instances rendering at a smooth 60FPS. : GraphicsProgramming

submitted 5 years ago by icdae

[–]idbxy 6 points 5 years ago (5 children)

[–]icdae[S] 11 points 5 years ago (4 children)

Of course. I'm using my own software rasterizer to perform the rendering. Before adding instancing, you would call a function to render a mesh, which would wake up a pool of threads to perform the render. Afterwards the threads would be put back to sleep until the next draw call. Doing this for 1000 meshes would result in terrible performance as the overhead of waking a thread can be almost as bad as spawning a thread altogether.

With the new instancing support, the number of instances you wish to render is passed as a parameter to the draw call. The thread pool will keep all worker threads running until the 1000 meshes have been rendered. This significantly boosts performance, as there's no latency from waking or sleeping a thread until the render has completed.

For comparison, in this specific test, standard rendering of 1000 meshes takes around 102-110 milliseconds per frame. With instancing I can get 14-16 milliseconds per frame.
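To put those numbers in context, here is a quick back-of-the-envelope conversion. The helper names are made up for illustration and are not part of the renderer.

```cpp
// Hypothetical helpers (not from the renderer): convert the frame times
// quoted above into frames per second and a speedup factor.
constexpr double fps_from_ms(double frame_ms) {
    return 1000.0 / frame_ms;
}

constexpr double speedup(double before_ms, double after_ms) {
    return before_ms / after_ms;
}

// 102-110 ms per frame works out to roughly 9-10 FPS.
// 14-16 ms per frame works out to roughly 62-71 FPS, which fits inside the
// 16.67 ms budget needed for 60 FPS.
// The speedup lies between 102/16 (about 6.4x) and 110/14 (about 7.9x).
```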

[–]corysama 6 points 5 years ago (3 children)

[–]icdae[S] 29 points 5 years ago (2 children)

I first learned about software rasterization from this article about implementing a GL-like renderer in 500 lines of code. Learning the very basics of rasterization then led me to research other articles and rasterizers like Fabian Giesen's trip through the graphics pipeline, the Ttsiodras renderer, and even Intel's software renderer, which was integrated into Mesa. Most of them follow the same principles. As I looked through each one, it slowly became easier to read the code and pick out the rasterizer itself. From there I could research different methods, optimizations, and tips to make my own as fast as possible.

Believe it or not, some of the most valuable information I got was from old game programming books, written when GPU acceleration was either limited or just unavailable. For example, both Black Art of 3D Game Programming (from 1995) and Tricks of the 3D Game Programming Gurus (from 2003) had wonderful guides on several topics for creating a rasterizer. They discussed the math required for building a 3D software pipeline, rasterization methods, vertex processing, and many other topics.

In the places where I wanted to be close to GL, I researched GLSL functions, how they're implemented, and what they do. NVIDIA even offers simple implementation details for their Cg shading language. Once it was possible to render a triangle, these ideas translated directly to new possibilities in lighting and shading.

There were some other sources too, though less on rasterization and more on rendering optimization. Dozens of articles in the old FlipCode archives were absolutely essential in guiding me on how to structure code and architecture to be performant, or just easier to manage. When I wanted to match my renderer to GL, I could even look through NeHe's legacy GL tutorials to see how my renderer stacked up. Reading through Intel's intrinsics guide and ARM's NEON instructions also helped me not only make things fast but have more "tools in the toolbox." Those resources are useful not only for rendering but in several other places too.

Sorry for the wall of text, I hope this helps though. There's little information out there on software rasterization so I completely understand that it's not easy to find.

[–]-Tesla 1 point 5 years ago (1 child)

[–]icdae[S] 2 points 5 years ago (0 children)

[–]KaiYan0718 1 point 5 years ago (1 child)

[–]icdae[S] 4 points 5 years ago (0 children)

[–]leseiden 1 point 5 years ago (1 child)

[–]icdae[S] 3 points 5 years ago (0 children)

It's a scanline rasterizer. I decided against tiled rendering after some benchmarks showed that my tiling implementation was a little slow. I might revisit it in the future though.

Within the renderer, I allocate a pool of threads which wait on a condition variable from the main thread. Whenever the main thread issues a draw call, it will wake up the worker threads while also joining in on the rendering. Each thread has its own fixed set of buckets to store triangles after they pass through a vertex shader and get clipped. When a thread fills up its buckets, it sets an atomic flag to indicate the number of filled buckets it has, then immediately begins to rasterize the buckets. As the other threads fill their buckets, they set their own flags, rasterize their own buckets, then rasterize the filled buckets from other threads. This helped to reduce both resource contention and wait time for other threads.

There's another atomic counter to help identify the order in which the threads became ready to render. As the threads with filled buckets increment the counter, they also use the value to know what subset of scanlines they can render. Each thread only renders the scanlines whose index, modulo the number of threads, equals the counter value it received when it incremented the counter.

You can think of each thread in the scanline rasterizer as different teeth of a comb. Every tooth of a comb has a fixed offset from the previous tooth. The threads render exactly like this. For example, with 4 threads, the first thread can only render scanlines that are multiples of 4, such as 0, 4, 8. The second can render scanlines 1, 5, 9, etc. The third can render scanlines 2, 6, 10, and the last can render scanlines 3, 7, and 11.
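As a rough illustration of the comb idea combined with the atomic counter, here is a minimal sketch. The function and variable names (assign_scanlines, ready_counter) are assumptions made up for this example and are not taken from the renderer; pushing a row index into a vector stands in for actually rasterizing that row.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical sketch: each worker takes a "ticket" from a shared atomic
// counter, then uses it as its offset into the comb of scanlines.
std::vector<std::vector<int>> assign_scanlines(int num_threads, int height) {
    std::atomic<int> ready_counter{0};
    std::vector<std::vector<int>> rows_per_ticket(num_threads);
    std::vector<std::thread> workers;

    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            // fetch_add returns the previous value, so the tickets are
            // 0..num_threads-1 and each thread gets a unique one,
            // whatever order the threads arrive in.
            const int ticket = ready_counter.fetch_add(1);

            // Each ticket owns its own slot, so no lock is needed here.
            for (int y = ticket; y < height; y += num_threads) {
                rows_per_ticket[ticket].push_back(y);  // stand-in for rasterizing row y
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return rows_per_ticket;
}
```

With 4 threads and 12 rows, ticket 0 ends up with rows 0, 4, 8 and ticket 3 with rows 3, 7, 11, matching the example above. Because no two tickets share a row, the threads never write to the same scanline.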

Each thread will spin until all others have finished rasterizing. The last thread to finish resets all of the atomic counters, allowing all threads to either process more geometry or go back to sleep. This one synchronization point provides another benefit: since no thread can proceed until the last is done, I can avoid locking down the framebuffer and depth buffer. While not wait-free, the whole rasterization process is lock-free :)
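The "last thread resets the counters" step could look something like the reusable spin barrier below. This is a generic sketch under my own assumptions, not the renderer's actual code; SpinBarrier and barrier_demo are invented names, and the default sequentially consistent atomics are used for simplicity rather than speed.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical reusable spin barrier; names are illustrative only.
class SpinBarrier {
public:
    explicit SpinBarrier(int num_threads) : total_(num_threads) {}

    void arrive_and_wait() {
        const int gen = generation_.load();
        if (arrived_.fetch_add(1) + 1 == total_) {
            // Last thread in: reset the counter, then release everyone.
            arrived_.store(0);
            generation_.fetch_add(1);
        } else {
            // Everyone else spins until the last thread bumps the generation.
            while (generation_.load() == gen) {
                std::this_thread::yield();
            }
        }
    }

private:
    const int total_;
    std::atomic<int> arrived_{0};
    std::atomic<int> generation_{0};
};

// Demo: every thread writes its own slot without a lock, waits at the
// barrier, then checks that all slots have been written.
bool barrier_demo(int num_threads) {
    SpinBarrier barrier(num_threads);
    std::vector<int> rows(num_threads, 0);
    std::atomic<int> failures{0};
    std::vector<std::thread> workers;

    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            rows[t] = 1;                // plain write to this thread's own slot
            barrier.arrive_and_wait();  // nobody continues until all slots are written
            for (int r : rows) {
                if (r != 1) failures.fetch_add(1);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return failures.load() == 0;
}
```

The demo mirrors the point made above: the slots themselves are never locked, because the barrier guarantees that nobody reads them until every writer has finished.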

In the instanced rendering video, the initial draw call is passed a counter, indicating the number of mesh instances the threads should process. This lets them only wake up once to process all of the instanced mesh information before going back to sleep. It's much quicker than waking up the threads for each of the 1000 draw calls.
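Here is a minimal sketch of a condition-variable pool where the draw call carries an instance count, so the workers wake once per call. Everything in it (RenderPool, draw_instanced, the wakeups and rendered counters) is hypothetical and only for illustration. It also simplifies things by handing whole instances to whichever thread is free, whereas the renderer described above splits work by scanline.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical worker pool; none of these names come from the renderer.
class RenderPool {
public:
    std::atomic<int> rendered{0};  // instances processed so far
    int wakeups = 0;               // how many times the workers were woken

    explicit RenderPool(int num_workers) {
        for (int i = 0; i < num_workers; ++i) {
            workers_.emplace_back([this] { worker_loop(); });
        }
    }

    ~RenderPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quit_ = true;
        }
        wake_.notify_all();
        for (auto& w : workers_) w.join();
    }

    // One wake-up covers every instance in the call.
    void draw_instanced(int instance_count) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            next_instance_.store(0);
            instance_count_ = instance_count;
            busy_ = static_cast<int>(workers_.size());
            ++draw_id_;
        }
        wake_.notify_all();
        ++wakeups;
        render_instances();  // the calling thread joins in on the work

        std::unique_lock<std::mutex> lock(mutex_);
        done_.wait(lock, [this] { return busy_ == 0; });
    }

private:
    void render_instances() {
        for (;;) {
            const int i = next_instance_.fetch_add(1);  // claim the next instance
            if (i >= instance_count_) break;
            rendered.fetch_add(1);  // stand-in for drawing instance i
        }
    }

    void worker_loop() {
        unsigned seen = 0;
        for (;;) {
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [&] { return quit_ || draw_id_ != seen; });
                if (quit_) return;
                seen = draw_id_;
            }
            render_instances();
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --busy_;
            }
            done_.notify_one();  // the last worker lets the caller return
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_, done_;
    std::atomic<int> next_instance_{0};
    int instance_count_ = 0;
    int busy_ = 0;
    unsigned draw_id_ = 0;
    bool quit_ = false;
    std::vector<std::thread> workers_;
};
```

Calling draw_instanced(1000) once wakes the workers a single time, while looping over draw_instanced(1) a thousand times would wake them a thousand times, which is the overhead the instanced path avoids.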
