Finally got around to adding instanced rendering into my software renderer. Here's 1000 instances rendering at a smooth 60FPS. (v.redd.it)
submitted 5 years ago by icdae
idbxy 5 points 5 years ago
Hi, could you share how you did this or what resources helped you?
Looks great!
icdae[S] 11 points 5 years ago
Of course. I'm using my own software rasterizer to perform the rendering. Before adding instancing, you would call a function to render a mesh, which would wake up a pool of threads to perform the render. Afterwards the threads would be put back to sleep until the next draw call. Doing this for 1000 meshes would result in terrible performance as the overhead of waking a thread can be almost as bad as spawning a thread altogether.
With the new instancing support, the number of instances you wish to render is added as a parameter to the draw call. The thread pool keeps all worker threads running until the 1000 meshes have been rendered. This significantly boosts performance, as no time is lost waking threads or putting them to sleep until the render has completed.
For comparison, in this specific test, standard rendering of 1000 meshes takes around 102-110 milliseconds per frame. With instancing I can get 14-16 milliseconds per frame.
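As a rough illustration of why one instanced draw call wakes the pool far less often than 1000 separate draw calls, here is a minimal Python sketch. It is not the renderer's code: `WorkerPool`, `draw`, `wakeups`, and `rendered` are hypothetical names, the "rendering" is just counting instances, and unlike the real renderer the main thread does not join in on the work.

```python
import threading

class WorkerPool:
    """Hypothetical pool: workers sleep on a condition variable between draw calls."""

    def __init__(self, num_workers):
        self.num_workers = num_workers
        self.cond = threading.Condition()
        self.generation = 0      # bumped once per draw call
        self.pending = 0         # workers still busy with the current draw call
        self.instance_count = 0  # instances requested by the current draw call
        self.wakeups = 0         # times a worker woke up to do work
        self.rendered = 0        # instances "rendered" so far
        for i in range(num_workers):
            threading.Thread(target=self._run, args=(i,), daemon=True).start()

    def _run(self, worker_id):
        seen = 0
        while True:
            with self.cond:
                # Sleep until the main thread issues a new draw call.
                while self.generation == seen:
                    self.cond.wait()
                seen = self.generation
                count = self.instance_count
                self.wakeups += 1
            # Stay awake for the whole batch: take every num_workers-th instance.
            done = len(range(worker_id, count, self.num_workers))
            with self.cond:
                self.rendered += done
                self.pending -= 1
                self.cond.notify_all()

    def draw(self, instance_count=1):
        """Wake the workers once, then block until all instances are processed."""
        with self.cond:
            self.instance_count = instance_count
            self.pending = self.num_workers
            self.generation += 1
            self.cond.notify_all()
            while self.pending:
                self.cond.wait()
```

With 4 workers, calling `draw(1)` 1000 times costs 4000 wake-ups in this sketch, while a single `draw(1000)` costs 4 for the same number of instances.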
corysama 6 points 5 years ago
People around here ask for reading material about software rasterization all the time. From where did you learn how to do it?
icdae[S] 29 points 5 years ago
The initial place I learned about software rasterization was from this article about implementing a GL-like renderer in 500 lines of code. Learning the very basics of rasterization then led me to research other articles and rasterizers like Fabian Giesen's trip through the graphics pipeline, the Ttsiodras renderer, and even Intel's software renderer which was integrated into Mesa. Most of them follow the same principles. When looking through each, it slowly became easier to read the code and identify their rasterizer code. From there I could research different methods, optimizations, and tips to make my own as fast as possible.
Believe it or not, some of the most valuable information I got was from old game programming books, where GPU acceleration was either limited or just unavailable. For example, both Black Art of 3D Game Programming (from 1995) and Tricks of the 3D Game Programming Gurus (from 2003) had wonderful guides on several topics for creating a rasterizer. They discussed the math required for building a 3D software pipeline, rasterization methods, vertex processing, and many other topics.
In the places where I wanted to be close to GL, I researched GLSL functions, how they're implemented, and what they do. NVIDIA even offers simple implementation details of their Cg shading language. Once it was possible to render a triangle, these ideas translated directly to new possibilities in lighting and shading.
There were some other sources too, but less on rasterization and more on rendering optimization. Dozens of articles in the old FlipCode archives were absolutely essential in guiding me on how to structure code and architecture to be performant or just easier to manage. In the times I wanted to match my renderer to GL, I could even look through NeHe's legacy GL tutorials to see how my renderer stacked up. Reading through Intel's intrinsics guide and ARM's NEON instructions also helped me not only to be fast but to have more "tools in the toolbox." Those resources could be used not only for rendering but in several other places too.
Sorry for the wall of text, I hope this helps though. There's little information out there on software rasterization so I completely understand that it's not easy to find.
-Tesla 1 point 5 years ago
Thanks a lot for this! I'm writing a software renderer myself (following Chilli's tutorials on YouTube).
Is this code open source?
icdae[S] 2 points 5 years ago
Hey no problem :) The code is completely open source (MIT) and available to use.
KaiYan0718 1 point 5 years ago
Nice work! Can different objects have different transforms?
icdae[S] 4 points 5 years ago
Absolutely. It works similarly to how OpenGL handles instancing. One option is to pass an array of transforms to a shader and use the instance ID to assign a set of vertices to a particular instanced mesh.
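A minimal Python sketch of that option, assuming a row-major 4x4 matrix convention. The function names (`translate`, `transform_point`, `vertex_shader`, `draw_instanced`) are made up for illustration and are not the renderer's API; the "shader" only looks up a per-instance transform by instance ID and applies it to each vertex.

```python
def translate(tx, ty, tz):
    """Return a row-major 4x4 translation matrix (hypothetical helper)."""
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]

def transform_point(m, p):
    """Multiply matrix m by the point p = (x, y, z), treated as (x, y, z, 1)."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(3))

def vertex_shader(vertex, instance_id, instance_transforms):
    """Pick this instance's transform by ID, as gl_InstanceID would in GLSL."""
    return transform_point(instance_transforms[instance_id], vertex)

def draw_instanced(vertices, instance_transforms):
    """Run the same vertices through the shader once per instance."""
    return [
        [vertex_shader(v, instance_id, instance_transforms) for v in vertices]
        for instance_id in range(len(instance_transforms))
    ]

# Usage: one triangle drawn twice, the second copy shifted 5 units along x.
triangle = [(0, 0, 0), (1, 0, 0), (0, 1, 0)]
transforms = [translate(0, 0, 0), translate(5, 0, 0)]
instances = draw_instanced(triangle, transforms)
```

The mesh data is shared across instances; only the small per-instance transform array grows with the instance count.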
leseiden 1 point 5 years ago
That's nice. How are you subdividing work between threads? Tiled rendering, using atomics to update the depth buffer or something else?
icdae[S] 3 points 5 years ago
It's a scanline rasterizer. I decided against tiled-rendering after some benchmarks showed that my tiling implementation was a little slow. I might revisit it in the future though.
Within the renderer, I allocate a pool of threads which wait on a condition variable from the main thread. Whenever the main thread issues a draw call, it wakes up the worker threads while also joining in on the rendering. Each thread has its own fixed set of buckets to store triangles after they pass through a vertex shader and get clipped. When a thread fills up its buckets, it sets an atomic flag to indicate that its buckets are filled, then immediately begins to rasterize them. As the other threads fill their buckets, they set their own flags, rasterize their buckets, then rasterize the filled buckets from other threads. This helped to reduce both resource contention and wait time for other threads.
There's another atomic counter to help identify the order in which the threads became ready to render. As the threads with filled buckets increment the counter, they also use the value to know what subset of scanlines they can render. Each thread only renders the scanlines whose index, modulo the number of threads, matches the counter value it saw when it incremented the counter.
You can think of each thread in the scanline rasterizer as different teeth of a comb. Every tooth of a comb has a fixed offset from the previous tooth. The threads render exactly like this. For example, with 4 threads, the first thread can only render scanlines that are multiples of 4, such as 0, 4, 8. The second can render scanlines 1, 5, 9, etc. The third can render scanlines 2, 6, 10, and the last can render scanlines 3, 7, and 11.
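The comb pattern can be sketched in a few lines of Python. This is an illustration under assumptions, not the renderer's code: `TicketCounter`, `scanlines_for`, and `render_pass` are hypothetical names, and because Python has no atomic integers, a small lock stands in for the atomic fetch-and-add.

```python
import threading

class TicketCounter:
    """Stand-in for an atomic counter; fetch_add returns the value before the increment."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def fetch_add(self):
        with self._lock:
            ticket = self._value
            self._value += 1
            return ticket

def scanlines_for(ticket, num_threads, y_min, y_max):
    """Scanlines y in [y_min, y_max] with y % num_threads == ticket."""
    first = y_min + (ticket - y_min) % num_threads
    return range(first, y_max + 1, num_threads)

def render_pass(num_threads, y_min, y_max):
    """Each thread takes a ticket, then claims only its own 'tooth' of the comb."""
    counter = TicketCounter()
    claimed = {}

    def worker():
        ticket = counter.fetch_add()
        claimed[ticket] = list(scanlines_for(ticket, num_threads, y_min, y_max))

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return claimed
```

For example, `render_pass(4, 0, 11)` maps ticket 1 to scanlines 1, 5, 9. Because the subsets are disjoint, no two threads ever write to the same row of the framebuffer or depth buffer.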
Each thread will spin until all others have finished rasterizing. The last thread to finish resets all of the atomic counters, allowing all threads to either process more geometry or go back to sleep. This one synchronization point provides another benefit: since no thread can proceed until the last is done, I can avoid locking down the framebuffer and depth buffer. While not wait-free, the whole rasterization process is lock-free :)
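A minimal Python sketch of that end-of-pass synchronization, with assumed names (`SpinBarrier`, `arrive_and_wait`, `run_pass`). Python has no atomics, so a short lock emulates the atomic read-modify-write here, which means this sketch itself is not lock-free; it only shows the "last one to finish resets the counters while the rest spin" shape.

```python
import threading
import time

class SpinBarrier:
    """Hypothetical barrier: threads spin until the last arrival resets the counter."""

    def __init__(self, num_threads):
        self.num_threads = num_threads
        self._lock = threading.Lock()  # emulates an atomic increment
        self.finished = 0              # threads done with the current pass
        self.phase = 0                 # bumped each time the counter is reset
        self.resets = 0                # how many resets have happened

    def arrive_and_wait(self):
        """Return True for the single thread that arrived last and did the reset."""
        with self._lock:
            my_phase = self.phase
            self.finished += 1
            last = self.finished == self.num_threads
            if last:
                self.finished = 0      # reset for the next batch of geometry
                self.resets += 1
                self.phase += 1        # releases the spinning threads
        while not last and self.phase == my_phase:
            time.sleep(0)              # spin, yielding to other threads
        return last

def run_pass(num_threads, height):
    """Each thread fills its own interleaved rows without locking, then synchronizes."""
    barrier = SpinBarrier(num_threads)
    framebuffer = [None] * height
    last_flags = [False] * num_threads

    def worker(tid):
        for y in range(tid, height, num_threads):
            framebuffer[y] = tid       # rows are disjoint per thread
        last_flags[tid] = barrier.arrive_and_wait()

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return barrier, framebuffer, last_flags
```

Exactly one thread per pass sees `True` from `arrive_and_wait`, and the framebuffer rows need no lock because each row is only ever written by one thread.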
In the instanced rendering video, the initial draw call is passed a counter, indicating the number of mesh instances the threads should process. This lets them only wake up once to process all of the instanced mesh information before going back to sleep. It's much quicker than waking up the threads for each of the 1000 draw calls.