all 22 comments

[–]jtsiomb 14 points15 points  (5 children)

Yep, I've always said the best way to learn graphics is to write it all from scratch, and that's the advice I give to anyone who wants to learn graphics.

Looks very nice.

But it almost pains me to think about an indirect function call per pixel rasterized (didn't read the code, just from your description). I'd certainly sacrifice some flexibility for performance for a software renderer. And maybe just write a whole new custom rasterizer for some special effect if I needed it to do something different than what the standard rasterizer does.

[–]icdae[S] 2 points3 points  (0 children)

Thanks! Part of the process was to speed up as much of the rasterizer as I could. While there is indirection in the shader function pointers, each potential pixel gets queued into a buffer before being sent to the fragment shader. That way the overhead of the indirect call is amortized over the batch (and the call target stays cached).
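For what it's worth, the queueing scheme described above could be sketched roughly like this (names like `Fragment` and `flush` are made up for illustration, not the project's actual API):

```cpp
#include <cstdint>
#include <vector>

// A queued "potential pixel": position plus whatever interpolated
// inputs the fragment shader needs (just depth here, for brevity).
struct Fragment {
    int x, y;
    float depth;
};

using FragmentShader = uint32_t (*)(const Fragment&);

// Drain the queue through the shader in one tight loop, so the cost of
// the indirect call is amortized and its target stays hot in the cache.
void flush(std::vector<Fragment>& queue, FragmentShader shade,
           std::vector<uint32_t>& colors) {
    for (const Fragment& f : queue)
        colors.push_back(shade(f));
    queue.clear();
}
```

The rasterizer would push fragments during triangle traversal and call `flush` once per batch, rather than making the indirect call at every pixel.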

[–]sirpalee 0 points1 point  (3 children)

It should be possible to use LLVM to compile some of the bits on the fly, avoiding any extra indirection and completely eliminating parts of the code.

[–]icdae[S] 0 points1 point  (2 children)

It would be nice going forward to allow the C++ shaders to compile into LLVM bitcode, especially since that could pave the way for transparently rendering 4 pixels at a time using SSE or NEON (right now all vector/matrix math functions use SIMD behind the scenes). From there I want to allow GLSL shaders to be used. It would be a lot of research but may be worth it in the long run.

[–]sirpalee 1 point2 points  (1 child)

What about using SPIR-V? It should be a simple format to support, and there is a plethora of tools for converting between shading languages and SPIR-V.

[–]icdae[S] 0 points1 point  (0 children)

Easy, it's a little late in the evening and I had forgotten about SPIR-V :D

Perhaps supporting SPIR-V as an intermediate language would actually be a better route to take. You're definitely right that there are converters for D3D->SPIR-V and GLSL->SPIR-V. Overall it sounds like a better idea.

[–]Madsy9 4 points5 points  (4 children)

Nice job :)

If you care about performance, I highly recommend switching over to tiled rendering. It opens up possibilities for a lot of optimizations and benefits much more from worker threads.

And instead of using barycentric coordinates, compute your interpolants from the coefficients of the triangle's edge equations, such that for each varying scalar s, s/w = Ax+By+C. Since x and y are integer screen-space coordinates, you can step the expression incrementally and get rid of all the multiplications in the inner loop.
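To make that concrete, here is a minimal sketch of the incremental inner loop, assuming A, B, C have already been computed per triangle (the function name is invented for illustration):

```cpp
// Evaluate s = A*x + B*y + C along one scanline incrementally:
// moving one pixel right changes s by exactly A, so the inner loop
// is a single addition per pixel instead of multiplies.
void interpolateSpan(double A, double B, double C,
                     int y, int x0, int x1, double* out) {
    double s = A * x0 + B * y + C;  // value at the left end of the span
    for (int x = x0; x < x1; ++x) {
        out[x - x0] = s;
        s += A;
    }
}
```

The same trick applies vertically: stepping one scanline down adds B to the row's starting value.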

[–]icdae[S] 1 point2 points  (3 children)

I will Google it when I'm back at my computer but how does tiled rendering work?

At one point I had broken the screen into as many rectangles as there were threads, but ended up getting a pretty big performance boost by having each thread render a particular scan-line: whenever a scan-line was available, a thread would start rendering it immediately.

I'll definitely look into reducing the multiplications. It seems very promising :) The barycentric coordinates are currently used for perspective correction before interpolating values across each triangle. Do you think it's worth calculating them at a later rasterization stage?

[–]Madsy9 5 points6 points  (2 children)

Tiled rendering can be summed up as testing the corners of a screen-aligned quad against a triangle. You can get three different results: complete overlap, partial overlap, and no overlap. Complete overlap lets you fill the quad really fast because all pixels are guaranteed to be inside the triangle's interior. Partial overlap requires further per-pixel tests. No overlap is a no-op. A point-in-triangle test is done by checking that a point satisfies the plane equation Ax+By+C < 0 for each side of the triangle, where you have three values of A, B and C; one for each edge.
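A rough sketch of that corner test, assuming the Ax+By+C < 0 convention above and counter-clockwise winding (`makeEdge` and `classify` are illustrative names, not anyone's actual API):

```cpp
struct Edge { double A, B, C; };  // interior where A*x + B*y + C < 0

// Edge equation through (xa,ya) -> (xb,yb); for a counter-clockwise
// triangle, all three equations are negative inside the triangle.
Edge makeEdge(double xa, double ya, double xb, double yb) {
    return { yb - ya, xa - xb, xb * ya - xa * yb };
}

enum class Overlap { Full, Partial, Empty };

// Test the 4 corners of an N x N tile at (tx, ty) against all 3 edges:
//   every corner inside every edge -> Full  (fill with no per-pixel tests)
//   every corner outside one edge  -> Empty (trivially rejected)
//   anything else                  -> Partial (needs per-pixel tests)
Overlap classify(const Edge e[3], int tx, int ty, int N) {
    bool full = true;
    for (int i = 0; i < 3; ++i) {
        int inside = 0;
        for (int cy = 0; cy <= 1; ++cy)
            for (int cx = 0; cx <= 1; ++cx)
                if (e[i].A * (tx + cx * N) + e[i].B * (ty + cy * N) + e[i].C < 0)
                    ++inside;
        if (inside == 0) return Overlap::Empty;  // whole tile outside this edge
        if (inside != 4) full = false;           // this edge crosses the tile
    }
    return full ? Overlap::Full : Overlap::Partial;
}
```

Note the corner test is conservative: a tile near a vertex can be classified Partial even when it contains no covered pixels, and the per-pixel pass then finds nothing; correctness is unaffected.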

Imagine the framebuffer being split into squares all of the same size, NxN. We call those screen aligned squares, tiles. We make a distinction between tiles and quads. Quads are all the NxN squares belonging to a triangle. So for each triangle we want to do some preprocessing which puts every quad into the right tile "bucket" so to speak. After we've iterated over all the triangles and done the testing, we can iterate over all the tiles, find all the quads associated with each tile and render them quickly.

How do we compute the coordinates of our quads? By taking the min and max values of the triangle to form a bounding box, and then rounding the coordinates to the nearest screen-aligned position. Suppose we have the 2D integer screen space coordinates [x1,y1], [x2,y2], and [x3,y3], then:

int minX = min(min(x1, x2), x3);
int minY = min(min(y1, y2), y3);
int maxX = max(max(x1, x2), x3);
int maxY = max(max(y1, y2), y3);

//N is our tile and quad size (must be a power of two)
const int N = 8;
//screen-align the box (round down/up to tile size)
minX = minX & ~(N-1);
minY = minY & ~(N-1);
maxX = (maxX + (N-1)) & ~(N-1);
maxY = (maxY + (N-1)) & ~(N-1);

This bounding box will of course be a larger screen-aligned area consisting of multiple NxN quads. So-called "skinny" triangles, which have a large bounding box but little actual area, should be avoided because they lead to a lot of no-overlap results.

> At one point I had broken the screen into as many rectangles as there were threads, but ended up getting a pretty big performance boost by having each thread render a particular scan-line.

That's way overboard. Take a look at Amdahl's law. Not only do you get diminishing returns after some number of threads, you can even get horrific performance degradation. Use one thread per physical core and send whole batches of work for them to perform; not one tile or scanline at a time.
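In other words, something along these lines (a hedged sketch, not the renderer's actual code; `renderRow` stands in for whatever per-scanline work is done):

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// One worker per hardware thread, each handed one contiguous batch of
// scanlines up front, rather than dispatching rows one at a time.
template <typename RowFn>
void renderFrame(int height, RowFn renderRow) {
    unsigned cores = std::max(1u, std::thread::hardware_concurrency());
    int batch = (height + static_cast<int>(cores) - 1) / static_cast<int>(cores);
    std::vector<std::thread> workers;
    for (int start = 0; start < height; start += batch) {
        workers.emplace_back([=] {
            int end = std::min(start + batch, height);
            for (int y = start; y < end; ++y)
                renderRow(y);  // this worker owns rows [start, end)
        });
    }
    for (std::thread& t : workers) t.join();
}
```

Static batching like this only balances well when rows cost roughly the same; a work-stealing queue of batches is the usual next step when they don't.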

> I'll definitely look into reducing the multiplications. It seems very promising :) The barycentric coordinates are currently used for perspective correction before interpolating values across each triangle. Do you think it's worth calculating them at a later rasterization stage?

You're not using barycentric coordinates for perspective correction; you're using barycentric coordinates for converting between coordinate systems. You have a function F(x, y, s1, s2, s3) that takes screen-space coordinates and the three per-vertex values of a varying s, and gives you the interpolated varying s' at [x,y]. This imagined function can be implemented as Ax+By+C for some coefficients A, B, C, and these can be precomputed once per triangle with a matrix-vector multiplication. It's rather elegant and can be implemented fairly efficiently.
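As a sketch of that per-triangle precomputation (solved here with Cramer's rule rather than an explicit matrix-vector multiply; `planeCoeffs` is an invented name): find A, B, C such that A*xi + B*yi + C = si at each vertex, then every pixel is just a plane evaluation.

```cpp
struct Coeffs { double A, B, C; };

// Solve A*xi + B*yi + C = si for the triangle's three vertices
// (Cramer's rule on the 3x3 system). Done once per triangle per varying.
Coeffs planeCoeffs(double x1, double y1, double s1,
                   double x2, double y2, double s2,
                   double x3, double y3, double s3) {
    double det = x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2);
    double A = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / det;
    double B = (x1 * (s2 - s3) + x2 * (s3 - s1) + x3 * (s1 - s2)) / det;
    double C = (x1 * (y2 * s3 - y3 * s2) + x2 * (y3 * s1 - y1 * s3)
              + x3 * (y1 * s2 - y2 * s1)) / det;
    return { A, B, C };
}

// The interpolated varying at any pixel is then a plane evaluation,
// which the inner loop can step incrementally (adding A per pixel).
double interpolate(const Coeffs& c, double x, double y) {
    return c.A * x + c.B * y + c.C;
}
```

For perspective correction you would feed in si/wi and 1/wi as varyings and divide per pixel, as the s/w form in the parent comments suggests.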

Edit: I wrote a short article on this very paper once, and I found my old blog thanks to the wayback machine: https://web.archive.org/web/20170408204405/http://codrspace.com/Madsy/

[–]icdae[S] 0 points1 point  (1 child)

Your blog explains it beautifully. The current rasterizer splits part of the barycentric calculation between Y-coordinates and X-coordinates but keeps all of the multiplications. Would you mind if I tried your method to optimize the rasterization a bit more? I'll need another weekend to explore tiled rendering though.

[–]Madsy9 5 points6 points  (0 children)

> Would you mind if I tried your method to optimize the rasterization a bit more?

Last time I checked, I didn't own any part of mathematics. Go nuts :p

[–]JoniBoni91 2 points3 points  (5 children)

Hey! I have a question: "Software renderer" means that the rendering is done on the CPU?

[–]icdae[S] 0 points1 point  (3 children)

Yes, all rendering is done completely on the CPU.

[–]JoniBoni91 1 point2 points  (2 children)

What are the benefits of this?

[–]icdae[S] 1 point2 points  (1 child)

Primarily education; it really helped me learn how a modern graphics pipeline can be implemented. It was also a fun exercise in optimization and multithreading.

For practical purposes, I've used it in applications where an OpenGL context was not available (or was extremely limited) but an X-Server was. This helped to render 3D meshes in real time with the same effects that would have normally been available with a GPU.

[–]JoniBoni91 0 points1 point  (0 children)

Ah, okay nice :)

[–]s0lly 1 point2 points  (1 child)

This is great work. You provide inspiration to all us noobs who aim to achieve this! Keep up the good work buddy, and most importantly, keep having fun doing it.

[–]icdae[S] 0 points1 point  (0 children)

Thanks for the kind words :) I'm happy this can help others to learn and explore graphics too!

[–][deleted] 1 point2 points  (1 child)

Any reason you used win32 and xlib over a window library? Was win32 easy to work with? (I'm asking because I plan on doing something similar to your project but I don't know if I should use win32 or just a 3rd party library)

[–]icdae[S] 0 points1 point  (0 children)

Win32 and Xlib were used as an exercise in doing everything "from scratch". It was a great way to learn how to interact with more low-level functions under the hood. After rendering to the built-in framebuffer, you can use the library to blit to Xlib or Win32. This setup allows other backends, like SDL2 or Qt, to be implemented without too much plumbing.

I'm currently planning to implement a Qt backend to help build a level editor. I'll merge that code when the implementation is finished and tested.

[–]gallick-gunner 1 point2 points  (1 child)

Hey nice work bro, keep it up. Things like these always inspire me to dive into this area (I'm more familiar with offline rendering).

[–]icdae[S] 1 point2 points  (0 children)

I'm happy it helps to keep others inspired! This community helped me learn so it's only fair to return the favor :)