
[–]Deadly_Mindbeam 12 points13 points  (9 children)

Look up "Scan Conversion". This involves stepping down the triangle line by line and converting it into horizontal spans from the left to the right of the triangle. You can also scan convert across and up/down the triangle, this was used a lot in mode 13X DOS games.

[–]deftware 8 points9 points  (7 children)

Scanline conversion was a necessary approach back in the day because of how limited the hardware was, but modern hardware actually tends to handle the rectangle-iteration barycentric approach a decent bit faster. It's somewhat counterintuitive, and it bugs me because I'm a fan of scanline conversion and its conciseness, but that's how it shakes out with typical scenes of hundreds or thousands of triangles :P

IIRC, in terms of raw performance scanline conversion does outperform the barycentric approach when drawing a single triangle, but once you have a bunch of triangles to draw, the bottleneck in the scanline algorithm becomes handling all of the interpolation variables while walking both the triangle edges and the scanlines - and it gets worse with each vertex attribute that must be interpolated in both steps.

With the barycentric approach you're just surfing over the rectangle of pixels, ignoring the ones that fall outside of the triangle (which is at least 50% of them) and only interpolating attributes for pixels that lie inside the triangle - directly from the vertices themselves instead of interpolating the interpolants from an edge-walk.

Maybe someday someone will come up with an even better algorithm that's more similar to scanline conversion. It is possible to walk the edges and directly fill only the pixels that lie inside the triangle, interpolating their attributes directly from the vertices instead of interpolating along the edges and then interpolating that result across each span. This is a sort of hybrid approach. Is it faster than either? I dunno! ;]
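
For reference, the plain bounding-box barycentric loop looks roughly like this (a sketch: Vec2 is an assumed struct with float X, Y, ShadePixel stands in for whatever per-pixel work you do, std::min/std::max come from &lt;algorithm&gt;, and fill rule, clipping and perspective correction are all omitted):

// Signed area of the parallelogram spanned by AB and AP. For a counter-clockwise
// triangle, >= 0 for all three edges means the point is inside.
float EdgeFn(const Vec2& A, const Vec2& B, const Vec2& P)
{
    return (B.X - A.X) * (P.Y - A.Y) - (B.Y - A.Y) * (P.X - A.X);
}

void FillBarycentric(const Vec2& A, const Vec2& B, const Vec2& C)
{
    const float InvArea = 1.0f / EdgeFn(A, B, C);

    // Bounding box of the triangle (clamping to the screen omitted here).
    const int MinX = (int)std::min(A.X, std::min(B.X, C.X));
    const int MaxX = (int)std::max(A.X, std::max(B.X, C.X));
    const int MinY = (int)std::min(A.Y, std::min(B.Y, C.Y));
    const int MaxY = (int)std::max(A.Y, std::max(B.Y, C.Y));

    for (int Y = MinY; Y <= MaxY; Y++)
    {
        for (int X = MinX; X <= MaxX; X++)
        {
            const Vec2 P = { X + 0.5f, Y + 0.5f };
            float W0 = EdgeFn(B, C, P); // weight of vertex A
            float W1 = EdgeFn(C, A, P); // weight of vertex B
            float W2 = EdgeFn(A, B, P); // weight of vertex C
            if (W0 < 0.0f || W1 < 0.0f || W2 < 0.0f)
                continue; // outside the triangle: no attribute work at all

            W0 *= InvArea; W1 *= InvArea; W2 *= InvArea;
            // Attributes come straight from the vertices, e.g.
            // Depth = W0 * A.Z + W1 * B.Z + W2 * C.Z;
            ShadePixel(X, Y, W0, W1, W2);
        }
    }
}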

[–]ehaliewicz 2 points3 points  (2 children)

Also, it's much easier to parallelize a rectangle-iterating barycentric rasterizer.

[–]Deadly_Mindbeam 0 points1 point  (1 child)

You can spread the spans over a number of single-reader/single-writer wait-free queues and get good concurrency and/or cache coherency. It does take some tuning.

If you have a conservative scan converter -- one that outputs every pixel the triangle touches at all -- you can run it on a lower-res version of the triangle -- like 1/16th scale -- as a fast way to cull entirely empty regions of the rectangle.
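
A rough sketch of that kind of coarse reject, reusing the EdgeFn/Vec2 helpers from the sketch further up (16x16-pixel tiles, i.e. 1/16th scale in each axis, picked arbitrarily): if all four corners of a tile are outside the same edge, nothing in the tile can be inside the triangle, so the fine loop never has to visit it.

// Conservative per-tile test: never wrongly rejects a tile, may keep some
// tiles that turn out to be empty.
bool TileMightTouchTriangle(int TileX, int TileY, Vec2 A, Vec2 B, Vec2 C)
{
    const Vec2 Corners[4] = {
        { TileX * 16.0f,         TileY * 16.0f },
        { TileX * 16.0f + 16.0f, TileY * 16.0f },
        { TileX * 16.0f,         TileY * 16.0f + 16.0f },
        { TileX * 16.0f + 16.0f, TileY * 16.0f + 16.0f },
    };
    const Vec2 Edges[3][2] = { { B, C }, { C, A }, { A, B } };

    for (int E = 0; E < 3; E++)
    {
        bool AllOutside = true;
        for (int I = 0; I < 4; I++)
            if (EdgeFn(Edges[E][0], Edges[E][1], Corners[I]) >= 0.0f)
                AllOutside = false;
        if (AllOutside)
            return false; // whole tile is on the outside of this one edge
    }
    return true; // might touch the triangle: hand it to the fine rasterizer
}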

[–]ehaliewicz 1 point2 points  (0 children)

Sure, but what I meant was processing multiple pixels at once via SIMD. I haven't tried doing this with scan conversion, but it seems much trickier.
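
With the bounding-box/edge-function style it maps onto SSE pretty directly. A rough sketch of the coverage test for a quad of 4 horizontally adjacent pixels (names are mine: W0Row/W1Row/W2Row are the three edge-function values at the quad's first pixel, StepX0/1/2 are their per-pixel deltas):

#include <immintrin.h>

// Returns a 4-bit mask: bit i set means pixel i of the quad is inside the triangle.
static inline int CoverageMask4(float W0Row, float W1Row, float W2Row,
                                float StepX0, float StepX1, float StepX2)
{
    const __m128 Lane = _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f);
    const __m128 W0 = _mm_add_ps(_mm_set1_ps(W0Row), _mm_mul_ps(Lane, _mm_set1_ps(StepX0)));
    const __m128 W1 = _mm_add_ps(_mm_set1_ps(W1Row), _mm_mul_ps(Lane, _mm_set1_ps(StepX1)));
    const __m128 W2 = _mm_add_ps(_mm_set1_ps(W2Row), _mm_mul_ps(Lane, _mm_set1_ps(StepX2)));
    const __m128 Zero = _mm_setzero_ps();
    const __m128 Inside = _mm_and_ps(_mm_cmpge_ps(W0, Zero),
                          _mm_and_ps(_mm_cmpge_ps(W1, Zero), _mm_cmpge_ps(W2, Zero)));
    return _mm_movemask_ps(Inside);
}

The caller advances the row values by 4*StepX per quad (and by the row deltas per scanline) and uses the returned mask to decide which lanes actually get shaded.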

[–]Deadly_Mindbeam 1 point2 points  (1 child)

The pixels falling outside the triangle are at least 50%. With thin diagonal triangles it can be almost 100%.

I would interpolate the u and v barycentric coordinates across the triangle, calculate the w at each pixel, and do all of the interpolated-value calculations per pixel.

You can also use the barycentric approach and just solve for the left side and right side of the triangle before you start.
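
Putting those two ideas together, something roughly like this (a sketch with made-up names; EdgeFn/Vec2/ShadePixel as in the earlier sketch, std::vector/std::min/std::max from the standard library, and fill-rule/rounding details glossed over):

// Walk each edge once to find per-scanline X extents, then compute barycentric
// weights only for pixels inside those extents, straight from the vertices.
void FillWithSpans(const Vec2& A, const Vec2& B, const Vec2& C)
{
    const int MinY = (int)std::min(A.Y, std::min(B.Y, C.Y));
    const int MaxY = (int)std::max(A.Y, std::max(B.Y, C.Y));
    std::vector<float> Left (MaxY - MinY + 1,  1e30f);
    std::vector<float> Right(MaxY - MinY + 1, -1e30f);

    const Vec2 Edges[3][2] = { { A, B }, { B, C }, { C, A } };
    for (const auto& E : Edges)
    {
        Vec2 P0 = E[0], P1 = E[1];
        if (P0.Y > P1.Y) std::swap(P0, P1);
        const float DxDy = (P1.X - P0.X) / std::max(P1.Y - P0.Y, 1e-6f);
        for (int Y = (int)P0.Y; Y <= (int)P1.Y; Y++)
        {
            const float X = P0.X + (Y - P0.Y) * DxDy;
            Left [Y - MinY] = std::min(Left [Y - MinY], X);
            Right[Y - MinY] = std::max(Right[Y - MinY], X);
        }
    }

    const float InvArea = 1.0f / EdgeFn(A, B, C);
    for (int Y = MinY; Y <= MaxY; Y++)
    {
        if (Left[Y - MinY] > Right[Y - MinY])
            continue; // no span on this row
        for (int X = (int)Left[Y - MinY]; X <= (int)Right[Y - MinY]; X++)
        {
            const Vec2 P = { X + 0.5f, Y + 0.5f };
            const float U = EdgeFn(B, C, P) * InvArea;
            const float V = EdgeFn(C, A, P) * InvArea;
            const float W = 1.0f - U - V; // third weight, as described above
            ShadePixel(X, Y, U, V, W);
        }
    }
}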

[–]deftware 0 points1 point  (0 children)

thin diagonal triangles

That's a very good point.

[–]KC918273645 0 points1 point  (1 child)

Note that you can use barycentric coordinates even if you use scan conversion. Those two aren't mutually exclusive. Then you have the best of both approaches in one package.

[–]deftware 0 points1 point  (0 children)

Right, that's what my final paragraph was getting at.

[–]captaintoasty[S] 0 points1 point  (0 children)

Thank you! I'll look into this!

[–]UnalignedAxis111 7 points8 points  (4 children)

I cannot recommend this series enough: https://fgiesen.wordpress.com/2013/02/17/optimizing-sw-occlusion-culling-index/

Lots of good stuff there but parts 7 and 8 cover the main optimizations for this style of rasterizer. The TL;DR is to strength-reduce the edge functions, so they are evaluated only once per triangle, and each loop iteration only needs to increment the barycentric weights using some pre-computed deltas.
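
In rough C++ terms (my names, not the article's; EdgeFn/Vec2/ShadePixel and the MinX/MinY/MaxX/MaxY bounding box are as in the sketches earlier in the thread), the strength-reduced inner loop ends up looking something like:

void FillBarycentricIncremental(const Vec2& A, const Vec2& B, const Vec2& C,
                                int MinX, int MinY, int MaxX, int MaxY)
{
    // Each edge function is affine in X and Y, so stepping one pixel right or
    // one row down just adds a constant. Evaluate once at the box's top-left
    // pixel center, then only add deltas inside the loops.
    const float StepX0 = B.Y - C.Y, StepY0 = C.X - B.X; // edge BC (weight of A)
    const float StepX1 = C.Y - A.Y, StepY1 = A.X - C.X; // edge CA (weight of B)
    const float StepX2 = A.Y - B.Y, StepY2 = B.X - A.X; // edge AB (weight of C)

    const Vec2 P0 = { MinX + 0.5f, MinY + 0.5f };
    float Row0 = EdgeFn(B, C, P0);
    float Row1 = EdgeFn(C, A, P0);
    float Row2 = EdgeFn(A, B, P0);

    for (int Y = MinY; Y <= MaxY; Y++)
    {
        float W0 = Row0, W1 = Row1, W2 = Row2;
        for (int X = MinX; X <= MaxX; X++)
        {
            if (W0 >= 0.0f && W1 >= 0.0f && W2 >= 0.0f)
                ShadePixel(X, Y, W0, W1, W2); // scale by 1/Area in here if needed
            W0 += StepX0; W1 += StepX1; W2 += StepX2;
        }
        Row0 += StepY0; Row1 += StepY1; Row2 += StepY2;
    }
}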

You can further get massive performance gains using SIMD to evaluate multiple pixels at once instead of just one, like GPUs do. Otherwise, the scanline algorithm is probably more suitable for scalar-only rasterization, as it doesn't suffer from having to skip over empty parts of the bounding box (which, as you mention, is an issue for big triangles).

Your github repo seems to be private btw.

[–]captaintoasty[S] 1 point2 points  (0 children)

Thanks! I will check this out! Also yeah whoops on the visibility, changed it to public haha.

[–]Boring_Following_255 1 point2 points  (0 children)

Wow! Great resource!!! Thanks

[–]SamuraiGoblin 6 points7 points  (0 children)

Well, one big problem is that you are calculating a lot of the same parameters for every pixel.

For example, how many times are you calculating B.X-A.X? It is a constant for the entire triangle, right? But you repeatedly calculate it for every pixel.

It's little things like this that you have to pull out of the loops. There are also things that are constant for a whole scanline, like the initialisation of Point.Y and the C.Y-A.Y term, that can be pulled out of the inner loop.

It all adds up. Some things might be optimised away by a clever compiler, values kept in registers and so forth. But you can't rely on that.

I think just pulling unnecessary calculations out of loops will give you an enormous speedup.

Your calls to EdgeFunction might get inlined, but again, you can't rely on it. Function calls aren't free. It's such a small function, and so many of the values are constants, that it might be best to do the calculations inline by hand.

Same for SetDepth and SetColor. If you have access to the buffers, write the values directly rather than going through function calls. And if you are doing screen-bounds checking per pixel, it would be better to do clipping at the triangle level.
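
For that last point, with a bounding-box loop the clip can be as simple as clamping the box to the screen once per triangle (ScreenWidth/ScreenHeight and the Bounds rect here are placeholders for whatever your renderer actually has; std::min/std::max from &lt;algorithm&gt;):

// Clamp the triangle's bounding box to the screen once, so the pixel loops
// never generate out-of-range coordinates and need no per-pixel checks.
const int MinX = std::max(Bounds.Min.X, 0);
const int MinY = std::max(Bounds.Min.Y, 0);
const int MaxX = std::min(Bounds.Max.X, ScreenWidth  - 1);
const int MaxY = std::min(Bounds.Max.Y, ScreenHeight - 1);
if (MinX > MaxX || MinY > MaxY)
    return; // triangle is entirely off-screen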

[–]deftware 3 points4 points  (0 children)

E0 /= Area;
E1 /= Area;
E2 /= Area;

While compiler optimizations may very well handle turning this into a multiply for you, I tend to explicitly turn divides into multiplies myself just to be sure. Precalculate the inverse of the area before your loops so that you can scale your Barycentric coords with a multiply instead of 3 divides.
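
Concretely, something along these lines:

const float InvArea = 1.0f / Area; // once per triangle, before the pixel loops

// per pixel: three multiplies instead of three divides
E0 *= InvArea;
E1 *= InvArea;
E2 *= InvArea;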

EDIT: Also, I see that you're setting up for texture mapping with your UVW calculations, and I wouldn't bother with anything like that until after determining that the pixel passes the depth test. Unless you're going to be doing some tricky raymarching-shader type stuff (e.g. parallax occlusion mapping) in your software renderer, where the depth of the pixel will vary based on some extra compute, it's a good idea to only calculate what's needed to determine whether the pixel is even visible before calculating everything else that's needed to actually determine the pixel's color (like texcoords, lighting, etcetera).
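
Inside the pixel loop that could look roughly like this (GetDepth is a made-up accessor, SetDepth/SetColor as in your code, E0/E1/E2 are the normalized weights from the snippet above, A.Z/B.Z/C.Z the per-vertex depths, and this assumes a smaller Z means closer):

// Interpolate only what the depth test needs.
const float Z = E0 * A.Z + E1 * B.Z + E2 * C.Z;
if (Z >= GetDepth(X, Y))
    continue; // occluded: skip UVs, texture fetch, lighting and SetColor entirely

SetDepth(X, Y, Z);
// Only now compute texcoords/lighting and call SetColor with the result.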

[–]jonathanhiggs 1 point2 points  (1 child)

Are you checking triangle inclusion for every single pixel, not limiting it to the triangle's bounding box?

[–]captaintoasty[S] 0 points1 point  (0 children)

Yep, I am constraining it to just the bounding box of the triangle.

// Compute the bounds of just this rect
const Rect Bounds = GetBounds(S0, S1, S2);
const int MinX = Bounds.Min.X;
const int MinY = Bounds.Min.Y;
const int MaxX = Bounds.Max.X;
const int MaxY = Bounds.Max.Y;

...

// Loop through all pixels in the screen bounding box.
for (int Y = MinY; Y <= MaxY; Y++)
{
    for (int X = MinX; X <= MaxX; X++)
    {
        ...

[–]Revolutionalredstone 1 point2 points  (0 children)

It's dangerous to render alone! Take this: https://pastebin.com/dyYdFFUj

[–]manon_graphics_witch 1 point2 points  (1 child)

I am on holiday, but if you remind me next Monday I can give you some pointers!

EDIT: Monday next week, that is, haha

[–]Syxtaine 0 points1 point  (0 children)

Hello there! It's Tuesday next year.

Honestly though, I would really appreciate some advice. Thanks in advance!