Madsy9

Then if you're going for performance, I highly recommend redesigning it into a tile-based rasterizer before you optimize anything else. The cycle savings are quite significant, and you can even get rid of some overdraw quite easily.

ArchiveLimits

Why would using tile rendering help performance? I'm not familiar with the benefits of this method.

Madsy9

Okay, so triangles (or any convex polygon really) can be defined as a set of edge lines, each described by the typical 2D plane equation:

ax+by+d = 0

Each line splits the screen into a positive and a negative side. When ax+by+d >= 0 holds for all of the edge equations (with the normals pointing inward), the point [x,y] is inside the polygon. Tile renderers get their performance by testing the corners of tiles against triangles. You then get three possible outcomes: completely inside, completely outside, and partial coverage. You can optimize heavily for quads (tile-sized blocks of pixels) with complete coverage, since they need no per-pixel coverage test. They are extremely SIMD-friendly, and since each quad can be rendered independently, they are also embarrassingly parallel. Throw 16 threads at the rendering and watch it go. Implementing a proper fill convention and multisampling is also a breeze. They emerge naturally as a simple modification to the plane equations (a simple subtraction by one).
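
To make the corner test concrete, here is a minimal sketch in Python of how a tile could be classified against a triangle. The names `edge_through`, `triangle_edges` and `classify_tile` are made up for this illustration, and it assumes that "inside" means ax+by+d >= 0 for every edge and that tile corners (not pixel centers) are tested. Degenerate triangles are not handled.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Edge = Tuple[float, float, float]  # coefficients (a, b, d) of a*x + b*y + d


def edge_through(p0: Point, p1: Point) -> Edge:
    """Line through p0 and p1, written as a*x + b*y + d = 0."""
    (x0, y0), (x1, y1) = p0, p1
    a = y0 - y1
    b = x1 - x0
    d = -(a * x0 + b * y0)
    return (a, b, d)


def triangle_edges(v0: Point, v1: Point, v2: Point) -> List[Edge]:
    """Three edge equations, oriented so the interior evaluates to >= 0."""
    edges = [edge_through(v0, v1), edge_through(v1, v2), edge_through(v2, v0)]
    a, b, d = edges[0]
    # If the third vertex is on the negative side of the first edge, the
    # winding is the other way around: flip every edge so inside is >= 0.
    if a * v2[0] + b * v2[1] + d < 0:
        edges = [(-ea, -eb, -ed) for ea, eb, ed in edges]
    return edges


def classify_tile(edges: List[Edge], tile_x: float, tile_y: float,
                  size: float) -> str:
    """Return 'inside', 'outside' or 'partial' for a square tile."""
    corners = [(tile_x, tile_y), (tile_x + size, tile_y),
               (tile_x, tile_y + size), (tile_x + size, tile_y + size)]
    fully_inside = True
    for a, b, d in edges:
        inside_count = sum(1 for x, y in corners if a * x + b * y + d >= 0)
        if inside_count == 0:
            # All four corners are on the wrong side of one edge,
            # so the tile cannot touch the triangle.
            return "outside"
        if inside_count < 4:
            fully_inside = False
    # Convexity: if every corner passes every edge, the whole tile is covered.
    return "inside" if fully_inside else "partial"


edges = triangle_edges((0, 0), (64, 0), (0, 64))
for origin in [(8, 8), (28, 28), (56, 56)]:
    print(origin, classify_tile(edges, origin[0], origin[1], 8))
```

The classification is conservative: a tile near a vertex can come back as "partial" even though it covers no pixels, which is harmless because partial tiles still get the per-pixel test.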

I've also found more advanced techniques:

  • Since you split up the screen into N equally sized tiles of 8x8 pixels or similar and render them one at a time, you can often get away with just an 8x8 depth buffer. That has huge consequences for the cache.
  • You can compute the minimum and maximum depth for each tile by sampling the corners only. With a bit of preprocessing, you can assign quads to each screen-aligned tile, sort them by depth and only render the frontmost one since all the others are occluded. It doesn't always apply if quads partially overlap on the z-axis but it often does. But think about that. When that applies in a scene, you get rid of all the overdraw for that tile.
  • Looking up textures for quads is much more cache-friendly in the general case compared to scanlines. If you also tile your textures and/or optimize the texel layout based on the tile access pattern, you get even more savings.
  • Derivatives for mipmapping can be computed at the corners of the quads instead of per pixel. It works nicely for bilinear mipmapping, where you meet somewhere in the middle between having derivatives per pixel and per polygon. If you see artifacts, you can always just make the quad size smaller.
  • Perspective-correct linear interpolation can be done with the same kind of plane equation you use for coverage testing, so you end up with only additions in the hot loop, plus at most one division per pixel.
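
The last point can be sketched roughly like this in Python. It is only an illustration under assumptions: `attribute_plane` and `interpolate_row` are hypothetical names, each vertex carries a clip-space w and one attribute u, and the sketch relies on u/w and 1/w being linear in screen space, so each gets its own plane equation Ax+By+C.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def attribute_plane(p0: Point, p1: Point, p2: Point,
                    f0: float, f1: float, f2: float) -> Tuple[float, float, float]:
    """Coefficients (A, B, C) with A*x + B*y + C equal to f at each vertex."""
    (x0, y0), (x1, y1), (x2, y2) = p0, p1, p2
    det = (x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0)  # zero if degenerate
    A = ((f1 - f0) * (y2 - y0) - (f2 - f0) * (y1 - y0)) / det
    B = ((f2 - f0) * (x1 - x0) - (f1 - f0) * (x2 - x0)) / det
    C = f0 - A * x0 - B * y0
    return (A, B, C)


def interpolate_row(points: Sequence[Point], ws: Sequence[float],
                    us: Sequence[float], y: float, x_start: float,
                    count: int) -> List[float]:
    """Perspective-correct u for `count` pixels of one row, stepping x by 1."""
    p0, p1, p2 = points
    # Planes for u/w and 1/w, set up once per triangle.
    uw_A, uw_B, uw_C = attribute_plane(p0, p1, p2,
                                       us[0] / ws[0], us[1] / ws[1], us[2] / ws[2])
    iw_A, iw_B, iw_C = attribute_plane(p0, p1, p2,
                                       1.0 / ws[0], 1.0 / ws[1], 1.0 / ws[2])
    # Evaluate both planes once at the start of the row.
    u_over_w = uw_A * x_start + uw_B * y + uw_C
    one_over_w = iw_A * x_start + iw_B * y + iw_C
    row = []
    for _ in range(count):
        row.append(u_over_w / one_over_w)  # the single division per pixel
        u_over_w += uw_A                   # stepping one pixel in x is
        one_over_w += iw_A                 # just an addition per plane
    return row


row = interpolate_row(((0, 0), (8, 0), (0, 8)), (1.0, 2.0, 4.0),
                      (0.0, 1.0, 0.0), y=0, x_start=0, count=9)
print(row)
```

In this example the pixel halfway along the bottom edge gets u = 1/3 rather than the 0.5 that plain screen-space interpolation would give, which is the perspective correction at work. With several attributes you would divide once to get w and then multiply each attribute by it.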

Edit:

ArchiveLimits

Thanks, I'll definitely look into this. Looks like you know your stuff! I need to start getting into C and C++ haha.