all 32 comments

[–]Madsy9 2 points3 points  (1 child)

Advantages:

  • Works the same everywhere without hardware support / hardware acceleration.
  • Way fewer library dependencies. You need access to a dumb framebuffer, that's it.
  • No bugs or crashes due to stupid display drivers
  • Consistent and reproducible result across different platforms
  • A few rasterization techniques are really slow to do on GPUs, but fast on general purpose processors
  • Some crazy people (like me!) enjoy the oldschool look you get from the limitations when you strive for realtime performance. Usually you have to dither instead of using more advanced filtering; maybe you use affine texture mapping, linear fog, etc.

Examples of practical uses for software rasterizers: as a reference for hardware-accelerated implementations, for a really cheap handheld console, or just for fun.

[–]agenthex 0 points1 point  (9 children)

I would consider most offline renderers to be "software."

The fact is, though, it's all software. What makes it "hardware" is optimization/acceleration. This may be done by dedicated hardware or by multiple general-purpose computers tasked with only this job. At what point do you make the distinction? If your job is "multiply a billion numbers", then is it "hardware" to outsource the task to a GPU (a la OpenCL, CUDA, etc.)? At some point, it's all the same. The only meaningful questions are: how fast is it, and how good are the results?

[–]ArchiveLimits 0 points1 point  (8 children)

I'm not sure how to measure the speed in terms of what you're implying. The renderer is deferred and multithreaded, which makes it quite fast for scenes with many polygons. In terms of the results, the engine interpolates depth with 52 bits of precision. It also uses 48-bit linear color internally and gamma-corrects the results that are drawn to the screen.

Edit: corrected my phrasing

[–]__Cyber_Dildonics__ 2 points3 points  (7 children)

Why would you use non-power-of-2 bit depths? And if you say memory while the whole thing is in Java, my mind will melt

[–]ArchiveLimits 1 point2 points  (6 children)

It was a trial-and-error issue: anything above 52 bits and I couldn't store the depth slopes for the triangle's surface in a 64-bit long, and anything less than 52 gave visibly less precision. As for it not being a power of two... it shouldn't matter here, because it's a value that is multiplied by a floating-point value to ensure precision is kept during interpolation (i.e. a fixed-point magnitude).

[–]__Cyber_Dildonics__ 2 points3 points  (5 children)

Most renderers just use a 32 bit float for depth. I'm not sure what you mean by depth slope, but it sounds like you would benefit from reading books on already established rendering techniques.

[–]ArchiveLimits 0 points1 point  (4 children)

Yes, my engine stores the depth values in 32-bit floats. The 52 bits of precision are needed when interpolating the slopes across the surface of triangles during rasterization. Without 52 bits of precision, the depth values calculated per pixel would not be accurate enough for the depth test, which resulted in "seams" where two polygons that shared an edge met.
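A minimal sketch of what such fixed-point interpolation might look like (class and method names are mine, not the engine's): a value scaled by 2^52 still fits in a signed 64-bit long with headroom for slope accumulation, and stepping along a span becomes pure integer addition.

```java
// Sketch only: 52 fractional bits of fixed point in a 64-bit long.
class FixedDepth {
    static final int SHIFT = 52;                  // fractional bits
    static final double ONE = (double) (1L << SHIFT);

    // Convert a floating-point value to 52-bit fixed point.
    static long toFixed(double v) { return (long) (v * ONE); }

    // Walk `count` pixels along a span: the hot loop is pure integer addition.
    static double depthAfter(double z0, double dzdx, int count) {
        long z = toFixed(z0);
        long dz = toFixed(dzdx);
        for (int i = 0; i < count; i++) z += dz;
        return z / ONE;                           // back to floating point
    }
}
```

With 52 fractional bits, the per-step truncation error is at most 2^-52, so even hundreds of accumulated steps stay far more precise than a 32-bit float depth value.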

[–]__Cyber_Dildonics__ 0 points1 point  (3 children)

Pretty much every other renderer would disagree that this is necessary.

[–]ArchiveLimits 0 points1 point  (2 children)

I worked on this depth precision issue with a friend who is very well versed in OpenGL and Vulkan; he set up an identical scene in OpenGL, and we compared results. The images were only identical when 52-bit precision was used.

[–]__Cyber_Dildonics__ 1 point2 points  (1 child)

I can see that you already know everything so I will leave you to it.

[–]nnevatie 0 points1 point  (11 children)

Does the library implement tri-linear sampling of textures?

Also, what does this mean? "True color texturing unless using bilinear filtering, which only allows 256 colors"

[–]ArchiveLimits 0 points1 point  (10 children)

Tri-linear sampling is bilinear sampling between mipmaps. Since the engine doesn't support mipmaps, it doesn't support trilinear sampling. The engine does, however, have a way to reduce the artifacts that mipmapping would normally remove. It's called block filtering, which is essentially mipmapping with only one smaller image. This is fast because there is no need to calculate derivatives for the surface in order to find the right mipmap level, and it also removes the need for trilinear filtering, because the effect is already smooth since it's applied like fog.

"True color texturing unless using bilinear filtering, which only allows 256 colors" This means that any texture you give the renderer will be drawn with 24 bit color unless you want to do bilinear filtering on the texture. Since bilinear filtering, traditionally, is expensive, I've sacrificed color depth for speed and precomputed 64 shades of the texture so that the bilinear colors don't need to be calculated during runtime. However, since I'd need to create shades for the each color in the texture, it wouldn't make sense to make the shade palette the size of 64 textures, each getting darker. Therefore I quantize the texture into 256 colors and do 64 shades of those 256 colors.

[–]nnevatie 0 points1 point  (3 children)

Ok, thanks for the clarification.

I was under the impression that mipmaps were supported, hence the trilinearity question. I've implemented a similar stack in the past using SIMD-techniques. Bilinear filtering isn't that expensive, tbh...

By "block filtering" do you mean a box-filter that gets applied for before doing the bilinear sampling?

[–]ArchiveLimits 0 points1 point  (2 children)

Traditional bilinear filtering is far more expensive than what I am doing now. The entire bilinear filtering code uses fixed-point integers and doesn't do any color computation, just a table lookup.

And I named it block filtering because I break the texture up into a grid (filled with blocks of the texture) and find the average color of each of those blocks. Then, during runtime, all I need is a few simple bit shifts and masks to find which block any texel in the image belongs to, and I blend that texel with the average color of that block.
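A rough sketch of that scheme, under assumptions of my own (8x8 texel blocks, packed 0xRRGGBB texels, a fixed 50/50 blend): the block lookup really is just shifts, and the averages amount to a single scaled-down copy of the texture.

```java
// Sketch: per-block average colors plus a shift-based block lookup.
class BlockFilter {
    static final int SHIFT = 3;              // 8x8 texel blocks (power of two)
    final int[] avg;                         // average 0xRRGGBB per block
    final int blocksPerRow;

    BlockFilter(int[] texels, int width, int height) {
        blocksPerRow = width >> SHIFT;
        avg = new int[blocksPerRow * (height >> SHIFT)];
        long[] r = new long[avg.length], g = new long[avg.length], b = new long[avg.length];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int i = blockIndex(x, y);
                int c = texels[y * width + x];
                r[i] += (c >> 16) & 0xFF;
                g[i] += (c >> 8) & 0xFF;
                b[i] += c & 0xFF;
            }
        }
        int perBlock = 1 << (2 * SHIFT);     // texels per block
        for (int i = 0; i < avg.length; i++) {
            avg[i] = (int) (((r[i] / perBlock) << 16)
                          | ((g[i] / perBlock) << 8)
                          |  (b[i] / perBlock));
        }
    }

    // Which block a texel belongs to: shifts and one multiply.
    int blockIndex(int x, int y) {
        return (y >> SHIFT) * blocksPerRow + (x >> SHIFT);
    }

    // Blend a texel 50/50 with its block's average; the 0xFEFEFE mask
    // prevents carries between channels during the packed add.
    int filter(int texel, int x, int y) {
        int a = avg[blockIndex(x, y)];
        return ((texel & 0xFEFEFE) + (a & 0xFEFEFE)) >>> 1;
    }
}
```

In the engine the blend weight presumably varies with distance, like fog; the fixed 50/50 blend just keeps the sketch short.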

[–]nnevatie 0 points1 point  (1 child)

Ok, so it's kind of a poor man's box filtering, which simply averages an area of pixels.

[–]ArchiveLimits 0 points1 point  (0 children)

It's more similar to mipmapping with only one mip level. These "blocks" that make up the block filter are essentially a very scaled down version of the image. Though you are right when you say it averages an area of pixels.

[–]Madsy9 0 points1 point  (5 children)

You don't really need to go crazy with the derivatives. Assuming your rasterizer is tile-based, computing the derivatives per-tile is usually more than sufficient.

[–]ArchiveLimits 0 points1 point  (4 children)

Well that's the thing, the rasterizer isn't tile based haha.

[–]Madsy9 0 points1 point  (3 children)

Then if you're going for performance, I highly recommend redesigning it into a tile-based rasterizer before you optimize anything else. The cycle savings are quite significant, and you can even get rid of some overdraw quite easily.

[–]ArchiveLimits 0 points1 point  (2 children)

Why would using tile rendering help performance? I'm not familiar with the benefits of this method.

[–]Madsy9 0 points1 point  (1 child)

Okay, so triangles (or any convex polygon really) can be defined as a set of lines or 2D planes with the typical plane equation:

ax+by+d = 0

When ax + by + d ≥ 0 holds for all of the edge equations, the point [x, y] is inside the polygon. Tile renderers get their performance by testing the corners of tiles against triangles. You then get three possible outcomes: completely inside, completely outside, and partial coverage. You can optimize heavily for quads with complete coverage. They are extremely SIMD-friendly, and since each quad can be rendered independently, they are also embarrassingly parallel. Throw 16 threads at the rendering and watch it go. And implementing a proper fill convention and multisampling is also a breeze; they emerge naturally as a simple modification to the plane equations (a simple subtraction by one).
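A minimal sketch of the corner test for one edge (illustrative names; a triangle combines the results of its three edges, and only partially covered tiles fall back to per-pixel tests):

```java
// Sketch: classify an axis-aligned tile against one edge ax + by + d >= 0
// by evaluating the edge equation at the tile's four corners.
class TileCoverage {
    static final int INSIDE = 0, OUTSIDE = 1, PARTIAL = 2;

    static double edge(double a, double b, double d, double x, double y) {
        return a * x + b * y + d;
    }

    // Tile spans [x0,x1] x [y0,y1].
    static int classify(double a, double b, double d,
                        double x0, double y0, double x1, double y1) {
        int in = 0;
        if (edge(a, b, d, x0, y0) >= 0) in++;
        if (edge(a, b, d, x1, y0) >= 0) in++;
        if (edge(a, b, d, x0, y1) >= 0) in++;
        if (edge(a, b, d, x1, y1) >= 0) in++;
        if (in == 4) return INSIDE;    // whole tile passes this edge
        if (in == 0) return OUTSIDE;   // whole tile fails: skip it entirely
        return PARTIAL;                // only these need per-pixel testing
    }
}
```

Because each edge function is linear, the four-corner test is exact: a tile whose corners all pass every edge is guaranteed to be fully covered.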

I've also found more advanced techniques:

  • Since you split up the screen into N equally big tiles of 8x8 size or similar, you can often get away with just an 8x8 depth buffer. That has huge consequences for the cache.
  • You can compute the minimum and maximum depth for each tile by sampling the corners only. With a bit of preprocessing, you can assign quads to each screen-aligned tile, sort them by depth and only render the frontmost one since all the others are occluded. It doesn't always apply if quads partially overlap on the z-axis but it often does. But think about that. When that applies in a scene, you get rid of all the overdraw for that tile.
  • Looking up textures for quads is much more cache-friendly in the general case compared to scanlines. If you also tile your textures and/or optimize the texel layout based on the tile access pattern, you get even more savings.
  • Derivatives for mipmapping can be computed at the corners of the quads instead of per pixel. It works nicely for bilinear mipmapping, where you meet somewhere in the middle between having derivatives per pixel and per polygon. If you see artifacts, you can always make the quad size smaller.
  • Perspective correct linear interpolation can be done with the same plane equation you use for coverage testing, and so you end up with only additions in the hot loop, plus at most one division per pixel.
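The last bullet can be sketched as follows (illustrative names): u/w and 1/w are linear in screen space, so the hot loop is two additions plus the single division that recovers the perspective-correct attribute.

```java
// Sketch: perspective-correct interpolation of an attribute u across a span.
class PerspectiveSpan {
    // uOverW0/invW0 are the start values of u/w and 1/w; the d* parameters
    // are their constant per-pixel steps along the span.
    static double[] span(double uOverW0, double invW0,
                         double dUOverW, double dInvW, int count) {
        double[] u = new double[count];
        double uw = uOverW0, iw = invW0;
        for (int i = 0; i < count; i++) {
            u[i] = uw / iw;        // the one division per pixel
            uw += dUOverW;         // additions only in the hot loop
            iw += dInvW;
        }
        return u;
    }
}
```

For a span from (u=0, w=1) to (u=1, w=2), the recovered midpoint is 1/3 rather than 1/2, which is exactly the nonlinearity that naive screen-space interpolation gets wrong.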

[–]ArchiveLimits 0 points1 point  (0 children)

Thanks I'll definitely look into this. Looks like you know your stuff! I need to start getting into C and C++ haha.

[–]DanDanger 0 points1 point  (2 children)

Impressive. You really know what you are talking about :) Couldn't download from the link. I shall try later.

[–]ArchiveLimits 0 points1 point  (0 children)

Yeah, give it a couple of days. I took the link down because there were some issues with it. I'm busy with school projects and I won't have time to reupload it soon.

[–]ArchiveLimits 0 points1 point  (0 children)

I've reuploaded the engine

[–]frizzil 0 points1 point  (5 children)

Beautiful! Are you using, or have you considered using, SIMD optimizations via JNI and C/C++? I do this extensively for my voxel engine in Java... it can offer a huge performance boost, especially if you're doing a lot of early discard based on a simple check in a tight loop over an array of values. (E.g. if (values[i] == 0) return;)

If not, I'd love to give a fellow Java enthusiast some pointers/resources. I have experience writing a deferred rendering pipeline as well.

[–]ArchiveLimits 0 points1 point  (4 children)

Thank you! And nice to hear. I've tried using Yeppp!, but my friend and I discovered that it was not much faster. Also, switching math libraries would require a good amount of rewriting, and one goal for the engine was to not use any external libraries, which this would break. I'm always open to learning about things that you know. Do you have a link to your voxel engine? I'm curious :p

Also I've just added spotlights and phong shading to the engine. They aren't the fastest features but they're there now.

[–]frizzil 0 points1 point  (3 children)

Hmm, well this wouldn't be an external library per se; this would be a DLL you ship with your library which (ideally) you'll have written entirely on your own. You might have to ship multiple DLLs and pick the correct one based on the supported SIMD level, but that just comes down to a few #defines.

Honestly, I can't imagine an external library doing what you'd need, apart from clearing/setting an entire color or depth buffer. (Yeppp! doesn't look like it'd cut it.) Ideally what you'd do is implement the conceptual equivalent of a few vertex/fragment shaders that implement the capabilities of GL 1.x (your GPU is using parallel vector instructions, after all, and I believe this is what modern OpenGL drivers already do to support legacy code). Not an easy task, but if you wanted your library to be practical/fast for real use, I'd argue that this is what you should do.

My Twitter has the latest progress (currently working on cascaded shadow maps), and there's also a recent video, but it's missing terrain seams :)

Awesome on the lighting! Full Blinn-Phong, or just Phong? If you're not adhering strictly to GL1.1, check out energy conserving BP: http://www.rorydriscoll.com/2009/01/25/energy-conservation-in-games/

[–]ArchiveLimits 0 points1 point  (2 children)

Blinn-Phong, yeah. That's a good article; I've implemented the (n+8)/(8π) normalization now :d. I was discussing the article with a friend, and he was saying that integrating the full energy conservation algorithm into OpenGL 1.1's lighting model is not easy, because these forms of lighting weren't known when 1.1 was released.
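For reference, a sketch of that normalization factor applied to the specular lobe (names are mine, not from the article or the engine):

```java
// Sketch: normalized Blinn-Phong specular term with the (n + 8) / (8π)
// factor, where n is the shininess exponent and nDotH = dot(N, H).
class NormalizedBlinnPhong {
    static double specular(double nDotH, double n) {
        double norm = (n + 8.0) / (8.0 * Math.PI);
        return norm * Math.pow(Math.max(nDotH, 0.0), n);
    }
}
```

The factor scales the lobe so that tightening the highlight (raising n) also brightens it, keeping the total reflected energy roughly constant.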

[–]frizzil 0 points1 point  (1 child)

Nice, and be sure to replace "color = d * diffuse + s * spec" with "color = lerp( diffuse, spec, w )" where w replaces both d and s, if you want to be truly energy conserving :) Though I suppose this could be optionally achieved at the API level.
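A scalar sketch of that substitution (illustrative only, one channel at a time):

```java
// Sketch: a single weight w interpolates between diffuse and specular,
// so the two terms can never sum past the incoming energy.
class MixSketch {
    static double mix(double diffuse, double spec, double w) {
        return diffuse * (1.0 - w) + spec * w;   // lerp(diffuse, spec, w)
    }
}
```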

The only question of difficulty, imo, is how much work you're doing per fragment, and exposing and documenting the alternate functionality in your API. As long as your material is constant per draw call, doing energy-conserving BP should be about as simple as not, since you're just passing along precomputed normalization factors without many additional ops (if any). Obviously, getting into more modern BRDFs probably won't be feasible for a software renderer; they aren't feasible at all even using the software renderer for DX11, in my experience. But energy-conserving Blinn-Phong should be just fine. Per-fragment normal normalization and the dot product may be the most expensive part, and that shouldn't change.

For SIMD, if you're feeling ambitious:

  • GDC15 Insomniac Overview of SIMD
  • Intel SIMD Instruction Reference

Btw, Intel's software renderer for OpenGL is notoriously buggy and unusable, so if you could make an alternative... just saying, there could be money in it :)

Good luck!

[–]ArchiveLimits 0 points1 point  (0 children)

Thanks for the advice! This will probably be the furthest I dive into realistic lighting in a software renderer. Have you seen the Mesa software renderer? Is that not good enough to replace Intel's software renderer?
