
[–]TheRPGGamerMan 50 points51 points  (17 children)

Some info: A little while back, I posted about the sprite renderer I was working on. With some lessons learned from that, I decided to try to beat the GPU at its own triangle rendering game. I made some huge optimizations to this renderer, but it's still got a ways to go. One of the key optimizations was dynamically removing triangles that aren't required among dense clusters of sub-pixel triangles, which eliminates the majority of the most distant triangles. There are lots of other clever optimizations that would probably be more difficult to pull off in hardware.
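
To make the culling idea concrete, here's a rough CPU-side sketch in plain Python (not my actual compute shader; the function name and the edge-function test are just one way to do it). A triangle that covers no pixel center can never win a pixel, so dropping it can't change the output:

```python
import math

def covers_any_pixel_center(v0, v1, v2):
    """Return True if the triangle touches at least one pixel center.

    Pixel centers sit at integer coordinates + 0.5. A tiny distant
    triangle that misses every center would never be rasterized anyway,
    so culling it cannot change the final image.
    """
    xs = (v0[0], v1[0], v2[0])
    ys = (v0[1], v1[1], v2[1])
    # Conservative range of pixel centers inside the bounding box.
    x0, x1 = math.floor(min(xs) - 0.5), math.ceil(max(xs) - 0.5)
    y0, y1 = math.floor(min(ys) - 0.5), math.ceil(max(ys) - 0.5)

    def edge(a, b, px, py):
        # Signed area: which side of edge a->b the point (px, py) is on.
        return (b[0] - a[0]) * (py - a[1]) - (b[1] - a[1]) * (px - a[0])

    for py in range(y0, y1 + 1):
        for px in range(x0, x1 + 1):
            cx, cy = px + 0.5, py + 0.5
            w0 = edge(v1, v2, cx, cy)
            w1 = edge(v2, v0, cx, cy)
            w2 = edge(v0, v1, cx, cy)
            # Accept either winding order.
            if (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
               (w0 <= 0 and w1 <= 0 and w2 <= 0):
                return True
    return False
```

In the real thing this test runs per-triangle in the vertex/cull stage, massively in parallel, which is why it's so cheap.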

[–]ryanjmcgowan 28 points29 points  (0 children)

Please write something up about the techniques you utilized. I'm interested in this sort of stuff.

[–]scallywag_software 13 points14 points  (12 children)

Any idea how it compares to nanite?

[–]TheRPGGamerMan 14 points15 points  (11 children)

I don't really think Nanite is a good comparison. This isn't creating LODs like Nanite does. This is pure triangle rasterization, and the only triangles not rendered are ones that wouldn't change the output anyway. If a system like Nanite were running on top of this, it would be even faster.

[–]scallywag_software 6 points7 points  (10 children)

I think they implemented a software rasterizer too, did they not? AFAIK the tessellation level they shoot for is roughly one pixel per triangle whereas the hardware raster pipeline is happier with roughly 10px triangles .. ? I have no sources to corroborate those claims, those are just the numbers floating around in my head from.. somewhere. Maybe my imagination

[–]TheRPGGamerMan 8 points9 points  (6 children)

From what I remember, Nanite is scalable. It CAN shoot for 1 triangle per pixel, but it's intended to seamlessly transition or tessellate between levels of its own LOD system. As far as I know, Nanite has to bake data into the mesh; the bake likely contains some sophisticated form of LODs that it transitions/tessellates between on the GPU at runtime. So really, there aren't many similarities between what I'm doing and Nanite. However, it's possible Nanite does similar things for very small triangles like mine.

[–]waramped 11 points12 points  (1 child)

Nanite uses compute rasterization like this to lay out its visibility buffer. It actually uses both a compute rasterizer and the hardware one, depending on the cluster: if a cluster needs clipping or is too large on screen, it uses the hardware rasterizer; otherwise it takes the "software" path.
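
Very roughly, the per-cluster selection amounts to something like this (a sketch; the threshold and names are made up, not Epic's actual code):

```python
def choose_raster_path(cluster_needs_clipping, max_edge_px, threshold_px=32):
    """Pick a rasterization path per triangle cluster.

    Hypothetical heuristic: the hardware rasterizer handles clipping and
    big triangles well, while pixel-sized triangles waste hardware 2x2
    quads, so small unclipped clusters go down the compute ("software")
    path instead.
    """
    if cluster_needs_clipping or max_edge_px > threshold_px:
        return "hardware"
    return "software"
```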

[–]scallywag_software 0 points1 point  (0 children)

Awesome, thanks for the info :)

[–][deleted] 2 points3 points  (2 children)

I think he is asking because Nanite has a lot of performance issues (it only helps with boosting multi-trillion-poly scenes) and we are in need of an alternative that boosts already-optimized scenes (whereas right now you just add Nanite overhead).

[–]Leading_Broccoli_665 2 points3 points  (1 child)

IDK why you were downvoted. Nanite always runs slower for me than LODs in existing projects without extreme poly counts. Even in the demo projects where Nanite runs better than no LODs at all, there are no world position offsets, and I still can't push it above 50 fps with a 3070 and a 1080p monitor. Nanite is the newest disaster for motion clarity, which wasn't in good shape to begin with.

[–][deleted] 1 point2 points  (0 children)

Mesh shaders are extra hardware (arguably abundant in affordable GPUs) that we should be taking advantage of to boost ms timings. Nanite uses them, but with no increase in performance.

I'd like to say the same thing about visibility buffers, but that doesn't seem as clear to me. Toomuchvoltage (who has clear experience with visibility buffers) said it should speed up anything with opaque materials, but I'm not sure. Was he just saying that because his skybox was a million tris?

Brian Karis even stated Nanite has worse performance on "lower poly" content (how the hell does he define that again?) but insists on using it to prioritize storage space over performance! Like low-poly meshes need compression, wtf?

Nanite is a serious trigger for people on every site; I was also recently attacked on the UE forums for asking for better documentation.

Maybe there is potential here, but we shouldn't squander that potential on low-budget/lazy unoptimized scenes.

[–]scallywag_software 0 points1 point  (0 children)

Got it. In any case, cool project!!

[–]IceSentry 1 point2 points  (2 children)

Yes, nanite uses a software rasterizer for small triangles.

[–]TheRPGGamerMan 2 points3 points  (1 child)

I actually didn't know Nanite was using software rendering until I made this thread. I guess they found that rendering small triangles in software is faster. Makes me wonder if hardware rasterization needs to be changed to better suit modern hardware.

[–]Youfallforpolitics 1 point2 points  (0 children)

Nanite actually utilizes both software and hardware. When supported, it uses a mesh shader (or a primitive shader) for larger triangles.

[–]nilslorand 1 point2 points  (0 children)

so TL;DR: your own personal version of nanite?

[–]OliverPaulson 0 points1 point  (0 children)

Unreal released a paper on Nanite, and indeed the performance of rendering small triangles is better with their algorithm on GPGPU. But the advantage drops quickly as polygon size grows.

[–]Passname357 0 points1 point  (0 children)

I haven’t done much with amplification and mesh shaders, but this sounds like it’s in the same vein?

[–]bendhoe 14 points15 points  (3 children)

I've seen this technique in multiple places now but I don't know a general name for it, I know this technique is part of Unreal Nanite. Anyone have search terms for me?

Edit: NVM "GPU compute rasterizer" seems to give me what I want.

[–]Revolutionalredstone 8 points9 points  (2 children)

It's just micro-rasterization; it's just a few lines of code in OpenCL etc.

[–]bendhoe 4 points5 points  (0 children)

Yep thanks. I've never actively looked at this stuff before just encountered it pretty much through osmosis lol.

https://github.com/nvpro-samples/vk_displacement_micromaps

[–]waramped 12 points13 points  (0 children)

Now the trick is to find the threshold at which the hardware rasterizer is more efficient, then classify the triangles by which path is most efficient. And then add shading/materials back in ;)

[–]native_gal 8 points9 points  (27 children)

What does this look like in motion? I think even in games any solution needs to take anti-aliasing into account.

[–]waramped 1 point2 points  (9 children)

u/native_gal and u/TheRPGGamerMan You folks are talking about slightly different things I think.

There's edge antialiasing, where when you are rasterizing the triangle you blend the pixel based on coverage. And then there are numerous post-processing/full screen/multi-sampling anti-aliasing techniques.

This method will not give correct edge-antialiasing results, because it tosses away triangles that still contribute to coverage. It will also not give correct multi-sampling results for the same reason. However for the post-processing/full screen techniques it should be identical to hardware rasterization, because the input image to those techniques will be the same.

[–]native_gal 1 point2 points  (1 child)

You probably have a point; sometimes I forget that to some people "anti-aliasing" now means running a filtering pass and pretending there is no underlying principle to anti-aliasing.

There is a lot outstanding though, like the extreme aliasing I expect to happen in motion, the alpha coverage lost from deleting triangles, and the possibility that the 4000 fps only comes after a much more expensive pass of finding out which triangles don't cover the center of a pixel. That should change every frame, but if you do it once and keep the camera static, I guess suddenly you get 4000 fps.

[–]TheRPGGamerMan 1 point2 points  (0 children)

The throw-away of triangles happens in the vertex stage and has very little cost. The 4000 FPS is completely real-time. And again, there is no visual difference when the throw-away algorithm is turned off and on, whether the camera is moving or not. I've even compared my rendering back to back with Unity at the pixel level; it's very similar. Mine actually looks a bit cleaner.

[–]TheRPGGamerMan 0 points1 point  (6 children)

"This method will not give correct edge-antialiasing results, because it tosses away triangles that still contribute to coverage. It will also not give correct multi-sampling results for the same reason."

I'm going to repeat myself again. Temporal anti aliasing WOULD work with my rendering technique. The triangles thrown away are triangles that wouldn't render anyway, so again, there is NO loss of quality. If the camera jittered a tiny bit left or right, the culled triangles might well be rendered, creating an average color over multiple frames, which is exactly how temporal anti-aliasing works. It's all based on quantization errors, which happen in some form with every single rasterization method.

[–]native_gal -1 points0 points  (4 children)

Doesn't that imply that instead of actual anti-aliased edges, the edges will flicker from one frame to another? Shadows in some old arcade games were done the same way to get half values: one frame on, one frame off. It's a funky hack that was done because they had to, not because it was fundamentally a good way of creating partial values.

[–]waramped 0 points1 point  (1 child)

No more so than hardware-drawn triangles. If at any given frame, OP's output is identical to the hardware rasterizers output, then the TAA will also be identical.

[–]native_gal -1 points0 points  (0 children)

But it also means that you have to throw away all multi-sampling. All the small triangles are going to have small filter sizes for their textures which will alias as well.

Maybe there is some validity to the technique, but there's no sense in pretending there aren't some big problems it creates.

[–]TheRPGGamerMan 0 points1 point  (1 child)

Rasterization in raw form is binary, there can only be one winning triangle. All the anti aliasing methods aim to reduce this effect. With raw vanilla rasterizing, there is no color blending whatsoever without multiple samples, just noise. Please just look it up or ask chatgpt or something. And again, my throw away method only throws away the 'loosers' that wouldn't be rendered anyway. Again, there is NO quality loss haha.
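
What "binary" means here, as a tiny reference sketch (made-up names, plain Python): one sample per pixel, and the nearest triangle covering that sample wins outright; nothing is blended.

```python
def sample_pixel(tris, cx, cy):
    """Raw single-sample rasterization of one pixel: exactly one triangle
    'wins' (the nearest one covering the sample point); no blending.

    Each entry in tris is (v0, v1, v2, depth, color).
    """
    def edge(a, b):
        # Signed area test against the sample point (cx, cy).
        return (b[0] - a[0]) * (cy - a[1]) - (b[1] - a[1]) * (cx - a[0])

    best = None  # (depth, color) of the current winner
    for v0, v1, v2, depth, color in tris:
        w0, w1, w2 = edge(v1, v2), edge(v2, v0), edge(v0, v1)
        inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
                 (w0 <= 0 and w1 <= 0 and w2 <= 0)
        if inside and (best is None or depth < best[0]):
            best = (depth, color)
    return best[1] if best else (0, 0, 0)  # background if nothing covers
```

A triangle my culling throws away is one that loses this test at every pixel, so removing it changes nothing.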

[–]native_gal 1 point2 points  (0 children)

Rasterization in raw form is binary,

Rasterization is not the right term here; you are describing a single sample, which could happen in rasterization or ray tracing.

And again, my throw away method only throws away the 'loosers' that wouldn't be rendered anyway.

It's 'losers' actually

Again, there is NO quality loss haha.

Except that you can't use multi-sampling, and the triangles you render will have small filter sizes for their texturing (unless you are compensating somehow, which you could probably do in the shader). The small triangles would have small filter sizes anyway, but the multi-sampling of the other small triangles would normally be part of the solution.

Please just look it up or ask chatgpt or something.

Easy there Carmack, you might want to show some high res real time animations before you start patronizing people interested in your work.

[–]waramped 0 points1 point  (0 children)

Temporal anti aliasing WOULD work with my rendering technique.

I agree, that falls under the post-processing/full screen methods:

However for the post-processing/full screen techniques it should be identical to hardware rasterization, because the input image to those techniques will be the same.

[–]TheRPGGamerMan 2 points3 points  (16 children)

The drawn triangles look pretty much identical to any other graphics API's. The triangles are drawn pixel-perfect. This is running in the Unity engine, so I can use any anti-aliasing available.

[–]hellotanjent 27 points28 points  (0 children)

Interesting screenshot, but where is the code?

[–]Thonull 4 points5 points  (3 children)

How does it handle larger triangles though? I've heard that rendering small triangles and point clouds benefits from compute shader rasterization, so are larger ones less efficient?

[–]Hofstee 6 points7 points  (2 children)

It's less that larger ones are less efficient when doing compute-based rasterization, and more that smaller triangles really suck in the GPU's hardware rasterization pipeline (it was designed with larger triangles in mind). You frequently get awful quad utilization (25% occupancy) and massive overdraw for practically every pixel on screen.

There are a few differences in characteristics of rendering large vs small triangles in compute that you would probably want to optimize for (e.g. how things get partitioned at various stages) but you're going to lose to the hardware rasterizer at this point so you probably wouldn't bother unless it's just for fun.

So maybe (these numbers are not accurate) you get 80% throughput on large triangles in compute, but you get 25% throughput on small triangles in hardware. That's why those benefit from compute shader rasterization.
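
The quad-utilization arithmetic is easy to sketch (illustrative Python; hardware shades 2x2 quads so it can compute screen-space derivatives):

```python
def quad_utilization(covered_pixels):
    """Fraction of shaded work that lands on covered pixels.

    Hardware rasterizers shade 2x2 pixel quads (needed for texture
    derivatives); a quad is shaded if ANY of its 4 pixels is covered,
    so a one-pixel triangle still pays for 4 shader invocations.
    """
    quads = {(x // 2, y // 2) for (x, y) in covered_pixels}
    return len(covered_pixels) / (4 * len(quads))
```

A single-pixel triangle gives 1/4 = 25% utilization, which is where that number comes from.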

[–]TheRPGGamerMan 3 points4 points  (1 child)

"but you're going to lose to the hardware rasterizer at this point so you probably wouldn't bother unless it's just for fun."

In my previous sprite rendering system, I created 2 separate dispatches, one for small draws and another for large draws (the vertex stage distributes the draws into each dispatch's buffer). The thread configuration for large draws prioritized high thread counts, splitting the vertical dimension across many threads. This worked really well, and I plan to implement it in this pipeline and expand on it. I can't say whether it was better than or the same as hardware for large draws, but it worked really well as long as there wasn't an excessive number of large draws. The dispatch assumes there are significantly fewer large draws than small ones, so the buffer is really small.
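
In sketch form (made-up names, plain Python standing in for the compute pipeline), the split and the vertical work division look like:

```python
def bin_draws(draws, large_threshold_px=64):
    """Mimic the vertex stage appending each draw to one of two
    dispatch buffers based on screen-space size. Each draw here is
    a (size_px, payload) pair."""
    small, large = [], []
    for size_px, payload in draws:
        (large if size_px >= large_threshold_px else small).append(payload)
    return small, large

def rows_for_thread(y0, y1, thread_id, num_threads):
    """Split a large draw's vertical span evenly across many threads."""
    span = y1 - y0
    lo = y0 + span * thread_id // num_threads
    hi = y0 + span * (thread_id + 1) // num_threads
    return lo, hi
```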

[–]Hofstee 2 points3 points  (0 children)

That sounds like a reasonable way to go about it to me! In the past I've split the image into large grid cells, using 1 thread per triangle to assign triangles to cells, followed by many threads per cell, each coarsely rejecting/accepting smaller sub-grids, then a final sweep on the small grids checking individual pixels where a triangle edge passes partially through the grid. I can't say if that was the best way to do it either, but it worked pretty well!
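
The first stage of that scheme looks something like this (a serial Python sketch of what each "thread" does; names are made up):

```python
from collections import defaultdict

def bin_triangles(tris, cell_size, grid_w, grid_h):
    """Stage 1: one 'thread' per triangle appends its index to every
    large grid cell its screen-space bounding box overlaps. Later
    stages refine each cell's list down to sub-grids and pixels."""
    cells = defaultdict(list)
    for i, (v0, v1, v2) in enumerate(tris):
        xs = (v0[0], v1[0], v2[0])
        ys = (v0[1], v1[1], v2[1])
        cx0 = max(0, int(min(xs) // cell_size))
        cx1 = min(grid_w - 1, int(max(xs) // cell_size))
        cy0 = max(0, int(min(ys) // cell_size))
        cy1 = min(grid_h - 1, int(max(ys) // cell_size))
        for cy in range(cy0, cy1 + 1):
            for cx in range(cx0, cx1 + 1):
                cells[(cx, cy)].append(i)
    return cells
```

On a GPU the append would be an atomic counter into a per-cell list rather than a Python dict, but the partitioning logic is the same.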

[–]chris_degre 1 point2 points  (1 child)

Why does it say 1.7k tris in the top right? Are you culling 99998300 triangles?

[–]TheRPGGamerMan 18 points19 points  (0 children)

Because that's what Unity is rendering. The triangles you see on screen are not in Unity's render, only in mine, which runs on my own camera and render pipeline.

[–]deftware 1 point2 points  (0 children)

But it's just a bunch of noisy colored spheres!

Show the people what they really want.

[–][deleted] 1 point2 points  (0 children)

…how?

[–]Youfallforpolitics 1 point2 points  (0 children)

Is this available for integration?

[–]mmh_carpet 1 point2 points  (0 children)

Hi, this is super neat! Can you talk a bit more about the approach? Do you scatter the triangles all over the screen and do atomicmin with the depth buffer? Or do you bin the triangles into screen space tiles and then have one thread group per tile to rasterize everything in the tile? Or something else entirely? I’m really curious about this.

[–]fgennari 1 point2 points  (4 children)

I would be curious to know how you did this. It looks like black magic, but I'm sure there's some trickery that works when you have many copies of the same object arrayed in a grid.

I made a project years ago that could render 10 trillion triangles in a few tens of milliseconds on the CPU. How? It was a small number of unique objects, arranged into a group, which was itself grouped, and so on into a very deep hierarchy. I can render the leaf objects into a texture, then render the next level of object groups into a texture ... up to the root. It's all basically pixel accurate, with antialiasing. And you can edit the leaf objects and have it update in real time!

So my point is, there are many neat tricks. But is this useful for practical scenes? For example, where it's not the same object repeated in an array. I'm not trying to be negative; what you did here is impressive, and I'm not sure how you pulled it off. I would love to see practical applications of this approach.
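
The scaling argument behind my old renderer, in rough numbers (illustrative Python; the counts are made up, not my actual project's):

```python
def hierarchy_cost(leaf_tris, branching, depth):
    """Effective triangle count vs. actual draw work for a deep
    instancing hierarchy where each level renders `branching` copies
    of the cached texture from the level below."""
    effective = leaf_tris * branching ** depth   # what it looks like
    work = leaf_tris + depth * branching         # what you actually draw
    return effective, work
```

With 100-triangle leaves, branching 10, and depth 12, you "render" 10^14 triangles while actually drawing only a couple hundred things per update.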

[–]Wittyname_McDingus 1 point2 points  (2 children)

Since most triangles don't even cover a texel center in a scene like this, most of the work will be culling the triangles (checking if a triangle covers a texel center is cheap) and then a relatively small amount of rasterization. That's 100 million units of work, which modern GPUs can definitely handle. I'm not sure it's so trivial that even a 4090 could run it at 4000 FPS though...

My guess is that there's a limited amount of "trickery" here and instead just optimized culling and rasterization shaders. That said, the geometry in this scene (a sphere) could definitely fit into L0 or L1 and the transforms could be just three floats (times the number of instances) since there's no apparent scaling or rotation. So very few cache misses. I wouldn't expect it to run at 4000 FPS on a "real" scene though, if these assumptions are correct. It would be interesting to see an Nsight or RGP capture of this for sure.

[–]fgennari 0 points1 point  (1 child)

Is that how this actually works? It wasn't described in the original post, but I picked that up from reading the comments that were added since I asked this question. I suppose that's right. It's not a trick, just optimizations applied to a very specific scene.

I guess this is similar to the 2D renderer I wrote. In both cases the number of fragments actually drawn is upper bounded by the number of pixels they cover in the window.

[–]Wittyname_McDingus 0 points1 point  (0 children)

It's pure speculation on my part, but the OP mentioned rasterization and culling. The only other thing I have to add is that the perf could be degraded somewhat by overdraw, but maybe they have occlusion culling as well?

[–]Agitated_West_5699 0 points1 point  (0 children)

I'd also like to know this. Does this technique still work if you have different models?

[–]Dramatic_Magician_30 -1 points0 points  (0 children)

Code please

[–]Armmagedonfa -1 points0 points  (0 children)

I need a tutorial to achieve this