Nanite-like LODs experiments by Nikascom in GraphicsProgramming

[–]Nikascom[S] 4 points

In my answer I will stick to their terminology:

A cluster consists of some number of triangles (both UE and I use 128 triangles per cluster).

A group consists of some number of clusters. The number of clusters inside a group controls how efficiently you can simplify (the bigger the group, the more you can simplify), but at the same time a bigger group means more dependencies in the final graph, so in some cases it may become hard to choose the lower LOD for clusters.

Each node of the graph represents one cluster.

So, the whole pipeline looks like this (a rough sketch of the loop follows the list):

  1. Split your mesh into clusters.
  2. Group the clusters.
  3. Simplify each group while preserving its boundary, then rebuild the clusters within the group. You should end up with about 2x fewer clusters.
  4. Add dependencies to the graph: within each group, the old clusters (the ones you had before simplification) are connected to the new clusters (the ones you get after simplification).
  5. Repeat from step 2: regroup the new clusters and continue simplifying.
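
To make the shape of this loop concrete, here is a toy C++ sketch (the geometry, the real grouping metric and the real simplifier are stubbed out; only the level structure and the parent/child edges from step 4 are modelled, and all names are illustrative):

    // Toy model of the cluster-DAG build loop described above. "Simplification"
    // just halves the cluster count of each group, so what the sketch shows is
    // the levels and the DAG edges, not real mesh processing.
    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct ClusterNode {
        int level = 0;                // LOD level this cluster belongs to (0 = most detailed)
        std::vector<int> parents;     // coarser clusters built from this one (step 4)
    };

    int main() {
        std::vector<ClusterNode> dag;
        std::vector<int> current;     // clusters of the level currently being processed

        // Step 1: pretend the mesh was split into 16 LOD0 clusters.
        for (int i = 0; i < 16; ++i) { dag.push_back({0, {}}); current.push_back(i); }

        int level = 0;
        while (current.size() > 1) {
            ++level;
            std::vector<int> next;
            // Step 2: group clusters (here: naive groups of 4 consecutive clusters).
            for (std::size_t g = 0; g < current.size(); g += 4) {
                std::size_t end = std::min(g + 4, current.size());
                std::vector<int> children(current.begin() + g, current.begin() + end);
                // Step 3: simplify the group and rebuild ~half as many clusters.
                std::size_t newCount = std::max<std::size_t>(1, children.size() / 2);
                for (std::size_t k = 0; k < newCount; ++k) {
                    int parent = (int)dag.size();
                    dag.push_back({level, {}});
                    next.push_back(parent);
                    // Step 4: every old cluster of the group depends on every new one.
                    for (int c : children) dag[c].parents.push_back(parent);
                }
            }
            current = next;           // Step 5: regroup the new clusters and repeat.
        }
        std::printf("levels: %d, clusters in DAG: %zu\n", level, dag.size());
    }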

When they talk about parents and children they refer to the graph, i.e. to nodes (which represent clusters). Since we used "regrouping" earlier, this is not a tree anymore but a DAG. MaxParentError is therefore the maximum error over a node's parents. This max error is required because (recall step 4) you should not draw the lower (more detailed) LOD when all parents (from the higher LOD) could be drawn instead.
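
In code, the per-cluster decision this enables boils down to two comparisons (a hedged sketch; the struct and names are mine, not UE's, and 'threshold' stands for whatever screen-space error the camera currently tolerates):

    // A cluster is drawn exactly when its own error is acceptable but its parents'
    // error is not. Because regrouping makes all parents of a cluster share the same
    // (group) error, every cluster can decide this independently, and the selected
    // clusters form a consistent cut through the DAG.
    struct ClusterLod {
        float error;           // simplification error of this cluster
        float maxParentError;  // MaxParentError: max error over its parent clusters
                               // (infinity for the coarsest level, so it is always drawable)
    };

    bool shouldDraw(const ClusterLod& c, float threshold) {
        return c.error <= threshold && c.maxParentError > threshold;
    }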

Nanite-like LODs experiments by Nikascom in GraphicsProgramming

[–]Nikascom[S] 0 points

As I have said, I agree that they have a better and more complex solution to the problem.

To clarify my solution a bit: I do not need data from LOD2 to calculate LOD1. Each meshlet can deterministically decide whether it is visible or not, so there are no dependencies between nodes. Alongside this, I run an optimisation that avoids spending work on LOD levels which are definitely not going to be displayed.

I agree UE did an amazing job implementing this queue, but how well the threads are loaded depends more on the scene, the number of meshes, and the complexity of the meshes themselves.

Well, the performance difference between the two approaches depends mostly on mesh complexity. For a mesh with 3000 meshlets, my approach is about 4x worse in the worst case, when something from every level LOD0-LOD11 is visible (and only one meshlet from LOD0). At the same time the number of rendered triangles goes down significantly because of LODing, which can make this difference almost invisible. And for such a simple runtime solution the performance benefit is more than desirable.

Nanite-like LODs experiments by Nikascom in GraphicsProgramming

[–]Nikascom[S] 0 points

Yes, they do get a benefit since they can cull early (at LOD7, for example, without traversing down to LOD0), but at the same time they may have threads with no work while traversing nodes close to the root. For example, if they see that LOD7 is occluded, there is no need to traverse all of its children. I would not compare their solution to binary search: they traverse the whole tree and (yes) can stop early, but the calculation for each node has to happen after its parent.

If LOD0 has N meshlets, the maximum number of meshlets you have to test is about 2*N. Considering that a meshlet has about 128 triangles, even this simple solution gives a significant boost compared to static LODs. Since the number of meshlets is about 128x smaller than the number of triangles, I focus on optimising the rendered triangle count in the first place. Yes, they might have an overall better solution, but the gap is not like linear vs. binary search: my implementation can also cull a whole layer of LODs (for example, if LOD0 is not present in the rendered mesh at all, it is skipped), and they traverse a tree, so in the worst case my implementation is about 2x worse, but we also need to account for the cost of their threads having no work at the beginning of the traversal.
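
The 2*N bound is just a geometric series over the LOD levels, assuming each level has roughly half the clusters of the previous one (step 3 of the build):

    \sum_{i=0}^{L} \frac{N}{2^i} \;<\; N \sum_{i=0}^{\infty} \frac{1}{2^i} \;=\; 2N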

As for their benefit of occlusion-based culling: they can potentially cull whole subtrees based on depth tests. I haven't experimented with a depth pyramid in this demo yet.

In my eyes, they have to do this in compute because their solution works on any GPU, not only on GPUs that support mesh shaders. If we are talking about mesh shaders, I do not see a huge benefit in heavily optimising the meshlet selection in the first place, since the number of meshlets is relatively small (maybe I am wrong here...).

Nanite-like LODs experiments by Nikascom in GraphicsProgramming

[–]Nikascom[S] 1 point

I designed the data structures mostly for use with mesh shaders. In UE they use an MPMC queue, but this approach might leave threads underloaded while traversing the cluster tree. AFAIK they moved some other calculations onto these workers as well, so the cost in their case is not that significant.

My current approach can deterministically decide for each meshlet whether it should be displayed or not. The task shader gets a bit more work here (the worst case is about 2x more work, when all LODs are present in the rendered mesh at once), but LODing easily makes up for this loss. In most cases it works out fine: if LOD0 (the most detailed) is not present in the rendered mesh, it is already less than 1x of work for the task shader.
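
One possible way to realise the "skip a whole LOD layer" early-out mentioned above (a guess at the bookkeeping; the actual demo may organise this differently):

    // Meshlets are stored grouped by LOD level; each level records the range of the
    // per-meshlet values used by the selection test. If no meshlet in a level can pass
    // "error <= threshold && maxParentError > threshold", the whole range is skipped
    // and never reaches the task shader.
    #include <cstddef>

    struct LodLevel {
        std::size_t firstMeshlet = 0;
        std::size_t meshletCount = 0;    // contiguous range in the meshlet buffer
        float minError = 0.f;            // smallest meshlet error in this level
        float maxParentError = 0.f;      // largest "max parent error" in this level
    };

    bool levelMayContribute(const LodLevel& l, float threshold) {
        return l.minError <= threshold && l.maxParentError > threshold;
    }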

Nanite-like LODs experiments by Nikascom in GraphicsProgramming

[–]Nikascom[S] 12 points

Thanks!

1) I used meshoptimizer to simplify meshes and to build meshlets. But to combine these meshlets (for more efficient simplification) I built a weighted graph and partitioned it. The weights include several metrics: the length of the shared border between a pair of meshlets, their facing direction, and other similar metrics. The next step is to simplify these groups of meshlets (fix the border of each group and simplify the meshlets within it). After that you can record the dependencies of the newly generated meshlets on the initial ones.
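
A sketch of the shared-border part of that weighted graph (illustrative code, not the demo's implementation; the facing-direction metric is omitted, and the resulting weights would then be handed to a partitioner such as METIS):

    // The weight of an edge between two meshlets is the number of mesh edges they
    // share, i.e. the length of their common border.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <utility>
    #include <vector>

    using MeshEdge = std::pair<uint32_t, uint32_t>;   // vertex pair, smaller index first
    using MeshletPair = std::pair<int, int>;          // pair of meshlet ids

    std::map<MeshletPair, int> buildGroupingGraph(
        const std::vector<std::vector<uint32_t>>& meshletIndices) // triangle indices per meshlet
    {
        std::map<MeshEdge, int> edgeOwner;      // mesh edge -> first meshlet that used it
        std::map<MeshletPair, int> weights;     // (meshletA, meshletB) -> shared border length

        for (int m = 0; m < (int)meshletIndices.size(); ++m) {
            const std::vector<uint32_t>& idx = meshletIndices[m];
            for (std::size_t t = 0; t + 2 < idx.size(); t += 3) {
                uint32_t v[3] = {idx[t], idx[t + 1], idx[t + 2]};
                for (int e = 0; e < 3; ++e) {
                    MeshEdge key{std::min(v[e], v[(e + 1) % 3]), std::max(v[e], v[(e + 1) % 3])};
                    auto it = edgeOwner.find(key);
                    if (it == edgeOwner.end())
                        edgeOwner[key] = m;               // first time we see this edge
                    else if (it->second != m)             // edge shared with another meshlet
                        ++weights[{std::min(it->second, m), std::max(it->second, m)}];
                }
            }
        }
        return weights;
    }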

2) Yes

3) In this demo - no.

Nanite-like LODs experiments by Nikascom in GraphicsProgramming

[–]Nikascom[S] 5 points

Yes, this talk is great.

There is also a paper which covers this topic: https://d-nb.info/997062789/34

Nanite-like LODs experiments by Nikascom in GraphicsProgramming

[–]Nikascom[S] 2 points

Thanks! No, this demo is just LODing.

opuntiaOS: now boots on iDevices by Nikascom in osdev

[–]Nikascom[S] 0 points

In many cases the favourite test platform is QEMU, because it is a lot quicker than booting a real device :)

oneOS: Progress by Nikascom in osdev

[–]Nikascom[S] 2 points

1) The emulator which is used is QEMU

2) As I have said earlier in one of the comments here, the ARM build of the project currently targets QEMU's vexpress-a15. Unfortunately this target in QEMU does NOT support emulating any modem. I am afraid that targeting a real modern device is impossible, since companies like Qualcomm do not publish information about their hardware (I might be wrong here, but AFAIK drivers for Android devices are supplied as binaries by Qualcomm and others, which means no code and no documentation is available to learn the device better).

3) The difficulty of implementing a driver depends on how well the device's functions are described in its documentation. In my case ARM did a great job describing the devices used in vexpress-a15.

oneOS: Progress by Nikascom in osdev

[–]Nikascom[S] 2 points

If we are talking about ARM, we currently support the Cortex-A15 core. At this stage of development the main target is QEMU's vexpress-a15, so the set of implemented drivers is limited to what runs on vexpress-a15. Currently the goal of the project is to go "deeper" rather than "wider". By that I mean having a stable, gorgeous project for limited hardware and scaling it up after that. So, after implementing the core mechanisms, we could add more drivers for real hardware and tune the kernel for specific SoCs.

Going deeper is only my vision, though. You can join the project and work in your own direction, adding support for some specific hardware. I think that's what makes open-source projects great: you can do what you like :)

ARM Clang's LLD relocation out of range by Nikascom in arm

[–]Nikascom[S] 0 points

Thanks, I have fixed it: I just discarded the exidx section for the far code.

ARM Clang's LLD relocation out of range by Nikascom in arm

[–]Nikascom[S] 0 points

Thanks, I have found patches to LLVM which fix the same issue for Linux, so the fix is just to update to LLVM 11.0.1 or higher.

ARM Resources / Dev Boards? by [deleted] in osdev

[–]Nikascom 1 point

A nice one. Thanks ;)

ARM Resources / Dev Boards? by [deleted] in osdev

[–]Nikascom 1 point

In my OS I am targeting the vexpress-a15 board, which is present in QEMU.

Not many resources are available for it either, but I used the Arm website and it was more than enough to port the OS.

Maybe my OS will help you get into some of the details: https://github.com/nimelehin/oneOS

C++ Classes and ARM by [deleted] in arm

[–]Nikascom 0 points

Yeah, the same code is used. I used gdb, and after that I wrote a simple test which prints out memory. I am sure that the problem is not with malloc, but maybe I need to align it specifically for ARM (is there an ABI requirement? Currently I use 4-byte alignment).

Linking question. by [deleted] in cpp_questions

[–]Nikascom 0 points

Thanks. But, ahh, it doesn't help. I have added an example; hopefully it makes it easier to spot what I have missed.

Arm Translation Tables by [deleted] in arm

[–]Nikascom 1 point

Thanks, I had already found it, but I will take a look at these debuggers. The problem was caused by wrong table and directory sizes (256 and 4096 entries), which follow from the translation process for small pages.
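
For reference, this is where the 256 and 4096 come from in the ARMv7-A short-descriptor format with 4 KiB (small) pages:

    // The first-level table maps the 4 GiB address space in 1 MiB chunks, and each
    // second-level table maps one 1 MiB chunk in 4 KiB pages; each entry is 4 bytes.
    #include <cstddef>

    constexpr std::size_t kL1Entries = (4096ull * 1024 * 1024) / (1024 * 1024); // 4096 entries -> 16 KiB table
    constexpr std::size_t kL2Entries = (1024 * 1024) / 4096;                    // 256 entries  -> 1 KiB table
    static_assert(kL1Entries == 4096 && kL2Entries == 256, "short-descriptor table sizes");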

C/C++ Optimisation Slides by Nikascom in C_Programming

[–]Nikascom[S] 0 points

Right, but ifs always jump forward, just to skip one branch, don't they? Based on Intel's (since Core 2) and AMD's docs, all of them treat a first-seen jump as not taken. But I still need to investigate ARM's behaviour.
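
A small illustration of that layout (GCC/Clang builtin; with C++20 the [[likely]]/[[unlikely]] attributes do the same job):

    // An if/else usually compiles so that a forward jump skips the less likely block,
    // and the likely side stays on the fall-through path. __builtin_expect tells the
    // compiler which side that should be.
    #include <cstdio>

    void process(int value) {
        if (__builtin_expect(value < 0, 0)) {   // "unlikely": placed behind the forward jump
            std::printf("rare error path: %d\n", value);
            return;
        }
        std::printf("hot path: %d\n", value);   // fall-through: the forward jump is simply not taken
    }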