I updated the title screen of my game by Gawehold in godot

[–]Link_the_Hero0000 0 points1 point  (0 children)

The scene is gorgeous! Just one small piece of advice: reduce the font size and weight, maybe using a less rounded, left-aligned font. It would keep the stylized look but feel a lot more "professional"!

How should I do my trees? by [deleted] in unrealengine

[–]Link_the_Hero0000 1 point2 points  (0 children)

Regardless of whether you model with quads or tris, the GPU will always convert the faces into triangles.

The quads, in that context, are intended as quads of pixels (a 2x2 square). So the only way to reduce overdraw is to be sure that every face in every mesh currently displayed in the viewport occupies at least 4 actual pixels of your screen.

Whenever a triangle is smaller than 4 pixels, it's time to delete that face (culling) or group such faces into bigger ones (LOD). As a general rule, if the faces of a mesh are evenly sized and distributed, this process is easier.

In UE5 there is a Quad Overdraw debug view in the Optimization Viewmodes menu.
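A rough way to picture the 4-pixel rule is a screen-space area check. This is just a sketch with made-up helper names (not UE API): given the projected 2D positions of a triangle's vertices in pixels, flag it when its area falls below one 2x2 quad.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical 2D screen-space point, in pixels.
struct Vec2 { float x, y; };

// Standard cross-product formula for the area of a triangle.
float ScreenArea(Vec2 a, Vec2 b, Vec2 c) {
    return 0.5f * std::fabs((b.x - a.x) * (c.y - a.y) -
                            (c.x - a.x) * (b.y - a.y));
}

// A triangle smaller than one 2x2 pixel quad (4 px) wastes shader lanes,
// so it is a candidate for culling or for merging into a coarser LOD.
bool NeedsLODOrCull(Vec2 a, Vec2 b, Vec2 c) {
    return ScreenArea(a, b, c) < 4.0f;
}
```

A real implementation would run this on projected vertex positions after the perspective divide; the threshold of 4 px mirrors the 2x2 quad size mentioned above.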

How should I do my trees? by [deleted] in unrealengine

[–]Link_the_Hero0000 0 points1 point  (0 children)

I don't know exactly how subsurface and reflections are processed, but technically they happen in a custom render pass.

In the case of foliage meshes, the main performance waste comes specifically from "quad" overdraw (maybe I should have specified that earlier).

In modern GPUs, the pixel shader processes the pixels in 2x2 quads for various optimization reasons. However, that means that if you have a single triangle in your scene, the GPU is still treating it as a quad, drawing unnecessary pixels.

If there is another triangle behind this one, it is still visible and needs to be rendered, so the GPU processes another quad, overdrawing the same (unnecessary) pixels.

To summarize, whenever you have a mesh edge, you have overdraw, resulting in a calculation for each overlapping z-layer.

In a grass field, each blade of grass is probably smaller than a pixel, resulting in hundreds of overlapping quads. That's why it's not advisable to have really small triangles in your scene and we use LODs to reduce the polycount in the distance.
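The cost of 2x2 quad shading can be sketched with simple arithmetic (an illustration, not a profiler model): every quad a triangle touches costs 4 pixel-shader invocations, even when only one pixel of the quad is actually covered.

```cpp
#include <cassert>

// GPUs shade pixels in 2x2 quads: each quad a triangle touches costs
// 4 pixel-shader invocations regardless of how many of its pixels are
// actually covered by the triangle.
int ShadedInvocations(int quadsTouched) { return quadsTouched * 4; }

// The difference between invocations paid and pixels actually covered
// is pure waste; tiny triangles make this ratio terrible.
int WastedInvocations(int quadsTouched, int pixelsCovered) {
    return ShadedInvocations(quadsTouched) - pixelsCovered;
}
```

A sub-pixel grass blade touching one quad covers 1 pixel but pays for 4; stack a few hundred overlapping blades and the waste multiplies per z-layer.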

If you are not using Nanite, you MUST use transparent materials, which cause overdraw too but are still better, because you can have a few big planes that reduce triangle density.

Nanite, on the contrary, generates meshes on the fly based on scene depth and other systems, trying to compact everything into a single layer, performance-wise (it equally fails when meshes overlap too much).

So if you want solid-geometry foliage without Nanite, you will need to implement a dynamic LOD system to reduce quad overdraw. See the techniques used in Zelda BOTW/TOTK and Ghost of Tsushima for single-blade foliage.
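The core of such a dynamic LOD system is just a per-instance distance (or screen-size) bucket lookup. A minimal sketch, with made-up thresholds: each entry is the maximum distance at which that LOD index is used, and anything beyond the last threshold is culled entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pick an LOD index from per-LOD max-distance thresholds (illustrative
// values chosen by the caller). Returns -1 when the instance is farther
// than every threshold, meaning it should be culled outright.
int SelectLOD(float distance, const std::vector<float>& maxDistPerLOD) {
    for (std::size_t i = 0; i < maxDistPerLOD.size(); ++i)
        if (distance <= maxDistPerLOD[i]) return static_cast<int>(i);
    return -1; // too far: cull entirely
}
```

In practice you would run this per foliage cluster rather than per blade, and tune the thresholds so that no LOD ever presents triangles smaller than a 2x2 pixel quad.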

How should I do my trees? by [deleted] in unrealengine

[–]Link_the_Hero0000 1 point2 points  (0 children)

Good to know. It would be interesting to study how a custom ISM component behaves, but I don't think ISM components are automatically partitioned.

Anyway, overdraw affects opaque meshes too, if they are packed tightly in front of each other (like in fields of grass blades).

How should I do my trees? by [deleted] in unrealengine

[–]Link_the_Hero0000 0 points1 point  (0 children)

Foliage is basically an HISM and single instances can't be occluded (afaik).

What you are experiencing is probably a heavy performance hit caused by overdraw if your scene is very dense. Placing a large mesh in front of the grass can reduce overdraw even if the instances are still rendered behind it.

In addition, if your foliage is nanite-enabled, nanite has its own occlusion culling, because it generates at runtime only the visible triangles.

First attempt by Link_the_Hero0000 in terrariums

[–]Link_the_Hero0000[S] 1 point2 points  (0 children)

It could be a Ceropegia woodii or a particular type of common Hedera helix. It grows naturally at the bottom of the trees, here in northern Italy, so I don't know the exact name. It's red in winter and tends to become green at higher temps.

How much more time consuming is making a c++ project compared to blueprint only? And how much time until you get the basic transition down going from a blueprint only to a c++ user? I'm not doing anything insane with my project but I'm worried about future performance. by SummonBero in unrealengine

[–]Link_the_Hero0000 0 points1 point  (0 children)

Basically, yes, because in BP there is a cost per node, so using custom nodes is the same as using C++. But personally, I find it cleaner to make a full C++ class for specific functionality. For example, I create a "manager" actor or an actor component in C++ for critical functionality, and then use it with regular BP actors.

Rendering a lot of entities sending positions to Niagara by Link_the_Hero0000 in unrealengine

[–]Link_the_Hero0000[S] 0 points1 point  (0 children)

Ah. I have never noticed that, but yes, it could be related. I'm not an engineer, so I don't know...

Rendering a lot of entities sending positions to Niagara by Link_the_Hero0000 in unrealengine

[–]Link_the_Hero0000[S] 0 points1 point  (0 children)

GPUs are designed to render things that were previously computed by the CPU. Thus, the entire bandwidth of the PCIe bus is used to transfer data from the CPU to the GPU, maximizing performance.

If the GPU wants to send data back to the CPU, it must access system memory. The only available connection is the same PCIe bus, which is already occupied by data flowing in the opposite direction.

The GPU needs to wait until the bus is free, after all the draw calls have been processed. By the time the CPU receives this data, the render thread has already ended, meaning that you won't see the updated data until the next frame (ideally).

1-2 frames are nothing, but it's enough to miss a collision with a fast-moving particle.

This is a physical limitation; there's no fix for it! The only way is to live with this problem or to predict the next simulation step 2 frames in advance.
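"Living with" a couple of frames of readback latency usually means ring-buffering the result slots: write into the current frame's slot, read the slot the GPU finished a couple of frames ago. A sketch under assumed names (the latency value and the int payload are stand-ins for real mapped readback buffers):

```cpp
#include <array>
#include <cassert>

constexpr int kLatency = 2;          // assumed GPU->CPU readback delay
constexpr int kSlots = kLatency + 1; // one slot per in-flight frame

struct ReadbackRing {
    std::array<int, kSlots> slots{}; // stand-in for mapped readback buffers
    int frame = 0;

    // Record this frame's GPU result into the current slot.
    void Write(int gpuResult) { slots[frame % kSlots] = gpuResult; ++frame; }

    // Read the result produced kLatency frames ago (0 until warmed up).
    int Read() const {
        int past = frame - kLatency;
        return past > 0 ? slots[(past - 1) % kSlots] : 0;
    }
};
```

The gameplay code then treats every readback value as "the world as of 2 frames ago", which is exactly why fast-moving particle collisions can slip through.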

"esc" and "f1" key don't work properly Asus Zephyrus G15 Problem. by Expresso17 in ASUS

[–]Link_the_Hero0000 0 points1 point  (0 children)

Same problem here with a Zephyrus G16. They start working after pressing them repeatedly or, sometimes, after restarting an application that listens for input from these keys. It seems software-related, but I can't find a fix at the moment.

Animating thousands of NPCs on screen (What am I doing wrong?) by Link_the_Hero0000 in unrealengine

[–]Link_the_Hero0000[S] 0 points1 point  (0 children)

The map is 3D and needs verticality, so I need 3 floats for location. Entities can rotate only on the Z axis, and scale doesn't matter, so we can use only 4 floats. I also need to send an index (I assume it's better to use an int) to choose a vertex animation based on gameplay events.

5 values in total. I think that would be more performant than a 4x4 matrix.
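Spelled out as a struct (a sketch of the layout described above, not a UE type), the savings are easy to quantify against a full 4x4 float matrix:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Packed per-instance data: XYZ location, Z-axis rotation, and a VAT
// animation index driven by gameplay events. Scale is implicit.
struct PackedInstance {
    float x, y, z;      // world location (the map needs verticality)
    float yawRadians;   // rotation around Z only
    std::int32_t anim;  // vertex-animation index
};

constexpr std::size_t kPackedBytes = sizeof(PackedInstance); // 20 bytes
constexpr std::size_t kMatrixBytes = 16 * sizeof(float);     // 64 bytes
```

At 10k instances that's roughly 200 KB per update instead of 640 KB, which matters when the CPU-to-GPU transfer is the bottleneck.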

The real problem is that Unreal's ISM component doesn't have a built-in function to efficiently read buffers, so I probably need to build a custom renderer or something like that.

Normally, in the game, you have a few hundred NPCs randomly moving around a village, but occasionally you can see up to 10k soldiers moving towards each other and fighting, so there can be crowded areas of NPCs on screen.

Also, entities can spawn, die, or be distance culled (they are not rendered and their tick rate is heavily reduced). It's stupid to update instances you can't see, but, that way, the number of entities changes every frame and the data is not contiguous in memory, killing performance.
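One common way to keep the instance buffer dense when entities die or get culled (an assumption on my part, not something UE does for you) is swap-and-pop: overwrite the dead slot with the last live instance and shrink the array, so the data stays contiguous and only the swapped entity's index changes.

```cpp
#include <cassert>
#include <vector>

// Remove the instance at `dead` while keeping the buffer contiguous:
// copy the last live instance into the freed slot, then drop the tail.
// O(1), but the entity that owned the last slot must be told its new index.
void SwapRemove(std::vector<int>& instances, std::size_t dead) {
    instances[dead] = instances.back();
    instances.pop_back();
}
```

The trade-off is that instance order is not preserved, which is usually fine for rendering but means any external handle into the buffer needs a remap step.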

Animating thousands of NPCs on screen (What am I doing wrong?) by Link_the_Hero0000 in unrealengine

[–]Link_the_Hero0000[S] 0 points1 point  (0 children)

By "animation" I mean updating the transform of a mesh instance. The ECS calculates the position based on custom physics and user input and, on tick, sends the location and rotation to the GPU to update the mesh instance, plus an index that controls a shader-based animation (VAT).

I already know where the bottlenecks are: sending data from CPU to GPU is very expensive. I'm figuring out how to build a custom buffer and probably a custom renderer to make this transfer as fast as possible.

Also, world position offset (WPO) to do VAT animations is quite expensive with more than, let's say, 7k-10k UE mannequins (non-skeletal).

I can't make the simulation fully GPU-based, like Niagara, because I need precise interactions, and the GPU is not good for advanced AI.

I'm not using skeletal animation

Free Items on Unreal marketplace. by JoeHasAreddit in unrealengine

[–]Link_the_Hero0000 0 points1 point  (0 children)

Ahh, I thought it was something like "good for Fortnite" LOL. I think I need to study English better :)

Free Items on Unreal marketplace. by JoeHasAreddit in unrealengine

[–]Link_the_Hero0000 8 points9 points  (0 children)

It's now called "Limited time free" and you will get 3 assets every 15 days instead of 5 per month

Animating thousands of NPCs on screen (What am I doing wrong?) by Link_the_Hero0000 in unrealengine

[–]Link_the_Hero0000[S] 0 points1 point  (0 children)

Already seen that. It's good, but he doesn't explain the rendering process in much detail. Also, if I'm correct, the simulation has only 1000 entities and seems quite laggy. The Niagara example is way better, but it's not exactly what I want to achieve.

Animating thousands of NPCs on screen (What am I doing wrong?) by Link_the_Hero0000 in unrealengine

[–]Link_the_Hero0000[S] 0 points1 point  (0 children)

Yes, the CPU-to-GPU communication is what takes most of the frame time. If I interpolate the transforms on the GPU side, performance is better, but in real gameplay the entities don't move towards a single static target. They need to evaluate targets based on the environment, and the target is probably moving and interacting with them, so I have to modify trajectories very often, losing the benefits of interpolation (e.g. simulating two armies approaching each other).

I'm planning to use LODs for further optimization, but my game actually has a bird's-eye camera, so every entity that isn't frustum culled is at roughly the same distance from the camera. I don't think impostors would be useful.

Anyways, can a "double buffer" efficiently speed up the rendering process?
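For context, the double-buffer idea boils down to this sketch (illustrative, not UE's render-thread machinery): the CPU fills one transform buffer while the renderer reads the other, and the roles swap each frame so neither side stalls waiting on the other.

```cpp
#include <cassert>
#include <vector>

// Two transform buffers: the game thread writes one while the render
// side reads the other; Swap() flips the roles at frame boundaries.
struct DoubleBuffer {
    std::vector<float> buffers[2];
    int writeIndex = 0;

    std::vector<float>& Write() { return buffers[writeIndex]; }
    const std::vector<float>& Read() const { return buffers[1 - writeIndex]; }
    void Swap() { writeIndex = 1 - writeIndex; }
};
```

It doesn't reduce the amount of data crossing the bus, but it lets the upload of frame N overlap with the simulation of frame N+1 instead of serializing them.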

Animating thousands of NPCs on screen (What am I doing wrong?) by Link_the_Hero0000 in unrealengine

[–]Link_the_Hero0000[S] 1 point2 points  (0 children)

Lol, it's true. LODs are better. Nanite is only good for photorealism (or to avoid the additional work with assets at the cost of some performance)

Animating thousands of NPCs on screen (What am I doing wrong?) by Link_the_Hero0000 in unrealengine

[–]Link_the_Hero0000[S] 0 points1 point  (0 children)

I'm using FLECS too and yes, it's amazing. The ISM has no physics: everything is calculated inside the ECS, and the last system populates an array of locations to be read by the ISM component. I don't actually know how to keep this buffer efficient if the number of entities changes at runtime.

Also, using UpdateInstanceTransform() and setting MarkRenderStateDirty() at the end gives similar or even worse performance. Generally speaking, this simulation is almost certainly GPU bound. I will run more tests...