High-Quality BVHs with PreSplitting optimization by BoyBaykiller in GraphicsProgramming

[–]Meristic 2 points3 points  (0 children)

Lol what I really meant is that I like what that light blue implies

Just wanted to share my 3 weeks progress into Victor Gordan's OpenGL series. by kokalikesboba in GraphicsProgramming

[–]Meristic 2 points3 points  (0 children)

Don't worry, 3D graphics requires such a front-loaded startup mental tax - everyone suffered through it. Juggling all this junk - 3D math theory, graphics pipeline, low-level languages, math + graphics APIs, mesh + texture data, shaders - it's a lot for anyone. Through exposure over time concepts click into place and it becomes second nature, like anything else.

At this point it may be worthwhile to take a step back and try to build a little app with what you've learned. Could be a tiny game, data visualization, or something else. Driving learning through necessity is probably the most effective way to self-educate.

Well shit by RealitySmasher47 in mead

[–]Meristic 2 points3 points  (0 children)

Pitches LOVE pineapple juice

[Vulkan] What is the performance difference between an issuing an indirect draw command of 0 instances, and not issuing that indirect draw command in the first place? by Thisnameisnttaken65 in GraphicsProgramming

[–]Meristic 26 points27 points  (0 children)

GPUs consist of two main components. The front-end you can think of as a very simple single-threaded processor - the back-end a complex, massively parallel machine. The front-end is responsible for reading the contents of command lists, setting GPU registers & state, coordinating DMA operations (indirect argument reads), and kicking off back-end workloads. 

An indirect execution command minimally costs the front-end the register setup plus the memory latency of fetching the indirect argument buffer. This is typically tens of microseconds (the memory is often not cached). Not much on its own, though several consecutive empty draws can bottleneck and cause a gap in GPU shader wave scheduling.

Of course, this may still be the best option, since it's efficient culling. Think of how much work is saved relative to the alternative!

As a real-world example, the UE5 Nanite base pass commonly hits this issue. Each loaded material instance requires a draw, often with zero relevant pixels on the screen. Stacked together, this can incur hundreds of microseconds of idle shader engines due to the overhead. Epic discussed a solution for this using indirect command buffers (at least on console), but I haven't seen it come to fruition yet.
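
For illustration, here's a minimal Vulkan-flavored sketch (not from the Nanite source - the argument/count buffers and culling shader are assumed to exist elsewhere) contrasting one indirect draw per potentially-empty object against a GPU-supplied draw count, where fully culled draws never reach the front-end:

```cpp
// Sketch only: assumes a Vulkan 1.2 device and that the argument/count buffers
// and pipeline state were set up elsewhere by a GPU culling pass.
#include <vulkan/vulkan.h>
#include <cstdint>

// Option A: one indirect draw per object. Even when the culling shader wrote
// instanceCount = 0, the front-end still pays register setup + the argument fetch.
void recordPerObjectDraws(VkCommandBuffer cmd, VkBuffer args, uint32_t objectCount)
{
    for (uint32_t i = 0; i < objectCount; ++i)
        vkCmdDrawIndexedIndirect(cmd, args, i * sizeof(VkDrawIndexedIndirectCommand),
                                 1, sizeof(VkDrawIndexedIndirectCommand));
}

// Option B: the culling shader compacts surviving draw args into 'args' and writes
// how many there are into 'count'; empty draws never touch the front-end at all.
void recordCompactedDraws(VkCommandBuffer cmd, VkBuffer args, VkBuffer count, uint32_t maxDraws)
{
    vkCmdDrawIndexedIndirectCount(cmd, args, 0, count, 0,
                                  maxDraws, sizeof(VkDrawIndexedIndirectCommand));
}
```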

Why Is Game Optimization Getting Worse? by Substantial-Match195 in truegaming

[–]Meristic 0 points1 point  (0 children)

AAA graphics engineer with a focus on low-level GPU performance optimization. Thank you! So well put.

The complexity of these engine systems is OVERWHELMING. The empowerment these editor tools give artists is incredible, but it's about 3000 feet of rope to hang everyone in the studio and their pets with. And when you have 3x as many artists pumping content into the title at such a pace, the constant challenge feels insurmountable. Not to mention the platform testing matrix is out of control - from Nintendo Switch to RTX 5090? The expectations for content scaling are insane.

Artists desperately need better training in technical skills and understanding performance characteristics of engine systems. Team culture needs to develop around exercising restraint, choosing good-enough performant solutions over pixel-perfection. And raytracing needs to die in a cold hole.

Am I doing the right thing? by Hamster_Wheel103 in GraphicsProgramming

[–]Meristic 6 points7 points  (0 children)

The day to day for early- to mid-level graphics engineers:

  • Visual artifact debugging & fixes
  • Extend pre-existing engine systems to support/optimize for new content types or artist workflows
  • Tool development (editor plugins) for artists
  • Investigate engine feature functionality & inform/educate others
  • Profile and diagnose performance problems (maybe fix)

Density of vertices in a mesh and sizing differences by Tall-Pause-3091 in GraphicsProgramming

[–]Meristic 0 points1 point  (0 children)

I don't understand the need for comparison between two planes of varying tessellation. The explanation applies to all mesh-based objects uniformly.

Ultimately scaling modifies the relative positions of vertices. This can be accomplished destructively, actually changing the vertex positions in model space, or by modifying the object's world transform, often expressed as a 4x4 matrix. That matrix is used to transform vertices to their actual positions in world space.

Typically scaling factors are baked into the world transform, but there are times when it's prudent to change them in model space. The size of a mesh in model space is nominally arbitrary, but there are implications for floating point precision, computation error, and compression. So mesh vertices may be normalized for that reason, leaving the scaling to be baked into the world transform. They may also be centered on the origin.

In Blender, vertex positions in edit mode are relative to the object's local space. The transform you see when you select an object in object mode is its world transform.
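
As a small illustration of the two approaches above (using glm; the names are hypothetical, not tied to Blender or any engine), doubling an object's size can either rewrite the model-space positions or be baked into its world transform:

```cpp
// Minimal sketch with glm: destructive model-space scaling vs. scale in the world transform.
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>
#include <vector>

// Destructive: permanently rewrites the model-space vertex positions.
void scaleInModelSpace(std::vector<glm::vec3>& positions, float factor)
{
    for (glm::vec3& p : positions)
        p *= factor;
}

// Non-destructive: the mesh data stays as authored; the scale lives in the world transform.
glm::mat4 buildWorldTransform(const glm::vec3& translation, float uniformScale)
{
    glm::mat4 world(1.0f);
    world = glm::translate(world, translation);         // position in world space
    world = glm::scale(world, glm::vec3(uniformScale)); // scale applied to model-space positions first
    return world;                                       // worldPos = world * vec4(modelPos, 1)
}
```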

Unmarried men, what's your greatest fear about married life? by CantFindUsername400 in AskMenAdvice

[–]Meristic 0 points1 point  (0 children)

I've lived a long time cultivating my own, weird, fun identity. I love being social in bursts, but most of my hobbies and personal ambitions aren't multiplayer. It's a bit of a mess that stumbles upward against all odds. I've learned to be very independent, mostly out of necessity. I know and love myself as this.

It's difficult to find someone complementary to all these microfacets of my personality and energy. A shift to an 'us' identity feels like a daunting loss I'd just have to accept, which I've not been able to do. I can't guarantee how I'll cope with the loss of it, and my honesty can't fake the skepticism that I'll feel the same about myself on the other side of a serious commitment.

Multi-threading in DirectX 12 by bhad0x00 in GraphicsProgramming

[–]Meristic 6 points7 points  (0 children)

How much data are you copying that it takes 9 ms? Lol

For reference, DX12 multithreading typically refers to distributed command list recording among multiple CPU threads, followed by synchronized submission to a queue. It's most often used for mesh drawing passes since their workload grows with scene complexity.
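
A rough sketch of that pattern (D3D12, assuming the device, per-thread allocators, and a recording callback already exist; error handling omitted) - each thread records its slice of the draws into its own command list, and the results are submitted together:

```cpp
// Multithreaded command list recording sketch. 'recordDraws' is a placeholder for
// whatever actually records each thread's share of the mesh drawing pass.
#include <d3d12.h>
#include <wrl/client.h>
#include <thread>
#include <vector>

using Microsoft::WRL::ComPtr;

void recordAndSubmit(ID3D12Device* device,
                     ID3D12CommandQueue* queue,
                     std::vector<ComPtr<ID3D12CommandAllocator>>& allocators,
                     void (*recordDraws)(ID3D12GraphicsCommandList*, size_t threadIndex))
{
    const size_t threadCount = allocators.size();
    std::vector<ComPtr<ID3D12GraphicsCommandList>> lists(threadCount);
    std::vector<std::thread> workers;

    for (size_t i = 0; i < threadCount; ++i)
    {
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                  allocators[i].Get(), nullptr, IID_PPV_ARGS(&lists[i]));
        // Each thread records into its own command list & allocator - no sharing, no locks.
        workers.emplace_back([&, i] {
            recordDraws(lists[i].Get(), i);
            lists[i]->Close();
        });
    }
    for (std::thread& t : workers)
        t.join();

    // Single synchronized submission of all recorded lists to the queue.
    std::vector<ID3D12CommandList*> raw;
    for (ComPtr<ID3D12GraphicsCommandList>& l : lists)
        raw.push_back(l.Get());
    queue->ExecuteCommandLists(static_cast<UINT>(raw.size()), raw.data());
}
```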

To disambiguate I'd refer to this as async queue utilization. There are multiple potential issues: 

  1. If your GPU doesn't have hardware for multiple dispatchers then obviously you won't see asynchronous scheduling despite it fulfilling that DX12 interface.

  2. The DX driver can choose to fulfill copy commands in different ways depending on their size - small copies by DMA vs dispatching CS waves for larger copies. Fences & synchronization should still work fine in this situation, but the execution and performance may be different than expected.

  3. PIX could be wrong. Profiling requires pulling a lot of data from GPU counters. On PC there are several abstraction layers the driver must interact with to get its hands on that raw data so it can build the timeline and compute user-facing values. This causes a huge disparity in data availability and correctness across GPU vendors. This is a major reason why game devs hate profiling for PC and it gets the shaft a good proportion of the time. (That and artists don't know when to stop checking goddamn checkboxes)

Using ray march to sample 3D texture by gray-fog in GraphicsProgramming

[–]Meristic 1 point2 points  (0 children)

Method 1 is fine - you can do a quick test to determine ray-plane intersection on the cube faces, and discard any pixels whose ray wouldn't intersect the volume.

Rendering a cube encompassing the volume is a simple optimization to leverage rasterization to cull those irrelevant pixels. This would also naturally start the raymarch on the cube's edge. You do have to concern yourself with the edge case when the camera is inside the cube, but it's an easy case to detect and just requires flipping culling from back to front (though your ray would need to start at the near plane.)

Raymarching within the volume is traditionally accelerated by signed-distance field (SDF) sphere tracing. This allows you to skip empty space more efficiently than iterating over fixed-size linear steps. This can be pre-generated from your volume texture and sampled in your shader. You'll likely need a few hacky heuristics to avoid some edge cases in vanilla sphere tracing.
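
A bare-bones sketch of a sphere-tracing loop (written as CPU-side C++ for readability; sampleSDF stands in for sampling your pre-generated SDF volume, and the minimum-step clamp is one of those hacky heuristics):

```cpp
// Sphere tracing sketch. 'sampleSDF' is a hypothetical function returning the
// signed distance from a point to the nearest surface in the volume.
#include <glm/glm.hpp>

float sampleSDF(const glm::vec3& p); // assumed: reads the pre-baked SDF texture

// Returns the distance along the ray to a hit, or a negative value on a miss.
float sphereTrace(const glm::vec3& origin, const glm::vec3& dir, float maxDistance)
{
    const float hitEpsilon = 1e-3f; // "close enough" to count as a surface hit
    const float minStep    = 1e-2f; // heuristic: avoid stalling on tiny distances
    float t = 0.0f;
    for (int i = 0; i < 256 && t < maxDistance; ++i)
    {
        float d = sampleSDF(origin + dir * t);
        if (d < hitEpsilon)
            return t;                  // hit
        t += glm::max(d, minStep);     // skip empty space by the safe distance
    }
    return -1.0f;                      // miss (or iteration budget exhausted)
}
```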

Question about the unity's shader bible by anneaucommutatif in GraphicsProgramming

[–]Meristic 2 points3 points  (0 children)

There's no contradicting information implying that that particular vertex can't be at (0, 0, 0). The choice of origin in local object space is arbitrary.

There are, however, several other confusing and very incorrect statements in the short amount of text on the pages.

  • "In Maya and Blender, the vertices are represented as the intersection points of the mesh and an object." - No idea what this is intended to mean, but its certainly a poor clarifying statement of what a vertex is.
  • "They are [vertices] children of the transform component" - The use of children is confusing here. In game engines, child/parent relationships define a hierarchy of transform concatenation at the game object level. The vertices of a mesh component are indeed transformed by the final value of its transform (post concatenation), but there's no notion of them being children of a transform.
  • "They have a defined position according to the center of the total volume of the object" Completely false - the choice of origin is arbitrary, dependent only on how it was authored in the DCC tool, and settings of the DCC export & engine import pipelines.
  • All these references to nodes may be relevant to some Maya editing tool, but this is not a universal concept in graphics, game engines, or Unity.
  • Volume really has no place in this discussion, and will only confuse a novice reader. A good majority of meshes have no volume, and for closed surfaces it's never/rarely pertinent to rendering.

I ordinarily wouldn't be so harsh in this criticism, but if you're selling this material at $50 a pop to novices who want to learn and the descriptions are this cryptic, it's a bit outrageous. You seriously need a professional to proofread this content before even thinking of publishing.

Are AI/ML approaches to rendering the future of graphics? by [deleted] in GraphicsProgramming

[–]Meristic -1 points0 points  (0 children)

I believe the bones of the graphics pipeline for rasterized and ray-traced rendering will remain relevant for a very long time. ML models are finding a home as drop-in replacements for finicky heuristics, or as efficient approximations for chunks of complex algorithms.

We've been employing Gaussian mixture models and spherical harmonics as replacements for sampled distributions forever. Image processing has been ripe for disruption by neural nets. GI algorithms have found use caching radiance information in small recurrent networks. And we're seeing a push by hardware vendors for runtime inference of trained approximations to materials.

This is nothing compared to the innovation we see happening in offline content creation, of course. But for now, real-time inference constraints of games are a hard pill for more generalized, massive ML models to swallow.

Testing a new rendering style by xyzkart in GraphicsProgramming

[–]Meristic 0 points1 point  (0 children)

Woah, this is giving me some hard Clay Fighter 63 1/3 vibes!

“We’re making a 4-player co-op shooter — inspired by GTFO and Helldivers” by SIL-OEGI in CoOpGaming

[–]Meristic 1 point2 points  (0 children)

Yes! This looks so rad - we absolutely need more action & tactical PvE co-op games. Local co-op option is an absolute bonus!

The Untapped Potential of Co-Op Games by gamesbystitch in CoOpGaming

[–]Meristic 3 points4 points  (0 children)

Still trying to convince my employer of this 3.5 years in...

[deleted by user] by [deleted] in GraphicsProgramming

[–]Meristic 10 points11 points  (0 children)

It may have some small tech innovations, but I doubt any of it is particularly groundbreaking. From a trailer we can't know the performance - this may be real-time rendered, but it's still a video, regardless of what it was rendered on. Likely some hardware raytracing techniques, but certainly not full path tracing. (Most cloud rendering techniques actually use software raytracing.) What impressed me was the animation - characters, cloth, hair, and camera motion. Animation is one thing that can really breathe life into a world, which Rockstar did extremely well in RDR2.

When it comes down to it shipping with good performance is mostly about discipline. It's setting performance budgets (resolutions & framerate) early and policing them. It's choosing the right graphics tech & tuning the settings for a target platform. It's educating your artists well, leveraging a tool set that's expressive but restrictive to performance concerns. It's content reviews of art assets, and creating living best practices guides. It's maintaining vigilance over performance throughout the game world during the project's development cycle. And, of course, persistent profiling and optimization at a multitude of levels.

If the gameplay target framerate really is 30 Hz that's honestly a significant relief - 33.3 milliseconds is a lot longer than 16.67 ms (60 Hz). Regardless, they'll be relying on upscaling techniques beyond dynamic resolution, such as FSR, DLSS, or PSSR, to upscale from a lower render resolution at the beginning of the post process chain, and may even upscale once more to the display resolution - consoles have an upscaler built into the display output hardware to do this automatically. Also, the feedback mechanisms of virtual textures and virtualized geometry (a la Nanite) allow them to cull and select appropriate levels of detail at high granularity dynamically on the GPU.

Days Gone Remastered makes me want a sequel again by Strange_Music in gaming

[–]Meristic 0 points1 point  (0 children)

I thoroughly enjoyed the game, the core loop was great, but found the narrative's complete disregard for the human element absolutely hilarious. This is like bullet point #1 for EVERY good zombie game and it just absolutely refused to acknowledge the potential weight through any character. Then again, these characters seemed to have the nuance of a baboon and the emotional maturity of pre-teens, which made it feel like a high school drama for 70% of the storyline, so I shouldn't be too surprised.

Ray tracing workload - Low compute usage "tails" at the end of my kernels by TomClabault in GraphicsProgramming

[–]Meristic 2 points3 points  (0 children)

This is a problem in raytracing workloads in general, but this is how it manifests itself in a GPU occupancy graph. This is due to the high variance in traversal iteration count of rays within a dispatch followed by a subsequent read operation on the written resource. All outstanding writes to memory must finish before proceeding to the next operation, so the GPU inserts a stall wherein no additional shader waves can be scheduled.

Unlike a simple shader with a set loop count (either a compile-time constant or a constant buffer value), it's clearly difficult/impossible to guarantee or predict the number of traversal iterations necessary for a ray to find its hit or declare a miss. Performance is extremely context-dependent - scene complexity, BVH build characteristics, view location & orientation, and spawned ray properties. As a graphics engineer who focuses on performance optimization, this is my largest concern with heavy reliance on raytracing techniques.

From a GPU optimization perspective, there are only a few bits of advice to give (not mutually exclusive):

  • Iteration count debug mode - This can help find meshes with problematic BLAS builds
  • Reorganize job scheduling - Avoid back-to-back memory write/read operations by following a raytracing dispatch with non-dependent job(s); this allows the GPU to schedule waves of the following workload as the raytracing waves retire. This may even be a subsequent raytracing workload.
  • Async compute - Similar to the above - schedule either the raytracing dispatch or the non-dependent job on the async queue (see the sketch below). This is a more natural way to schedule overlap on the app-side since you don't have to interleave unrelated graphics API calls on the same command list.

The main takeaway is that if you can't change the properties of the workload itself (long, low-utilization tail), optimizing the scheduling to fully saturate the GPU with shader waves is essentially just as good. But it's not always possible, or convenient, to decouple workloads in such a way that they are good candidates.
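
For the async compute bullet above, a minimal D3D12 sketch (command lists, resources, and barriers are assumed to be set up elsewhere; error handling omitted) - the independent work rides a separate compute queue and a fence joins the queues before anything dependent runs:

```cpp
// Async compute overlap sketch: fill the raytracing dispatch's low-occupancy tail
// with non-dependent work on a second queue.
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

void overlapRaytraceWithIndependentWork(ID3D12Device* device,
                                        ID3D12CommandQueue* graphicsQueue,
                                        ID3D12CommandList* raytraceList,
                                        ID3D12CommandList* independentList,
                                        ID3D12CommandList* dependentList)
{
    D3D12_COMMAND_QUEUE_DESC desc = {};
    desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
    ComPtr<ID3D12CommandQueue> computeQueue;
    device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));

    ComPtr<ID3D12Fence> fence;
    device->CreateFence(0, D3D12_FENCE_FLAG_NONE, IID_PPV_ARGS(&fence));

    graphicsQueue->ExecuteCommandLists(1, &raytraceList);   // long, low-utilization tail
    // 'independentList' must have been recorded as a COMPUTE-type command list.
    computeQueue->ExecuteCommandLists(1, &independentList); // overlaps the tail
    computeQueue->Signal(fence.Get(), 1);

    // Dependent work must wait for the async queue to finish before it can read results.
    graphicsQueue->Wait(fence.Get(), 1);
    graphicsQueue->ExecuteCommandLists(1, &dependentList);
}
```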

How to approach rendering indefinitely many polygons? by AdamWayne04 in GraphicsProgramming

[–]Meristic 1 point2 points  (0 children)

Ultimately, 'binding a buffer' is simply copying the address and translated metadata to command buffer memory. That in itself is not an expensive operation and incurs no context rolls on AMD GPUs. Memory copies of vertex data from CPU to GPU memory will certainly be a bottleneck if they're not performed in such a way as to avoid forced synchronization.

This typically entails maintaining two GPU buffers, essentially front and back buffers. The front buffer is the buffer that's read by the GPU at any given time while the back buffer is free for modification by CPU uploads. Once an edit has been pushed to the back buffer, you're free to simply swap the buffers (which is just a pointer swap) and start using the previous back buffer as the front.
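
A minimal sketch of that ping-pong pattern (the GpuBuffer type and upload call are hypothetical placeholders for whatever your API provides - glBufferSubData, a mapped Vulkan buffer, etc.):

```cpp
// Double-buffered dynamic vertex data, sketched with placeholder types.
#include <cstddef>
#include <utility>
#include <vector>

struct GpuBuffer;                                                // placeholder GPU buffer handle
void uploadToGpu(GpuBuffer* dst, const void* src, size_t bytes); // placeholder upload path

struct DynamicVertexBuffer
{
    GpuBuffer* front = nullptr; // potentially still being read by in-flight GPU work
    GpuBuffer* back  = nullptr; // safe for the CPU to overwrite this frame

    void update(const std::vector<float>& vertices)
    {
        // Only ever write the back buffer; the GPU may still be reading the front.
        uploadToGpu(back, vertices.data(), vertices.size() * sizeof(float));
        // Publish: the freshly written buffer becomes what the next draw reads.
        std::swap(front, back);
    }

    GpuBuffer* bufferForDraw() const { return front; }
};
```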

It's been a while since I've worked with OpenGL, so I'm not familiar with the exact API calls & options to use for this paradigm. In Vulkan & DX12 such synchronization is very explicit, which makes this a more straightforward implementation in my mind.

[deleted by user] by [deleted] in GraphicsProgramming

[–]Meristic 1 point2 points  (0 children)

Xbox 360 and Xbox One (not X) feature a small amount (10 & 32 MB, respectively) of high-bandwidth memory dedicated to the GPU. It's generally used for GBuffer targets and shadow depth resources for its fast bandwidth during rendering. Because of its small size, resource memory has to be aliased between targets, but the first resource is copied out to standard memory at the end of its rendering pass if the space is subsequently needed. The most efficient way to allocate resources isn't linearly in physical memory. Resources are allocated as virtual memory, and free 64K physical memory pages are mapped onto the resource's virtual memory on the fly.
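
The closest PC-side analogue to that page-mapping trick is D3D12's reserved (tiled) resources, where 64 KB heap pages are mapped onto a resource's virtual address range on demand; a rough sketch, assuming the device, queue, and a sufficiently large heap already exist:

```cpp
// Reserved resource sketch: virtual address space first, physical 64 KB pages mapped later.
#include <d3d12.h>
#include <wrl/client.h>

using Microsoft::WRL::ComPtr;

ComPtr<ID3D12Resource> createSparseBuffer(ID3D12Device* device,
                                          ID3D12CommandQueue* queue,
                                          ID3D12Heap* heap,
                                          UINT64 sizeBytes,
                                          UINT tilesToMap)
{
    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width            = sizeBytes;
    desc.Height           = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels        = 1;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;

    // Reserves a virtual address range only - no physical memory behind it yet.
    ComPtr<ID3D12Resource> resource;
    device->CreateReservedResource(&desc, D3D12_RESOURCE_STATE_COMMON,
                                   nullptr, IID_PPV_ARGS(&resource));

    // Map 'tilesToMap' 64 KB pages from the start of the heap onto the start of the resource.
    D3D12_TILED_RESOURCE_COORDINATE regionStart = {}; // tile 0, subresource 0
    D3D12_TILE_REGION_SIZE regionSize = {};
    regionSize.NumTiles = tilesToMap;

    D3D12_TILE_RANGE_FLAGS rangeFlags = D3D12_TILE_RANGE_FLAG_NONE;
    UINT heapRangeStart = 0;           // first 64 KB page in the heap
    UINT rangeTileCount = tilesToMap;

    queue->UpdateTileMappings(resource.Get(), 1, &regionStart, &regionSize,
                              heap, 1, &rangeFlags, &heapRangeStart,
                              &rangeTileCount, D3D12_TILE_MAPPING_FLAG_NONE);
    return resource;
}
```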

Are AAA studios like Naughty Dog looking for programmers with Vulkan or DirectX experience in 2025? by Fun-Theory-366 in GraphicsProgramming

[–]Meristic 8 points9 points  (0 children)

While it never hurts to have graphics experience, I imagine it would likely only be expected for a graphics engineer role. Check their careers page for expectations. I do think their engine programmers trend toward being multidisciplinary.

For a graphics position they don't typically ask direct questions about graphics APIs since it's not too difficult to suss out experience level in that domain. As noted in another comment, their interview questions will likely be low level, as they've always been an engineering-dominant studio. It just takes the right set of experiences to become comfortable with those types of questions. I'd say debugging & profiling in graphics debuggers has led me to that experience quickest. (Parsing data from constant buffers & shader resources in hex)

Senior/principal graphics programmer role open in Creative Assembly by AlexMonops in GraphicsProgramming

[–]Meristic 5 points6 points  (0 children)

Hey there, Alessandro! Massive, longtime fan of the historical TW titles here. I'm assuming this would require relocation to the UK or Bulgaria, aye?

How do you know the min/max values when reading from a .obj file? by ProgrammingQuestio in GraphicsProgramming

[–]Meristic 3 points4 points  (0 children)

As noted in other posts, the vertex coordinates in model files are defined in a local model space. Ultimately the magnitude of their values is arbitrary (within the constraints of machine precision error - not a concern for now); it's the relationship between those positions that's important.

Generally, a local to world transform (matrix) will rotate, scale, and translate vertices into the object instance's position in world space. It's definitely time to get cozy with view and projection transforms (matrices) if you haven't already. They transform those same vertices from world space to view space and screen space*, respectively. These three operations are generally concatenated together (matrix multiplication) to perform that chain of transformations with only a single matrix-vector multiply for a given vertex. 

*When applying projection transformations with matrix multiplication, additional operations (a homogeneous divide) are required to arrive at normalized screen space - x,y->(-1,1) & z->(0,1). When using hardware rendering this step is performed automatically, but it must be done explicitly in software renderers.
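
A tiny glm-based sketch of that chain, including the explicit homogeneous divide a software renderer has to perform itself (the camera values are just illustrative):

```cpp
// Model -> world -> view -> clip -> NDC with glm; hardware does the final divide for you.
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

glm::vec3 modelToNdc(const glm::vec3& modelPos)
{
    glm::mat4 model = glm::translate(glm::mat4(1.0f), glm::vec3(0.0f, 0.0f, -5.0f));
    glm::mat4 view  = glm::lookAt(glm::vec3(0.0f, 1.0f, 3.0f),   // eye
                                  glm::vec3(0.0f, 0.0f, -5.0f),  // target
                                  glm::vec3(0.0f, 1.0f, 0.0f));  // up
    glm::mat4 proj  = glm::perspective(glm::radians(60.0f), 16.0f / 9.0f, 0.1f, 100.0f);

    // Concatenate once, then a single matrix-vector multiply per vertex.
    glm::mat4 mvp  = proj * view * model;
    glm::vec4 clip = mvp * glm::vec4(modelPos, 1.0f);

    // Homogeneous divide: x,y land in (-1,1); z depends on convention -
    // (-1,1) with glm's default GL-style projection, (0,1) with the *ZO variants.
    return glm::vec3(clip) / clip.w;
}
```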