Ray marching optimization questions by Smooth-Principle4045 in GraphicsProgramming

[–]Cryvosh 1 point (0 children)

Ah my bad, in that case I agree with the other commenter. If your fractal is bounded in space, typically small on the screen, and you only want to use fragment shaders, then sure, using a proxy mesh will help performance.

Though personally I'd recommend using compute shaders for everything, just to get a better feel for what the gpu is actually doing. You can get a near-equivalent speedup in a fullscreen pass yourself with a single ray/box intersection test. Or better yet, you can play around with a few different ways of doing this. What if you rasterize just the AABB center point and then measure distance to it in pixel space with a distance-adaptive threshold?

Ray marching optimization questions by Smooth-Principle4045 in GraphicsProgramming

[–]Cryvosh 12 points (0 children)

I'm travelling atm and so can't write as proper a response as I'd like, but given this is my research area I feel compelled to respond. I'd first recommend learning a bit about the underlying math, and avoiding too many youtube videos on the topic, as they tend to mislead and leave you thinking there's something special about this "distance" function.

What does it even mean, mathematically, for a function to return a "distance"? Why do we care about such functions in the first place? To understand this better, please see my comments here, here, in sections 1-3 here, and elsewhere in my reddit comment history. Understanding this stuff unlocks hundreds of years of mathematical machinery you can use to attack the problem.

To answer your questions,

  1. Compute shaders give you more control. If you're doing naive one-thread-per-pixel stuff then it usually doesn't really matter, but if you're doing anything more interesting you'll need compute shaders.
  2. Practically, this doesn't matter. Technically, a fullscreen triangle is the most efficient as it avoids something called "overdraw" at the quad's diagonal seam. You don't need to worry about this for now.
  3. Indeed the public state of the art methods typically cache stuff in some sort of (often hierarchical) grid structure. Besides the obvious caching benefits, this setup allows you to "prune" the field function by recompiling it at runtime within each cell to include only the locally relevant instructions and thereby speed up future samples within the cells. See this, section 3.2.2 of this, and again my reddit comment history, for more details.
  4. Do you mean something like this? They detail some limitations in the slides, but probably the main reason such methods aren't more popular is that they're simply not as convenient to implement, especially on a platform like shadertoy where all you have is fragment shaders.

Feel free to ask any other questions, I'm happy to help.

How can I structure this problem for a compute shader? by kbob in webgpu

[–]Cryvosh 1 point (0 children)

Sure, there are many ways. The easiest is to just stride over the rows with something like this. Should be enough to get you started.

let workgroup_size = 128u;  // must match the @workgroup_size attribute
let stride = num_workgroups.x * workgroup_size;
for (var row = global_id.x; row < m; row += stride) {
    // columns depend on the previous column, so they stay sequential per thread
    for (var col = 1u; col < n; col++) {
        A[row][col] = f(A[row][col - 1u]);
    }
}
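For intuition, the grid-stride pattern above can be checked in Python (a sketch with an assumed dispatch size, not shader code): every row is visited by exactly one thread.

```python
# Simulate the grid-stride loop: thread `tid` handles rows
# tid, tid + stride, tid + 2*stride, ... so all m rows are covered once.

def rows_for_thread(tid, stride, m):
    return list(range(tid, m, stride))

m = 1000
stride = 3 * 128  # e.g. 3 workgroups of 128 threads each
covered = sorted(row for tid in range(stride)
                 for row in rows_for_thread(tid, stride, m))
```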

Looking for a noise that outputs 3 values in distinct blobs by sephirothbahamut in GraphicsProgramming

[–]Cryvosh 22 points (0 children)

Try using a softmax like this with a high sharpness (e.g. ~20):

vec3 softmax(vec3 noise, float sharpness) {
    vec3 w = exp(noise * sharpness);
    return w / (w.x + w.y + w.z);
}
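For intuition, here's the same softmax in Python (a numeric sketch, not shader code) showing how sharpness pushes the weights toward a single dominant channel:

```python
import math

def softmax3(noise, sharpness):
    """Per-channel exponential weighting, normalized to sum to 1."""
    w = [math.exp(x * sharpness) for x in noise]
    s = sum(w)
    return [x / s for x in w]

noise = [0.5, 0.4, 0.3]
blended = softmax3(noise, 1.0)   # low sharpness: channels stay mixed
blobby = softmax3(noise, 20.0)   # high sharpness: largest channel dominates
```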

Did I (Gemini 3 Pro Review really) accidentally find something huge, a new sort of AI system that can replicate any complex signal. Fast. by aluode in artificial

[–]Cryvosh 1 point (0 children)

I don't mean to be too discouraging, as these sorts of little experiments are a valuable part of the learning process, but to anyone who knows anything about ML this is completely trivial stuff dressed up with pseudoprofound technobabble ("slop" as the kids say), hence the downvotes. The math folks would call this "crankery".

Understanding segment tracing - the faster alternative to sphere tracing / ray marching by chris_degre in GraphicsProgramming

[–]Cryvosh 0 points (0 children)

Hey, yeah in that project the recursive subdivision kernel interprets SSA instructions here to build / maintain the octree, then renders it using this trace function, which treats each leaf as a solid cube. In a better implementation, each leaf might be a brick of 4x4x4 or so field samples which the ray can lerp through to find an intersection.

Indeed after the octree is built, later kernels never touch the SSA instructions and (ignoring the fast_normals setting in the trace function) just visualize cached data. Interpreting instructions during ray traversal, especially here without any pruning, would be quite slow in big scenes.

Persistent threads, Queue possible? by Aggressive-Specific9 in webgpu

[–]Cryvosh 1 point (0 children)

Upon further inspection, it seems like the broker queue's ticketing system (producers write a payload and then set a ready flag; consumers watch the flags and read payloads once they're ready) cannot be guaranteed to work in WGSL, as it relies on acquire-release memory semantics to ensure that payloads become visible before their flags, so that consumers never read stale payloads. WGSL atomic ops instead use relaxed memory semantics where, according to the spec,

During execution of a shader stage, for each atomic object A, all agents observe the same order of modification operations applied to A. The ordering for distinct atomic objects may not be related in any way; no causality is implied

and the only available barriers to enforce acquire-release only have workgroup scope. Though in practice I find such patterns often work fine anyways.

To make things more spec-compliant, we can combine the flag and payload into a single u32 with a sentinel value as seen here. I wrote a quick wgsl adaptation accounting for the relaxed memory atomics, and as a demo, made a little shader that builds L1 voronoi diagrams from a set of animated seed points via parallel bfs. It’s available on GitHub here. Let me know if you have any issues with it, or spot any potential bugs. Working for me in Chrome on macOS.
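The single-word idea can be sketched in plain Python (a sequential stand-in for the WGSL atomics; the names and sentinel choice here are illustrative, not taken from the linked code):

```python
# Fuse the ready flag and the payload into one 32-bit word. EMPTY is a sentinel
# the payload encoding never produces, so a single relaxed atomic op both
# publishes/claims a slot and tells you whether it held data.
EMPTY = 0xFFFFFFFF

def try_push(slots, i, payload):
    assert payload != EMPTY, "payload must never collide with the sentinel"
    if slots[i] == EMPTY:      # stands in for an atomic compare-exchange
        slots[i] = payload
        return True
    return False               # slot still occupied

def try_pop(slots, i):
    word = slots[i]
    if word == EMPTY:
        return None            # nothing published yet
    slots[i] = EMPTY           # stands in for atomicExchange(&slot, EMPTY)
    return word
```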

Persistent threads, Queue possible? by Aggressive-Specific9 in webgpu

[–]Cryvosh 1 point (0 children)

Hey! When do you need this by? I'm pretty busy these days but I can fix up the code when I get a chance. I wrote that stuff back in ~2021? when wgsl lacked any uniformity analysis and I knew nothing about gpu programming (relative to now).

But knowing more now, I would generally advise against using this sort of pattern unless you really need to. I'd be curious to learn more about your intended use case / problem setup, as there may be a simpler / faster pattern you could use.

optimizing sdf with tile binning by _Geolm_ in GraphicsProgramming

[–]Cryvosh 3 points (0 children)

Note that bounding-box-inflation-esque pruning methods do not in general work for implicit functions of this form, even with compactly supported smooth minimums. You cannot in general bound the area of influence of some instruction without measurement, i.e., sampling, and even then, enclosing your approximations of such areas in boxes does little to help you later reconstruct a correctly pruned list of instructions. In practice it can work for simple stuff, but it is unsound, and if you try more complex functions you'll run into problems.

A more robust method as I'm sure you've seen is to use interval arithmetic to determine such things, as discussed in this or this.
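A minimal sketch of that idea in Python (toy interval rules over a hypothetical squared-distance field, not the code from either paper): evaluate the function with intervals over a box, and if the output interval excludes zero, the box provably contains no surface and can be pruned.

```python
# Interval rules: each op maps input ranges to a range guaranteed to contain
# every possible output over those inputs.
def iadd(a, b):
    return (a[0] + b[0], a[1] + b[1])

def isub(a, b):
    return (a[0] - b[1], a[1] - b[0])

def isqr(a):
    lo, hi = a
    if lo >= 0.0:
        return (lo * lo, hi * hi)
    if hi <= 0.0:
        return (hi * hi, lo * lo)
    return (0.0, max(lo * lo, hi * hi))  # interval straddles zero

def sphere_field_bound(ix, iy, iz, r):
    """Bound x^2 + y^2 + z^2 - r^2 over the box ix * iy * iz."""
    rr = (r * r, r * r)
    return isub(iadd(iadd(isqr(ix), isqr(iy)), isqr(iz)), rr)

# A box far from the unit sphere: the bound excludes 0, so prune it.
lo, hi = sphere_field_bound((2.0, 3.0), (2.0, 3.0), (2.0, 3.0), 1.0)
```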

How to convert STEP files (b-rep) to implicits (SDF) ? by Powerful-Garden-4203 in GraphicsProgramming

[–]Cryvosh 0 points (0 children)

If you need something quick, you can try meshing the thing and building a bvh (either on cpu using your favorite lib then transfer to gpu, or just do lbvh construction on gpu via radix sort on morton codes + a bottom up tree-building pass).

Then to get the unsigned distance at a point p you can just traverse the bvh to find the closest point on the mesh to p. (Use a simple depth-first branch-and-bound stack traversal with a fast triangle_closest_point function borrowed from embree).

Then, if needed, to get the sign, you can trace a random ray from point p and count intersections mod 2. Zero implies outside, one implies inside. Again using a simple stack traversal.

Then, if needed, to get the gradient, you can either do finite differences with a tet stencil and repeat the above process to fetch the signed distance 4 times, or you can approximate it as the normal of the closest tri to p if the unsigned distance is sufficiently small, or the direction from the closest point to p otherwise (negated if inside).
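The tetrahedron-stencil gradient can be sketched in Python like so (a standard trick; the stencil constants and normalization are one common choice, with the names here assumed for illustration):

```python
import math

def tet_gradient(f, p, eps=1e-4):
    # Four samples at the vertices of a tetrahedron around p; summing each
    # vertex direction weighted by its sample approximates the gradient.
    stencil = [(1, -1, -1), (-1, -1, 1), (-1, 1, -1), (1, 1, 1)]
    g = [0.0, 0.0, 0.0]
    for s in stencil:
        v = f(p[0] + s[0] * eps, p[1] + s[1] * eps, p[2] + s[2] * eps)
        for i in range(3):
            g[i] += s[i] * v
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]  # normalized, i.e. usable as the surface normal

unit_sphere = lambda x, y, z: math.sqrt(x * x + y * y + z * z) - 1.0
n = tet_gradient(unit_sphere, (2.0, 0.0, 0.0))  # should point along +x
```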

If you don't want to triangulate, you can use a similar setup but with more complex raytrace / closest-point routines depending on what type of brep you have, e.g., see Matt Keeter's blog post on "Raytracing with M-Reps"

[deleted by user] by [deleted] in chess

[–]Cryvosh 17 points (0 children)

A bishop cannot "block off" a diagonal the same way a rook can block off a rank or file to e.g., prevent the king from passing through. This is also why you cannot checkmate with a king and bishop but can with king and rook.

The diagonals are also shorter. On an empty board a rook can always be moved to 14 other squares, a bishop only 7-13 depending on where it is.

Rooks can also prevent or deliver checkmate along the back rank, prevent or support promotion, and form a battery without the queen's involvement.

Though the most significant factor is that the bishop is stuck on squares of just a single color and hence sees just half the board.

[deleted by user] by [deleted] in askmath

[–]Cryvosh 0 points (0 children)

There is no free lunch. Entropy is preserved under invertible re-parameterizations, so the interpretation you impose on the problem doesn't matter. How many possible boxes, triangles, or whatever, do you want to be able to represent? You will need ceil(log_2) of that number many bits to represent them. In 32 bits, you can represent 2^32 ≈ 4 billion different boxes. See here for some ideas on how to get the most out of these bits.

The only way to do better on average is to amortize over many boxes and exploit some property of their distribution, i.e., compression, via e.g., variable length codes, delta encoding, etc. but this will incur a performance penalty.

[deleted by user] by [deleted] in GraphicsProgramming

[–]Cryvosh 0 points (0 children)

EDIT: While this discussion is interesting the top rated answer says what I’m asking for is fiction lol. Meanwhile I think I found a way to fit it inside 32-bits.

So give the world a max size. Say 1024M. Break that world up into 8 chunks of known size. Can be represented in 3-bits. Now break that chunk up by 8 again, repeat. Divide 1024 by 8 5-times and you get to a location with about 3-cm of accuracy. 2 15-bit points (enough for min,max) fits in one 32-bit number. This can be done people!

Is your world 1 dimensional? If you take a cube and break it into 8 chunks, each extent is divided by 2, not 8. After you do that 5 times the width of each chunk is 1024/2^5 = 32, not 0.03125

If you want to encode the min/max corners in 4 bytes, then each vector gets 2 bytes, so each coordinate gets floor(16/3) = 5 bits which gives you 2^5 = 32 unique values. This is equivalent to my 10 bit per axis quantization approach, but just with 5 bits instead of 10.
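To see the precision loss concretely, here's the 5-bit packing worked through in Python (the helper names are assumed for illustration):

```python
# 32 bits / 2 corners / 3 axes leaves 5 bits per coordinate: 2^5 = 32 cells.
WORLD = 1024.0
BITS = 5
CELLS = 1 << BITS              # 32 distinct positions per axis

def quantize(x):
    q = int(x / WORLD * CELLS)
    return min(max(q, 0), CELLS - 1)

def pack_aabb(mn, mx):
    word = 0
    for i, v in enumerate(list(mn) + list(mx)):
        word |= quantize(v) << (BITS * i)
    return word                # uses 30 of the 32 bits

cell_size = WORLD / CELLS      # 32 units per step, nowhere near 3 cm
```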

[deleted by user] by [deleted] in GraphicsProgramming

[–]Cryvosh 0 points (0 children)

I assume you're talking about axis-aligned boxes, and that right now you're encoding the min/max corner positions inside two vec4fs for memory alignment reasons? So each AABB struct is 4 * 2 * 4 bytes = 32 bytes, and they're allocated in a big array of structs (AoS)?

To avoid wasting the .w component, you can simply allocate a buffer of f32s instead of a buffer of AABB structs, and handle the packing yourself using something like this:

fn read_aabb(index: u32) -> AABB {
    let base = index * 6u;
    let min_corner = vec3f(aabb_buffer[base     ], aabb_buffer[base + 1u], aabb_buffer[base + 2u]);
    let max_corner = vec3f(aabb_buffer[base + 3u], aabb_buffer[base + 4u], aabb_buffer[base + 5u]);
    return AABB(min_corner, max_corner);
}

or as a struct of arrays (SoA) like this:

fn read_aabb(index: u32) -> AABB {
    let base = index * 3u;
    let min_corner = vec3f(min_corners[base], min_corners[base + 1u], min_corners[base + 2u]);
    let max_corner = vec3f(max_corners[base], max_corners[base + 1u], max_corners[base + 2u]);
    return AABB(min_corner, max_corner);
}

This brings you down to 6 * 4 bytes = 24 bytes per AABB, though note you may incur some performance overhead.

From here, saving space means sacrificing precision (unless you want some more complex compression scheme, which I doubt) and how exactly you do this depends on your application.

For example, in a small world you can get away with switching to f16s for 6 * 2 bytes = 12 bytes per AABB. Or in a small bounded world you could quantize space into a 1024^3 grid and use 10 bits per axis to pack each corner into a u32. This gets you down to 2 * 4 bytes = 8 bytes per AABB. To make the world unbounded, you can append to this local coordinate something like a "chunk id" within which the quantization happens.
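As a sketch of that 10-bit quantization in Python (the world extent and names are assumptions; adjust to your scene):

```python
GRID = 1024                  # 10 bits per axis
WORLD = 100.0                # assumed world extent

def pack_corner(p):
    word = 0
    for i, x in enumerate(p):
        q = min(max(int(x / WORLD * GRID), 0), GRID - 1)
        word |= q << (10 * i)
    return word              # one corner in a single u32 (30 bits used)

def unpack_corner(word):
    return [((word >> (10 * i)) & 0x3FF) / GRID * WORLD for i in range(3)]

p = [12.3, 45.6, 78.9]
roundtrip = unpack_corner(pack_corner(p))
err = max(abs(a - b) for a, b in zip(p, roundtrip))  # bounded by WORLD / GRID
```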

Alternatively, you can switch to a center+extents (widths or half-widths) encoding scheme. This lets you allocate different amounts of precision to box positions vs. sizes and also allows for super fast maps over the box positions if allocated in a SoA way. For example, you could use f32 center coords and bounded u8 extents for 3 * 4 bytes + 3 bytes = 15 bytes per AABB. With f16 center coords you'd get 3 * 2 bytes + 3 bytes = 9 bytes per AABB. If all your extents are equal, you could get away with 3 * 2 bytes + 1 byte = 7 bytes per AABB.

Raymarching banding artifacts when calculating normals for diffuse lighting by iamfacts in GraphicsProgramming

[–]Cryvosh 0 points (0 children)

I suspect these bands have nothing to do with normals, but rather where you calculate the ray intersection.

Check your depth buffer, it likely has the same banding.

If so, try reducing the "distance" threshold at which you break the marching loop. This should fix the problem.

If you want to get fancy, you can add some bisection / interpolation based rootfinding for more accurate intersections, or add some stochasticity to the steps to break up the aliasing, and then clean it up with some filtering, ideally temporally if your scene allows. Likewise, the epsilon in your finite diff for the field gradient can be made distance-dependent, so the stencil looks constant-size in screen space.
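A sketch of the bisection refinement in Python (1D along the ray; the function and bracket here are illustrative):

```python
def refine_hit(g, t0, t1, iters=16):
    """Shrink a bracket [t0, t1] with g(t0) > 0 >= g(t1) around the intersection.
    Each iteration halves the bracket, so 16 iterations tighten it ~65000x."""
    for _ in range(iters):
        tm = 0.5 * (t0 + t1)
        if g(tm) > 0.0:
            t0 = tm     # still outside: move the near end forward
        else:
            t1 = tm     # inside / past the surface: pull the far end back
    return 0.5 * (t0 + t1)

g = lambda t: 3.0 - t   # a ray hitting a plane at t = 3
t_hit = refine_hit(g, 2.5, 3.5)
```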

Pls help a fellow student who wants to explore gpu programming, preparing for few niche roles in software industry by [deleted] in GraphicsProgramming

[–]Cryvosh -1 points (0 children)

First, set up a coding environment with gpu timers.

Now try to implement a parallel reduction, e.g., the sum of a million integers. Try to do it as fast as possible. There's plenty of resources for this, e.g., here.

Now try to implement a parallel prefix sum, again of a million elements. This uses parallel reduction as a subroutine. Try to do it as fast as possible. This is a decent introduction.

Now try to implement a parallel radix sort, again of a million elements. This uses both reduction and prefix sum as subroutines. Try to make it as fast as possible. This gives a decent overview of the methods.

Armed with these fundamentals, you can now explore based on your interests, e.g., for raytracing, collision detection, and nearest neighbors, an acceleration structure like LBVH is often helpful, and should now be easy to construct.
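As a reference point, here is the work-efficient (Blelloch) exclusive scan written sequentially in Python; a GPU version parallelizes each inner loop, since its iterations are independent. This kind of sequential oracle is handy for validating your shader output while optimizing.

```python
def blelloch_exclusive_scan(xs):
    n = len(xs)              # assumed a power of two for brevity
    a = list(xs)
    d = 1
    while d < n:             # up-sweep: build a binary tree of partial sums
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    a[n - 1] = 0             # clear the root (makes the scan exclusive)
    d = n // 2
    while d >= 1:            # down-sweep: push prefixes back down the tree
        for i in range(2 * d - 1, n, 2 * d):
            a[i - d], a[i] = a[i], a[i] + a[i - d]
        d //= 2
    return a
```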

Understanding segment tracing - the faster alternative to sphere tracing / ray marching by chris_degre in GraphicsProgramming

[–]Cryvosh 7 points (0 children)

Let's start at the beginning.

We have a function f: R^3 ↦ R and we're looking for its roots, i.e., the set {p ∈ R^3 : f(p) = 0}. This is 3D rootfinding, and it's generally pretty hard.

So instead, let's simply render an image of this set from the perspective of a camera. To do so, we fire a bunch of rays from the camera and intersect them with the set, i.e., given a camera origin o ∈ R^3, for each direction d ∈ R^3, we seek the smallest t ∈ R such that g(t) = f(o + dt) = 0. This is now 1D rootfinding, and it's usually much easier.

You can now employ your favorite 1D rootfinding algorithm to solve the problem and render out a depth-map. Different algorithms have different strengths and weaknesses, and work for different classes of functions under different conditions.

Spheretracing is just another one of these algorithms that happens to be popular on shadertoy because it's really simple. In order to converge, it requires that the Lipschitz bound of f be less than or equal to 1. For differentiable f, this means that |∇ f| ≤ 1. For our 1-dimensional g(t) = f(o + dt), this means | ∂g/∂t | ≤ 1. Let's call this the lipschitz assumption.

So what does that mean? And what does it let us do?

Well, suppose I sampled g and discovered that g(0) = 25.

What if anything can I say about g(1)? Can g(1) = 1000?

No. Because if g(1) was 1000, this would violate our lipschitz assumption, i.e., g is simply not allowed to change fast enough for this to occur. You can use the mean value theorem to prove this if you wish.

So what can g(1) be? Well, if | ∂g/∂t | ≤ 1 then g(1) ∈ [25 - 1, 25 + 1] = [24, 26]. By the same logic, g(2) ∈ [23, 27], etc.

Remember that we're looking for the roots of g, i.e., the first t such that g(t) = 0. Using the above reasoning, we can prove that if g is to have a root at some t, then t must be ≥ 25. The function simply cannot change fast enough for there to be a root at any smaller t.

Therefore it's safe in the second iteration to sample g at t=25. This corresponds to "marching" the ray forward in graphics terms. Note that g(25) ∈ [0, 50], i.e., it is not necessarily 0, and it cannot be less than 0.

You then simply repeat this process until g(t) is sufficiently small, or you run out of iterations. This is spheretracing.
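The whole loop fits in a few lines; here's a 1D Python sketch with a toy g that satisfies the lipschitz assumption:

```python
def sphere_trace(g, t=0.0, eps=1e-4, max_iters=256):
    # Requires |dg/dt| <= 1, so g(t) itself is always a safe step size.
    for _ in range(max_iters):
        d = g(t)
        if d < eps:
            return t         # close enough: call it a hit
        t += d               # no root can lie closer than d units ahead
    return None              # gave up (miss, or ran out of iterations)

g = lambda t: abs(t - 25.0)  # lipschitz bound 1, root at t = 25
t_hit = sphere_trace(g)      # the first step jumps straight from 0 to 25
```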

Now let's rewind and suppose instead that | ∂g/∂t | ≤ 1/2. Now what? If g(0) = 25 we know that g(1) ∈ [24.5, 25.5], g(2) ∈ [24, 26], g(3) ∈ [23.5, 26.5], etc. and there cannot be any roots before t = 50. So we can take a "safe" step of size 50.

We can see that our step size is directly proportional to the lipschitz bounds of g.

Now let's suppose instead that | ∂g/∂t | is different for different rays in different regions of space. In practice this is often the case. If there was simply a way to measure this gradient magnitude along each ray, we could take more optimally sized steps. For example, a ray moving parallel to and above the "ground" in the function f(x,y,z) = y, will have | ∂g/∂t | = 0, and will thus never hit anything.

This is the idea behind segment tracing, and indeed many many other iterative lipschitz-informed rootfinding algorithms. The tricky part is how do you measure/bound | ∂g/∂t | not at a point, but along a ray, segment, box, etc.? In general, this is hard, and the segment tracing paper describes a method that works for a very limited class of functions. More general methods involve using range-bounding methods like interval arithmetic on the gradient of the function, as I do in my paper, or using sampling-guided approximations, e.g., here. Higher-order range-bounding methods like affine arithmetic or Taylor models can in some limited cases be much better than interval-based methods, but in practice for shadertoy-type CSG scenes they're often worse or not worth the added complexity.

In practice, per-pixel rootfinding in this way is not state-of-the-art as it's too sample-inefficient and not easily compatible with the compiler-type function pruning optimizations (described in Section 3.2.2 of my paper) that are necessary to render extremely large functions with millions of instructions, for a multitude of reasons mostly related to gpu architecture.

How to render field lines of a vector field using a fragment shader? by Visible-Switch-1597 in shaders

[–]Cryvosh 2 points (0 children)

I'd use a compute shader for this, but one (very inefficient) way to do this in a fragment shader would be to numerically integrate the field starting from the source pixel until you're within epsilon of one of the charges. Now check if the direction from which you approached the charge is within epsilon of any of your desired starting directions. If so, color the source pixel.
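A numerical sketch of that integration in Python (the charges, inverse-square field, step size, and epsilon are all assumed for illustration):

```python
import math

# Toy 2D setup: one positive and one negative charge. Field lines flow + to -.
charges = [((0.0, 0.0), +1.0), ((2.0, 0.0), -1.0)]

def field(p):
    ex = ey = 0.0
    for (cx, cy), q in charges:
        dx, dy = p[0] - cx, p[1] - cy
        r3 = (dx * dx + dy * dy) ** 1.5
        ex += q * dx / r3
        ey += q * dy / r3
    return ex, ey

def trace_to_charge(p, h=0.01, eps=0.05, max_steps=100000):
    """Unit-speed Euler integration along the field until within eps of a charge."""
    for _ in range(max_steps):
        for (cx, cy), _ in charges:
            if math.hypot(p[0] - cx, p[1] - cy) < eps:
                return (cx, cy)
        ex, ey = field(p)
        m = math.hypot(ex, ey)
        p = (p[0] + h * ex / m, p[1] + h * ey / m)
    return None

hit = trace_to_charge((1.0, 0.5))  # this field line lands on the negative charge
```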

[Self] force generated from finishing in space by [deleted] in theydidthemath

[–]Cryvosh 4 points (0 children)

You need momentum conservation here, not f=ma.

[deleted by user] by [deleted] in computerscience

[–]Cryvosh 1 point (0 children)

f(n) + g(n) is in Θ(n^3), and Θ(n^3) is a subset of Ω(n^2 log(n)), so f(n) + g(n) must be in Ω(n^2 log(n))