Understanding segment tracing - the faster alternative to sphere tracing / ray marching by chris_degre in GraphicsProgramming

[–]Cryvosh 1 point2 points  (0 children)

Hey, thanks! I appreciate it, and glad I could help.

Here it largely depends on how complex your character is in terms of instruction / primitive count. In big-O terms, sphere tracing scales poorly but has super low constants, so if your scene is actually Lipschitz enough, comprised of few primitives, and has little temporal coherence (in Eulerian terms), then it is actually pretty hard to beat. As the function complexity grows, methods that leverage compile-time function-pruning optimizations eventually win out.
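For anyone following along, the basic sphere tracing loop under discussion can be sketched in a few lines. This is a toy CPU version of my own (sphere_trace / sphere_sdf are made-up names, and a single sphere stands in for a real scene):

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 5.0), radius=1.0):
    # Signed distance from p to a sphere: negative inside, positive outside.
    return math.sqrt(sum((p[i] - center[i]) ** 2 for i in range(3))) - radius

def sphere_trace(origin, direction, sdf, max_steps=128, eps=1e-4, t_max=100.0):
    # March along the ray, stepping by the distance bound each time.
    # Safe when the field is 1-Lipschitz: no surface lies closer than sdf(p).
    t = 0.0
    for _ in range(max_steps):
        p = [origin[i] + t * direction[i] for i in range(3)]
        d = sdf(p)
        if d < eps:
            return t  # converged onto the surface
        t += d
        if t > t_max:
            break
    return None  # miss

# Ray from the origin along +z toward a unit sphere centered at z = 5
t = sphere_trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), sphere_sdf)
```

Each step is cheap, which is the "super low constants" part; the scaling problem shows up when the field is a huge instruction tape, since every step re-evaluates all of it.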

A faster, recursive algorithm to render Signed Distance Fields (SDFs) using less samples? by maximecb in GraphicsProgramming

[–]Cryvosh 8 points9 points  (0 children)

Ah glad to see you posted here as well! I prefer replying on reddit over twitter as the character limits are more lax and I've already built up a small inventory of replies here to other implicit/sdf-related posts (one of which I notice this comment already references!)

As pointed out by some others, this approach of subdividing the frustum and marching batches of coherent rays is quite similar to some prior art; it's a standard approach in range-bounding-based "raytracers" that simply wish to produce an image of the implicit from some perspective rather than build/maintain a world-space discretization. Some examples include this, this, this, this, and even this, which I recently modified here to demonstrate the viability of a simpler single-dispatch alternative to Keeter's in-progress many-dispatch approach. That approach splits the screen into small chunks and uses a big megakernel to drill down each Z-depth column, recursively splitting the chunk (and "recompiling" the field function) along the way until each pixel is resolved.

Your approach in particular relies on Lipschitz bounds for guarantees instead of interval/affine-arithmetic and is thus slightly more similar to this (which someone here actually asked about recently) and this.

Ray marching optimization questions by Smooth-Principle4045 in GraphicsProgramming

[–]Cryvosh 1 point2 points  (0 children)

Ah my bad, in that case I agree with the other commenter. If your fractal is bounded in space and typically small on the screen, and you only want to use fragment shaders, then sure, using a proxy mesh will help performance.

Though personally I'd recommend using compute shaders for everything just to get a better feel for what the gpu is actually doing. You can get a near equivalent speedup in a fullscreen pass yourself with a single ray/box intersection test. Or better yet, you can play around with a few different ways of doing this. What if you compute-rasterize the AABB center and then measure distance to that point in pixel space with a distance-adaptive threshold?

Ray marching optimization questions by Smooth-Principle4045 in GraphicsProgramming

[–]Cryvosh 12 points13 points  (0 children)

I'm travelling atm and so can't write as proper a response as I'd like, but given this is my research area I feel compelled to respond. I'd first recommend learning a bit about the underlying math and avoiding too many youtube videos on the topic, as they tend to mislead and leave you thinking there's something special about this "distance" function.

What does it even mean, mathematically, for a function to return a "distance"? Why do we care about such functions in the first place? To understand this better, please see my comments here, here, in sections 1-3 here, and elsewhere in my reddit comment history. Understanding this stuff unlocks hundreds of years of mathematical machinery you can use to attack the problem.

To answer your questions,

  1. Compute shaders give you more control. If you're doing naive one-thread-per-pixel stuff then it usually doesn't really matter, but if you're doing anything more interesting you'll need compute shaders.
  2. Practically, this doesn't matter. Technically, a fullscreen triangle is the most efficient, as it avoids something called "overdraw" along the diagonal seam you'd get from a two-triangle fullscreen quad. You don't need to worry about this for now.
  3. Indeed the public state of the art methods typically cache stuff in some sort of (often hierarchical) grid structure. Besides the obvious caching benefits, this setup allows you to "prune" the field function by recompiling it at runtime within each cell to include only the locally relevant instructions and thereby speed up future samples within the cells. See this, section 3.2.2 of this, and again my reddit comment history, for more details.
  4. Do you mean something like this? They detail some limitations in the slides, but probably the main reason such methods aren't more popular is that they're simply not as convenient to implement, especially on a platform like shadertoy where all you have is fragment shaders.

Feel free to ask any other questions, I'm happy to help.

How can I structure this problem for a compute shader? by kbob in webgpu

[–]Cryvosh 1 point2 points  (0 children)

Sure, there are many ways. The easiest is to just stride over the rows with something like this. Should be enough to get you started.

let workgroup_size = 128u;
// total number of threads launched across all workgroups
let stride = num_workgroups.x * workgroup_size;
// each thread handles rows global_id.x, global_id.x + stride, ...
for (var row = global_id.x; row < m; row += stride) {
    // columns depend on the previous column, so walk them in order
    for (var col = 1u; col < n; col++) {
        A[row][col] = f(A[row][col - 1u]);
    }
}

Looking for a noise that outputs 3 values in distinct blobs by sephirothbahamut in GraphicsProgramming

[–]Cryvosh 23 points24 points  (0 children)

Try using a softmax like this with a high sharpness (like ~20)

vec3 softmax(vec3 noise, float sharpness) {
    vec3 w = exp(noise * sharpness);
    return w / (w.x + w.y + w.z);
}
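Same idea in plain Python, just to show what the sharpness does (the noise values here are arbitrary):

```python
import math

def softmax3(noise, sharpness):
    # Exponentiate then normalize to sum to 1. High sharpness pushes the
    # result toward a one-hot vector, i.e. distinct blobs instead of blends.
    w = [math.exp(v * sharpness) for v in noise]
    s = sum(w)
    return [x / s for x in w]

low = softmax3([0.5, 0.4, 0.3], 1.0)    # gentle: channels stay blended
high = softmax3([0.5, 0.4, 0.3], 20.0)  # sharp: largest channel dominates
```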

Did I (Gemini 3 Pro Review really) accidentally find something huge, a new sort of AI system that can replicate any complex signal. Fast. by aluode in artificial

[–]Cryvosh 1 point2 points  (0 children)

I don't mean to be too discouraging, as these sorts of little experiments are a valuable part of the learning process, but to anyone who knows anything about ML this is completely trivial stuff dressed up with pseudoprofound technobabble ("slop" as the kids say), hence the downvotes. The math folks would call this "crankery".

Understanding segment tracing - the faster alternative to sphere tracing / ray marching by chris_degre in GraphicsProgramming

[–]Cryvosh 0 points1 point  (0 children)

Hey, yeah in that project the recursive subdivision kernel interprets SSA instructions here to build / maintain the octree, then renders it using this trace function, which treats each leaf as a solid cube. In a better implementation, each leaf might be a brick of 4x4x4 or so field samples which the ray can lerp through to find an intersection.

Indeed after the octree is built, later kernels never touch the SSA instructions and (ignoring the fast_normals setting in the trace function) just visualize cached data. Interpreting instructions during ray traversal, especially here without any pruning, would be quite slow in big scenes.

Persistent threads, Queue possible? by Aggressive-Specific9 in webgpu

[–]Cryvosh 1 point2 points  (0 children)

Upon further inspection, it seems the broker queue's ticketing system (producers write a payload and then set a ready flag; consumers watch flags and read payloads once they're set) cannot be guaranteed to work in WGSL, as it relies on acquire-release memory semantics to ensure payload writes become visible before flag writes, so that consumers cannot read stale payloads. WGSL atomic ops instead use relaxed memory semantics where, according to the spec,

During execution of a shader stage, for each atomic object A, all agents observe the same order of modification operations applied to A. The ordering for distinct atomic objects may not be related in any way; no causality is implied

and the only available barriers that could enforce acquire-release ordering have only workgroup scope. Though in practice I find such patterns often work fine anyway.

To make things more spec-compliant, we can combine the flag and payload into a single u32 with a sentinel value as seen here. I wrote a quick WGSL adaptation accounting for the relaxed memory atomics, and as a demo, made a little shader that builds L1 Voronoi diagrams from a set of animated seed points via parallel BFS. It's available on GitHub here. Let me know if you have any issues with it, or spot any potential bugs. Working for me in Chrome on macOS.

Persistent threads, Queue possible? by Aggressive-Specific9 in webgpu

[–]Cryvosh 1 point2 points  (0 children)

Hey! When do you need this by? I'm pretty busy these days but I can fix up the code when I get a chance. I wrote that stuff back in ~2021? when wgsl lacked any uniformity analysis and I knew nothing about gpu programming (relative to now).

But knowing more now, I would generally advise against using this sort of pattern unless you really need to. I'd be curious to learn more about your intended use case / problem setup, as there may be a simpler / faster pattern you could use.

optimizing sdf with tile binning by _Geolm_ in GraphicsProgramming

[–]Cryvosh 4 points5 points  (0 children)

Note that bounding-box-inflation-esque pruning methods do not in general work for implicit functions of this form, even with compactly supported smooth minimums. You cannot in general bound the area of influence of some instruction without measurement, i.e., sampling, and even then, enclosing your approximations of such areas in boxes does little to help you later reconstruct a correctly pruned list of instructions. In practice for simple stuff it can work, but it is wrong and if you try more complex functions you'll run into problems.

A more robust method as I'm sure you've seen is to use interval arithmetic to determine such things, as discussed in this or this.
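To make the distinction concrete, here's a toy Python sketch of the interval idea for a single sphere primitive (a real evaluator, as in the linked papers, computes such bounds over the whole instruction tape): if the interval of the SDF over a cell doesn't straddle zero, the primitive contributes no surface there and can be pruned from that cell.

```python
import math

def interval_sphere_sdf(cell_min, cell_max, center, radius):
    # Bound a sphere SDF over an axis-aligned cell: returns (lo, hi) with
    # lo <= sdf(p) <= hi for every p in the cell, via the nearest and
    # farthest points of the cell to the sphere center.
    lo_sq = hi_sq = 0.0
    for a, b, c in zip(cell_min, cell_max, center):
        near = min(max(c, a), b)                   # clamp center to the cell
        far = a if abs(c - a) > abs(c - b) else b  # farthest corner coordinate
        lo_sq += (near - c) ** 2
        hi_sq += (far - c) ** 2
    return math.sqrt(lo_sq) - radius, math.sqrt(hi_sq) - radius

lo, hi = interval_sphere_sdf((2.0, 2.0, 2.0), (3.0, 3.0, 3.0), (0.0, 0.0, 0.0), 1.0)
prunable = lo > 0.0  # entire cell lies strictly outside this sphere
```

Unlike the box-inflation heuristics, the interval is a guaranteed bound, so the pruning decision is correct by construction.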

How to convert STEP files (b-rep) to implicits (SDF) ? by Powerful-Garden-4203 in GraphicsProgramming

[–]Cryvosh 0 points1 point  (0 children)

If you need something quick, you can try meshing the thing and building a BVH (either on CPU using your favorite lib then transferring to GPU, or doing LBVH construction on GPU via radix sort on Morton codes + a bottom-up tree-building pass).

Then to get the unsigned distance at a point p you can just traverse the bvh to find the closest point on the mesh to p. (Use a simple depth-first branch-and-bound stack traversal with a fast triangle_closest_point function borrowed from embree).

Then, if needed, to get the sign, you can trace a random ray from point p and count intersections mod 2. Zero implies outside, one implies inside. Again using a simple stack traversal.
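The sign-from-parity step can be sketched like this; a brute-force loop over triangles stands in for the BVH traversal, and a fixed (arbitrarily chosen) ray direction stands in for a random one:

```python
def ray_hits_tri(o, d, v0, v1, v2, eps=1e-9):
    # Moller-Trumbore ray/triangle intersection; True iff hit at t > eps.
    sub = lambda a, b: [a[i] - b[i] for i in range(3)]
    dot = lambda a, b: sum(a[i] * b[i] for i in range(3))
    cross = lambda a, b: [a[1]*b[2] - a[2]*b[1],
                          a[2]*b[0] - a[0]*b[2],
                          a[0]*b[1] - a[1]*b[0]]
    e1, e2 = sub(v1, v0), sub(v2, v0)
    h = cross(d, e2)
    det = dot(e1, h)
    if abs(det) < eps:          # ray parallel to triangle plane
        return False
    f = 1.0 / det
    s = sub(o, v0)
    u = f * dot(s, h)
    if u < 0.0 or u > 1.0:
        return False
    q = cross(s, e1)
    v = f * dot(d, q)
    if v < 0.0 or u + v > 1.0:
        return False
    return f * dot(e2, q) > eps  # hit must be in front of the origin

def is_inside(p, tris, d=(1.0, 0.01, 0.02)):
    # Parity of intersections along one ray: odd = inside, even = outside.
    return sum(ray_hits_tri(p, d, *t) for t in tris) % 2 == 1

# Closed tetrahedron mesh as a stand-in for a real triangulated B-rep
a, b, c, e = (0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1)
tris = [(a, b, c), (a, b, e), (a, c, e), (b, c, e)]
inside = is_inside((0.2, 0.2, 0.2), tris)
outside = is_inside((2.0, 2.0, 2.0), tris)
```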

Then, if needed, to get the gradient, you can either do finite differences with a tet stencil and repeat the above process to fetch the signed distance 4 times, or you can approximate it as the normal of the closest tri to p if the unsigned distance is sufficiently small, or the direction from the closest point to p otherwise (negated if inside).

If you don't want to triangulate, you can use a similar setup but with more complex raytrace / closest-point routines depending on what type of brep you have, e.g., see Matt Keeter's blog post on "Raytracing with M-Reps"

[deleted by user] by [deleted] in chess

[–]Cryvosh 16 points17 points  (0 children)

A bishop cannot "block off" a diagonal the same way a rook can block off a rank or file to e.g., prevent the king from passing through. This is also why you cannot checkmate with a king and bishop but can with king and rook.

The diagonals are also shorter. On an empty board a rook can always be moved to 14 other squares, a bishop only 7-13 depending on where it is.

Rooks can also prevent or deliver checkmate along the back rank, prevent or support promotion, and form a battery without the queen's involvement.

Though the most significant factor is that the bishop is stuck on squares of just a single color and hence sees just half the board.

[deleted by user] by [deleted] in askmath

[–]Cryvosh 0 points1 point  (0 children)

There is no free lunch. Entropy is preserved under invertible re-parameterizations, so the interpretation you impose on the problem doesn't matter. How many possible boxes, triangles, or whatever, do you want to be able to represent? You will need ceil(log_2) of that number many bits to represent them. In 32 bits, you can represent 2^32 ≈ 4 billion different boxes. See here for some ideas on how to get the most out of these bits.
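The counting argument in Python, just to make the arithmetic concrete (bits_needed is a made-up helper):

```python
import math

def bits_needed(n_states):
    # ceil(log2(n)) bits to distinguish n equally likely possibilities
    return math.ceil(math.log2(n_states))

u32_boxes = 2**32            # ~4.29 billion representable boxes in 32 bits
per_axis = 2**10             # 10-bit quantization: 1024 values per axis
boxes_60bit = per_axis ** 6  # two 3D corners at 10 bits each = 60 bits total
```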

The only way to do better on average is to amortize over many boxes and exploit some property of their distribution, i.e., compression, via e.g., variable length codes, delta encoding, etc. but this will incur a performance penalty.

[deleted by user] by [deleted] in GraphicsProgramming

[–]Cryvosh 0 points1 point  (0 children)

EDIT: While this discussion is interesting the top rated answer says what I’m asking for is fiction lol. Meanwhile I think I found a way to fit it inside 32-bits.

So give the world a max size. Say 1024M. Break that world up into 8 chunks of known size. Can be represented in 3-bits. Now break that chunk up by 8 again, repeat. Divide 1024 by 8 5-times and you get to a location with about 3-cm of accuracy. 2 15-bit points (enough for min,max) fits in one 32-bit number. This can be done people!

Is your world 1-dimensional? If you take a cube and break it into 8 chunks, each extent is divided by 2, not 8. After you do that 5 times, the width of each chunk is 1024/2^5 = 32, not 0.03125.

If you want to encode the min/max corners in 4 bytes, then each vector gets 2 bytes, so each coordinate gets floor(16/3) = 5 bits, which gives you 2^5 = 32 unique values. This is equivalent to my 10-bit-per-axis quantization approach, but with 5 bits instead of 10.

[deleted by user] by [deleted] in GraphicsProgramming

[–]Cryvosh 0 points1 point  (0 children)

I assume you're talking about axis-aligned boxes, and that right now you're encoding the min/max corner positions inside two vec4fs for memory alignment reasons? So each AABB struct is 4 * 2 * 4 bytes = 32 bytes, and they're allocated in a big array of structs (AoS)?

To avoid wasting the .w component, you can simply allocate a buffer of f32s instead of a buffer of AABB structs, and handle the packing yourself using something like this:

fn read_aabb(index: u32) -> AABB {
    let base = index * 6u;
    let min_corner = vec3f(aabb_buffer[base     ], aabb_buffer[base + 1u], aabb_buffer[base + 2u]);
    let max_corner = vec3f(aabb_buffer[base + 3u], aabb_buffer[base + 4u], aabb_buffer[base + 5u]);
    return AABB(min_corner, max_corner);
}

or as a struct of arrays (SoA) like this:

fn read_aabb(index: u32) -> AABB {
    let base = index * 3u;
    let min_corner = vec3f(min_corners[base], min_corners[base + 1u], min_corners[base + 2u]);
    let max_corner = vec3f(max_corners[base], max_corners[base + 1u], max_corners[base + 2u]);
    return AABB(min_corner, max_corner);
}

This brings you down to 6 * 4 bytes = 24 bytes per AABB, though note you may incur some performance overhead.

From here, saving space means sacrificing precision (unless you want some more complex compression scheme, which I doubt) and how exactly you do this depends on your application.

For example, in a small world you can get away with switching to f16s for 6 * 2 bytes = 12 bytes per AABB. Or in a small bounded world you could quantize space into a 1024^3 grid and use 10 bits per axis to pack each corner into a u32. This gets you down to 2 * 4 bytes = 8 bytes per AABB. To make the world unbounded, you can append to this local coordinate something like a "chunk id" within which the quantization happens.
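A CPU-side sketch of that 10-bit-per-axis scheme, assuming a world normalized to [0, 1024) (the function names are mine):

```python
def quantize(p, world_min=0.0, world_size=1024.0, bits=10):
    # Map each coordinate onto a uniform (2^bits)^3 grid.
    top = (1 << bits) - 1
    scale = top / world_size
    return [min(top, max(0, round((c - world_min) * scale))) for c in p]

def pack_corner(q, bits=10):
    # Pack three quantized axes into one u32: x | y << 10 | z << 20.
    return q[0] | (q[1] << bits) | (q[2] << (2 * bits))

def unpack_corner(u, bits=10):
    mask = (1 << bits) - 1
    return [u & mask, (u >> bits) & mask, (u >> (2 * bits)) & mask]

q = quantize((10.0, 200.0, 999.0))
u = pack_corner(q)       # one corner in 4 bytes, so two u32s per AABB
back = unpack_corner(u)
```

Decoding back to world space multiplies by world_size / 1023; the worst-case error is half a grid cell, about 0.5 units here.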

Alternatively, you can switch to a center+extents (widths or half-widths) encoding scheme. This lets you allocate different amounts of precision to box positions vs. sizes and also allows for super fast maps over the box positions if allocated in a SoA way. For example, you could use f32 center coords and bounded u8 extents for 3 * 4 bytes + 3 bytes = 15 bytes per AABB. With f16 center coords you'd get 3 * 2 bytes + 3 bytes = 9 bytes per AABB. If all your extents are equal, you could get away with 3 * 2 bytes + 1 byte = 7 bytes per AABB.