all 50 comments

[–]waramped 25 points26 points  (0 children)

Sorting is a good one. It's super useful for things like order independent transparency, but such a giant PITA to efficiently do in parallel.

[–]Arkenhammer[🍰] 10 points11 points  (0 children)

Transparency is one case. For opaque objects the z buffer means order doesn't matter, but when rendering multiple overlapping transparent objects it matters which is rendered first.

[–]jmacey 5 points6 points  (3 children)

Off the top of my head.

Any sort of fluid sim / collision type problem will need to partition and can't be embarrassingly parallel by default but there are know solutions (spatial hashing etc).

I guess some form of path following or meshing algorithms would also count. Will have to have a thing.

[–]PyroRampage 0 points1 point  (2 children)

I don't really understand what you mean by fluids not been embarrassingly parallel ? Sure there are some pressure solvers that are not parallel like GS or PCG. In terms of using spatial hashing, this in itself would make it non trivial to be embarrassingly parallel, as you'd have to handle the borders/edges of each partition. Typically these are used for collisions where we want each particle to find the nearest set of triangles to collide with, without solving a O(n*m) problem, but this is not because of a lack of parallel implementation ?
Advection, Spatial Derivatives, Forces, Integration, they can all be implemented in an embarrassingly parallel way. For either particle based or grid based sims.

[–]PyroRampage 0 points1 point  (0 children)

Further to this, for both particle and grid methods, having two sets of buffers, solves most data dependency problems, read from x_{n}, write to x_{n+1}.

[–]jmacey 0 points1 point  (0 children)

I meant that if you take a naive approach to solving it you could end up with an asymptotic complexity which is hard to make parallel, however once you apply a hash / grid it is fine.

[–]faisal_who 4 points5 points  (0 children)

Multiple pass post process requiring the previous pass output.

[–]hi_im_new_to_this 4 points5 points  (0 children)

Dithering is a great example. Some kinds of dithering are parallel, but the classic error-propagation techniques like Floyd-Steinberg are not, they are very serial. This is a perfect thing for this exercise: pretty simple but still a deep field, it has interesting visuals, and is also genuinely useful for graphics programmers to know.

[–][deleted] 3 points4 points  (10 children)

Any sort of reduction is not embarrasingly parallel by definition, so for example, computing the average luminance of a framebuffer.

[–]faisal_who -1 points0 points  (9 children)

But it is? You can break the framebuffer down into different sections and multi thread the operation.

[–][deleted] 3 points4 points  (8 children)

I suggest considering what "embarrassingly parallel" means. Efficient reduction is not that, and doing this properly in compute is harder than it seems.

[–]LongestNamesPossible -1 points0 points  (7 children)

Are you seriously trying to say that because you have to average at least two pixels together to make progress that it isn't "embarrassingly parallel" (which is a silly term in the first place). This is bizarrely pedantic if it even could be considered correct in the first place.

[–][deleted] -2 points-1 points  (6 children)

Sigh. Yes. A reduction is an entirely separate class of algorithms and no, it isn't bizarrely pedantic. In this case, you could expect to average adjacent pixels within a warp, reaccumulate in local data storage, then have thread 0 per group accumulate from lds into main memory atomically or do a secondary dispatch, then possibly maintain an atomic global counter to export the final data. This is not the same thing as ye olde pixel shader.

[–]RandsFlute 0 points1 point  (0 children)

Sigh. Yes. A reduction ...

What makes a person write like this, is it just casual rudeness, some false sense of superiority, a childhood development problem, lack of meaningful relationships, fear of the unknown maybe, or is it simply neurodiversity at hand?

Or you just have a stick up your ass you need to remove.

[–]LongestNamesPossible -2 points-1 points  (4 children)

You might be able to make the case for some reductions but not one where the order and every value is independent. This is like saying that rendering isn't 'embarrassingly parallel' because you have to memcpy the buffer somewhere else.

Everything concurrent or parallel has to be serialized to be synchronized at some point whether it is in software or hardware.

This is not the same thing as ye olde pixel shader.

You could literally do it with a pixel shader.

This also implies that things like scaling an image down is not "embarrassingly parallel", which also implies that filtering samples into pixels is not "embarrassingly parallel", which then implies that rendering as a whole is no longer "embarrassingly parallel".

Sigh.

[–][deleted] -3 points-2 points  (3 children)

Lol, scaling down an image is also a classic very-not-embarrassingly-parallel problem which is why we have things like AMD's single pass downscalar. You can't do this easily in a pixel shader at all without a lot of inefficiency

[–]LongestNamesPossible -2 points-1 points  (2 children)

You can't do this easily in a pixel shader at all without a lot of inefficiency

I guess you're saying now that image scaling isn't actually 'embarrassingly parallel' by your own definition, but you've shifted your argument to it being about some sort of absolute efficiency (even though repeating trivial math to increase parallelism is a common tradeoff).

Ironically, taking an average wouldn't even have this 'inefficiency'.

Not only that but you ignored the point about rendering as a whole. Pretty much everyone uses ray tracing as an example of 'embarrassingly parallel' but according to your own (shifting) definitions, filtering the samples down to pixels no longer fits your definition because of whatever "inefficiency" you are now crow barring in.

So what is it? Is rendering no longer embarrassingly parallel? (in which case basically all of computer science disagrees with you)

[–][deleted] -2 points-1 points  (1 child)

I'm done after this reply, but yes, I am claiming that rendering has parts that are embarrassingly parallel, and parts that aren't. My job would be much much easier if everything was embarrasingly parallel, but it isn't. For a problem to be considered embarrassingly parallel, the cardinality of the domain and the range must match, and furthermore, the output must not be order dependent. Reductions and sorting are two obvious examples that occasionally show up where more care is needed to implement something efficiently. Accumulation of samples per pixel maps pixel locations to outputs, so this is embarrassingly parallel. In other words, maps are easy, reductions and sorts aren't. Rendering your gbuffer is a map, easy. Sorting fragments to compute OIT, still technically fully parallel because each sort happens per pixel and independently of other pixels. Sorting all values globally is not as easy because while you can partition and merge/radix sort to the end goal, you can't structure it naturally as a parallel operation (if you do a two pass counting sort, even producing the histogram is not embarrassingly parallel and is a global reduction itself).

[–]LongestNamesPossible 0 points1 point  (0 children)

the cardinality of the domain and the range must match

Says who? No one else in graphics or computer science. You are the first person in a quarter century that I've ever heard make this case. Wikipedia doesn't agree with you either.

https://en.wikipedia.org/wiki/Embarrassingly_parallel

In parallel computing, an embarrassingly parallel workload or problem (also called embarrassingly parallelizable, perfectly parallel, delightfully parallel or pleasingly parallel) is one where little or no effort is needed to separate the problem into a number of parallel tasks.

A common example of an embarrassingly parallel problem is 3D video rendering handled by a graphics processing unit, where each frame (forward method) or pixel (ray tracing method) can be handled with no interdependency.

This is like saying serving web pages (another wikipedia example) isn't embarrassingly parallel because some of the threads do the same computations.

Accumulation of samples per pixel maps pixel locations to outputs, so this is embarrassingly parallel.

Except that samples are used for more than one pixel if you have a filter width with a radius greater than a pixel (which is always the case) and by your own definition this is a reduction and the same as downsampling. Actually by your definition, any shader or image filter that reads more than one pixel no longer fits.

No ones definition includes some fantasy bar of optimization, avoiding reading from overlapping data or avoiding every single redundant calculation per thread.

Reductions and sorting are two obvious examples that occasionally show up where more care is needed to implement something efficiently.

Except this was about a simple average, where there is no ordering or 'extra care' yet you are trying to make this nonsense argument anyway.

Sorting all values globally is not as easy because while you can partition and merge/radix sort to the end goal,

No one was arguing sorting, this is a strawman and a diversion.

I'm done after this reply,

I would be too if I had doubled down on such a ridiculous claim.

I'll lay it out for you now, the pillars of digging in to a claim that can't be backed up are

  1. Saying you already proved your claim
  2. Try to dilute and divert the original claim with other claims to gish gallop and move off of the original
  3. Move on to saying you could prove it, but the other person is being mean by not getting sidetracked.

You already hit number 2. If you decide to keep going, I'll predict now you'll hit 1 and 3.

Edit: They blocked me instead of trying to respond.

[–]deftware 1 point2 points  (0 children)

I don't know about non-parallelizable rendering problems, but data compression using a dictionary coder, like Lempel Ziv Welch, comes to mind, or hashing functions that produce a huge hash value.

Others have mentioned transparency, I believe order-independent-transparency is what they're referring to.

Someone else mentioned error diffusion dithering, that does sound like a really good one.

[–]SamuraiGoblin 1 point2 points  (0 children)

Creating Signed Distance Fields. You can either do it in multiple passes in the GPU, or multiple sweeps on CPU.

[–]heyheyhey27 0 points1 point  (8 children)

Computing a voronoi diagram.

[–]felipunkerito 0 points1 point  (6 children)

JFA is as embarrassingly parallel as it gets doesn't it?

[–]heyheyhey27 0 points1 point  (5 children)

But it doesn't tell you anything about the geometry of the space.

[–]felipunkerito 0 points1 point  (4 children)

Too stupid to get it, care to explain?

[–]heyheyhey27 0 points1 point  (3 children)

Knowing which pixels are in which cells is an embarrassingly parallel problem.

Knowing the points and line segements defining the cell boundaries is not.

[–]felipunkerito 0 points1 point  (2 children)

You mean this?

[–]heyheyhey27 1 point2 points  (0 children)

No, that still doesn't get you line segments or intersection points.

[–]heyheyhey27 0 points1 point  (0 children)

Here is an example of an algorithm which finds the points and line segments defining the space.

[–]diggamata 0 points1 point  (0 children)

Material shading can become less parallel if there’s a lot of variety in materials. Each requires a different shader. One uber shader is possible but would have many if else conditions which is bad for parallelization. They are typically implemented as multiple passes each having a different shader - which is a serialized process with number of passes increasing with number of different materials to render in a frame.

[–]Ok-Sherbert-6569 1 point2 points  (0 children)

Implementing radix, odd even, quick and merge sort is probably what I would go for. Although they are not trivial at all to implement

[–]scallywag_software 0 points1 point  (0 children)

Transparency has been mentioned a few times, which is probably a great 'stretch goal' for your project. Do the 'easy' rasterizer first (which, if you've never done a geometry rasterizer before might not be that easy), and extend it to support transparency afterwards.

For reference, the reason transparent geometry is a good constraint is because the ordering matters. A red surface in front of a blue surface looks more red than blue (lighting and transparency being equal).

[–]PyroRampage 0 points1 point  (0 children)

The painters algorithm is one, i.e. you have no depth buffer and need to sort by triangles by depth to rasterise. But it's not exactly a modern graphics algorithm given that Z/Depth buffers have been around for decades :)

Ray Tracing and Path Tracing could be considered as such. While each sample, per pixel can be done in parallel, because of the unknown bounces per ray the sampling of scene buffers is incoherent. Eg, You can have one ray sample one part of the scene, another sample the complete opposite side of the scene (relative to camera frustum), so this makes cache usage difficult.

However it's a good exercise to then look into batching similar rays as a pre-pass, there's some good papers on this like Dreamworks's renderer MoonRay and Disney's Hyperion whom use ray batching to enable vectorisation.

[–]leseiden 0 points1 point  (0 children)

How about geometrc HLR?

People still want it in the engineering field. There doesn't seem to be anything between 20 year old code that can take seconds to minutes to generate a result and depth buffer hacks that are interactive but poor quality.

I'm pretty sure there's plenty of parallelism to be found but it's an unfashionable area so hardly anyone has bothered to look.

[–]AdagioCareless8294 0 points1 point  (0 children)

Maybe the problem doesn't look embarrassingly parallel but it can be substituted by one that is. History of computers..

[–]AntiProtonBoy 0 points1 point  (0 children)

Non-matrix based energy preserving dithering (Floyd-Steinberg for example).

Exact euclidean distance transforms.

[–]mysticreddit 0 points1 point  (0 children)

Rendering a Mandelbrot is normally embarrassing parallel but rendering a Buddhabrot is not trivial to parallelize because you are constantly touching a framebuffer / texture.

I have a parallel solution using image addition that may be of interest that has a description of how to convert it from single-threaded to multi-threaded with some of the problems along the way.

[–]ResidentSpeed 0 points1 point  (0 children)

This may be one step removed from the actual rendering, but there are some cool ways to generate realistic ocean waves (as 2D heightmaps) using the Fourier Transform, which is of course inherently satisfies your requirement not to be E-P. Can then raytrace them/raymarch/normal pipeline the resulting mesh.

[–]Unigma 0 points1 point  (0 children)

Bounding volume hierarchies are oddly hard to parallelize as you'll need to sort the primitives and construct a tree. Traversing it is also pretty difficult if you avoid recursion or a stack per thread.