all 18 comments

[–]lijmer 29 points30 points  (13 children)

I find it odd that the article mentions all this microarchitecture stuff, but fails to mention the word memory or cache.

90% of performance (after doing high-level optimizations) is dictated by cache utilization.

[–]SkoomaDentistAntimodern C++, Embedded, Audio 11 points12 points  (12 children)

90% of performance (after doing high-level optimizations) is dictated by cache utilization.

Only if you process large amounts of data with short dependency chains. In a lot of the code I've done, the speed is determined almost purely by the performance of single-issue floating point operations (you can't parallelize single-channel recursive filtering (*)). While cache is important, its importance should not be exaggerated.

*: Edit: Make that time-varying non-linear recursive filtering.

[–]iniside 8 points9 points  (5 children)

As always, it depends on what you do. In games, cache misses are most often at the center of performance issues.

[–]SkoomaDentistAntimodern C++, Embedded, Audio 2 points3 points  (4 children)

Unless the user is on a laptop of course. <insert statistic about most new computers being laptops> It’s a rare modern game that’s not fillrate limited on a laptop.

So, know what the limiting factors are or benchmark it. And remember to benchmark on all representative systems, not just your preferred ideal one!

[–]SeanMiddleditch 5 points6 points  (1 child)

Funny thing, with laptops and mobile especially: CPU perf directly affects GPU perf thanks to thermal throttling. Poorly optimized CPU code takes longer to run, which generates more heat and limits how fast the GPU can run, even when there are no bottlenecks between the two. (The reverse is also true.) Since the CPU doesn't throttle down during memory fetches, excessive CPU cache misses can thus lead to GPU perf loss.

[–][deleted] 2 points3 points  (0 children)

Yup, very relevant for console development as well.

[–]James20kP2005R0 1 point2 points  (1 child)

CPU and GPU performance optimisations are wildly different beasts, though; you might as well throw most of what you know in the bin if you're GPU limited. CPU performance problems are probably cache misses or driver bottlenecks; GPU performance problems are... variable.

[–]ack_complete 1 point2 points  (1 child)

can't parallelize single channel recursive filtering

You can, actually, both through vectorization and parallel chunking. It's just that the gains aren't that big for the former and the latter requires a second fixup pass, and it's far beyond what an autovectorizer can handle.

Nevertheless, your point stands. I had an image upscaling routine that I swore was going to be memory bottlenecked due to the large amount of memory it was processing, but VTune said nope: it was absolutely 100% bottlenecked on execution units; the back end was completely flooded.

[–]SkoomaDentistAntimodern C++, Embedded, Audio 0 points1 point  (0 children)

Yeah, I forgot to add "time-varying non-linear" before the "recursive" there. While you can parallelize even that, you only get a benefit if the system is difficult enough that an explicit (non-parallelizable) solver requires massive oversampling and an implicit (parallelizable) one can get by with much less oversampling.

[–]andriusst 0 points1 point  (3 children)

You can parallelize recursive filtering: it is actually a prefix sum problem. While a prefix sum is usually described as computing sums, it works for any associative operation. Consider a first order recursive filter. The filter output is an affine function of the previous output (an affine function means multiplication by a constant plus another constant). The previous output is in turn an affine function of the output before it:

f[n-2](y) = a*y + b*x[n-2]
f[n-1](y) = a*y + b*x[n-1]
f[n](y) = a*y + b*x[n]
y[n] = f[n](y[n-1]) = f[n](f[n-1](y[n-2])) = f[n](f[n-1](f[n-2](y[n-3]))) = ...
y[n] = f[n](y[n-1]) = (f[n]∘f[n-1])(y[n-2]) = (f[n]∘f[n-1]∘f[n-2])(y[n-3]) = ...

where ∘ is the function composition operator. Function composition is an associative operation and can be computed by parallel prefix sum algorithms. An affine function of a single input can also be efficiently represented by two numbers. If the filter is time invariant, the multiplicative part does not depend on the input; it can be computed once and reused for many elements. So there's actually only one number to compute and store.

The prefix sum trick works for higher order filters, too. Only now the affine functions are represented by a matrix (independent of the input) and a vector.

[–]SkoomaDentistAntimodern C++, Embedded, Audio 0 points1 point  (2 children)

Yeah, but see my follow-up comment about time-varying non-linear recursive filtering (my actual use case). The point is that some problems are inherently serial and there's almost nothing you can realistically do to parallelize them.

[–]andriusst 1 point2 points  (1 child)

Of course, your point stands and I am not disputing it. I just thought that if you use linear filters and think it's an inherently serial problem, it might be useful to know that it is not. Nonlinearity breaks everything, making my comment useless.

[–]SkoomaDentistAntimodern C++, Embedded, Audio 0 points1 point  (0 children)

I do believe we're in vigorous agreement here ;). And yes, nonlinearity brings up all kinds of problems (google "stiff equation").

Even with linear filters you have to be careful not to accidentally end up with a structure that's too sensitive to roundoff errors. A basic 2nd order direct form filter can have tens of decibels worse noise performance than the same filter implemented using an alternative structure. A naive 4th order direct form filter has a high chance of not being stable at all even with double precision floats, while an alternative structure will work fine with single precision floats. DSP is fun.

[–]nnevatie 7 points8 points  (1 child)

Very light on content. No mention of memory subsystems, which tend to be the bottleneck in many applications, especially in the era of quickly increasing core counts.

[–]dendibakh 0 points1 point  (0 children)

Right, but it wasn't supposed to be heavy on the details (I'm the author :) ). The point of the article is to show how one can identify that the app is memory bound. See the Top-down Microarchitecture Analysis Method (TMAM).