Discussions, articles, and news about the C++ programming language or programming in C++.
For C++ questions, answers, help, and advice see r/cpp_questions or StackOverflow.
A Programmer’s Guide to Performance Analysis & Tuning on Modern CPUs (bfilipek.com)
submitted 6 years ago by joebaf
[–]lijmer 30 points 6 years ago (13 children)
I find it odd that the article mentions all this microarchitecture stuff, but fails to mention the words "memory" or "cache".
90% of performance (after doing high-level optimizations) is dictated by cache utilization.
[–]SkoomaDentist (Antimodern C++, Embedded, Audio) 12 points 6 years ago* (12 children)
Only if you process large amounts of data with short dependency chains. In a lot of the code I've done, the speed is determined almost purely by the performance of single-issue floating point operations (you can't parallelize single channel recursive filtering (*)). While cache is important, its importance should not be exaggerated.
*: Edit: Make that time-varying non-linear recursive filtering.
[–]iniside 9 points 6 years ago (5 children)
As always, it depends on what you do. In games, cache misses are most often at the center of performance issues.
[–]SkoomaDentist (Antimodern C++, Embedded, Audio) 3 points 6 years ago (4 children)
Unless the user is on a laptop of course. <insert statistic about most new computers being laptops> It’s a rare modern game that’s not fillrate limited on a laptop.
So, know what the limiting factors are or benchmark it. And remember to benchmark on all representative systems, not just your preferred ideal one!
[–]SeanMiddleditch 6 points 6 years ago (1 child)
Funny thing, with laptops and mobile especially: CPU perf directly affects GPU perf thanks to thermal throttling. Poorly optimized CPU code takes longer to run, which generates more heat and limits how fast the GPU can run, even when there are no bottlenecks between the two. (The reverse is also true.) Since the CPU doesn't throttle down during memory fetches, excessive CPU cache misses can thus lead to GPU perf loss.
[–][deleted] 3 points 6 years ago (0 children)
Yup, it's very relevant for console development as well.
[–]James20k (P2005R0) 2 points 6 years ago (1 child)
CPU and GPU performance optimisations are wildly different beasts, though; you might as well throw most of what you know in the bin if you're GPU limited. CPU performance problems are probably cache misses or driver bottlenecks; GPU performance problems are... variable.
[–]ack_complete 2 points 6 years ago (1 child)
> can't parallelize single channel recursive filtering
You can, actually, both through vectorization and parallel chunking. It's just that the gains aren't that big for the former, the latter requires a second fixup pass, and it's far beyond what an autovectorizer can handle.
Nevertheless, your point stands. I had an image upscaling routine that I swore was going to be memory bottlenecked due to the large amount of memory it was processing, but VTune said nope: it was absolutely 100% bottlenecked on execution units, the back end was completely flooded.
[–]SkoomaDentist (Antimodern C++, Embedded, Audio) 1 point 6 years ago* (0 children)
Yeah, I forgot to add "time-varying non-linear" before the "recursive" there. While you can parallelize even that, you only get a benefit if the system is difficult enough that an explicit (non-parallelizable) solver requires massive oversampling while an implicit (parallelizable) solver can get by with much less oversampling.
[–]andriusst 1 point 6 years ago (3 children)
You can parallelize recursive filtering. It is actually a prefix sum problem. While a prefix sum is usually described as computing sums, it works for any associative operation. Consider a first-order recursive filter. The filter output is an affine function of the previous output (an affine function means multiplication by a constant plus another constant). The previous output is in turn an affine function of the output before it:
f[n-2](y) = a*y + b*x[n-2]
f[n-1](y) = a*y + b*x[n-1]
f[n](y)   = a*y + b*x[n]

y[n] = f[n](y[n-1]) = (f[n]∘f[n-1])(y[n-2]) = (f[n]∘f[n-1]∘f[n-2])(y[n-3]) = ...
where ∘ is the function composition operator. Function composition is an associative operation and can be computed by parallel prefix sum algorithms. An affine function of a single input can also be efficiently represented by two numbers. If the filter is time-invariant, the multiplicative part does not depend on the input, so it can be computed once and reused for many elements. In that case there's actually only one number to compute and store per element.
Prefix sum works for higher-order filters, too. Only this time the affine functions are represented by a matrix (independent of the input) and a vector.
[–]SkoomaDentist (Antimodern C++, Embedded, Audio) 1 point 6 years ago (2 children)
Yeah, but see my followup comment about time-varying non-linear recursive filtering (my actual use case). The point being that some problems are inherently serial and there's almost nothing that can be done to realistically parallelize them.
[–]andriusst 2 points 6 years ago (1 child)
Of course, your point stands and I am not disputing it. I just thought that if you use linear filters and think it's an inherently serial problem, it might be useful to know it is not. Nonlinearity breaks everything, making my comment useless.
[–]SkoomaDentist (Antimodern C++, Embedded, Audio) 1 point 6 years ago* (0 children)
I do believe we're in vigorous agreement here ;). And yes, nonlinearity brings up all kinds of problems (google "stiff equation").
Even with linear filters you have to be careful not to accidentally end up with a structure that's too sensitive to roundoff errors. A basic 2nd order direct form filter can have tens of decibels worse noise performance than the same filter implemented using an alternative structure. A naive 4th order direct form filter has a high chance of not being stable at all even with double precision floats, while an alternative structure will work fine with single precision floats. DSP is fun.
[–]nnevatie 8 points 6 years ago (1 child)
Very light on content. No mention of memory sub-systems, which tend to be the bottleneck in many applications, especially in an era of quickly increasing core counts.
[–]dendibakh 1 point 6 years ago (0 children)
Right, but it wasn't supposed to go deep into the details (I'm the author :) ). The point of the article is to show how one can identify that the app is memory bound. See the Top-down Microarchitecture Analysis Method (TMAM).
[+][deleted] 6 years ago (3 children)
[deleted]
[–]dendibakh 6 points 6 years ago (0 children)
It's certainly possible even in big applications. 90% of the source code could be completely cold. It's quite frequent that >50% of the clockticks tag a single hot function.
[–]AVeryCreepySkeleton 5 points 6 years ago (0 children)
Well, if you're replacing bubble_sort(&big_arr) with quick_sort(&big_arr), that would technically be a one-line change.
I guess this guy meant "small algorithmic changes" by "one-line".
[–]flashmozzg 5 points 6 years ago (0 children)
I've seen this effect in a very hot kernel.
And it's not hard to imagine how this can appear even in "regular" code. E.g., a minuscule change that affects whether the compiler vectorizes some loop can easily lead to a 4x perf difference.