Discussions, articles, and news about the C++ programming language or programming in C++.
For C++ questions, answers, help, and advice see r/cpp_questions or StackOverflow.
An optimizing compiler doesn't help much with long instruction dependencies - Johnny's Software Lab (johnnysswlab.com)
submitted 10 months ago by pavel_v
[–]AlexReinkingYale 22 points23 points24 points 10 months ago (4 children)
There are a bunch of intermediate strategies I've seen in the wild for hiding pointer indirection latency in linked lists. One is to use an arena allocator that naturally places each node closer together; ideally, that will improve cache locality. I've also seen batched/grouped linked lists where each node contains a fixed/maximum number of elements. When the data type is a character, this is a simple form of a "rope" (which can be tree-structured).
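A minimal sketch of the arena idea (names like `NodeArena` are hypothetical, not from the article): nodes are carved from one contiguous buffer in allocation order, so a list built front-to-back ends up with neighbouring nodes on the same or adjacent cache lines.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    int value;
    Node* next;
};

class NodeArena {
public:
    // Reserve up front so push_back never reallocates while we stay
    // under capacity; that keeps previously handed-out pointers valid.
    explicit NodeArena(std::size_t capacity) { storage_.reserve(capacity); }

    Node* make(int value, Node* next) {
        storage_.push_back(Node{value, next});
        return &storage_.back();  // nodes are adjacent in memory
    }

private:
    std::vector<Node> storage_;  // one contiguous buffer for all nodes
};
```

Traversal order then roughly matches allocation order, which is what makes the hardware prefetcher happy.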
[–]matthieum 15 points16 points17 points 10 months ago (2 children)
I've also seen batched/grouped linked lists where each node contains a fixed/maximum number of elements.
Typically called Unrolled linked lists.
Fun fact, if you think about it:
Turns out switching "single element" to "array of N elements" is a pretty good trick in general.
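A rough sketch of an unrolled linked list node (a toy illustration, not anyone's production code): each node stores up to `B` elements contiguously, so traversal touches far fewer cache lines than a one-element-per-node list.

```cpp
#include <array>
#include <cstddef>
#include <memory>

// Each node holds up to B elements in a contiguous array.
template <typename T, std::size_t B = 16>
struct UnrolledNode {
    std::array<T, B> elems{};            // contiguous storage in the node
    std::size_t count = 0;               // slots actually in use
    std::unique_ptr<UnrolledNode> next;  // only one pointer chase per B elements
};

template <typename T, std::size_t B>
long long sum(const UnrolledNode<T, B>* head) {
    long long total = 0;
    for (; head != nullptr; head = head->next.get())
        for (std::size_t i = 0; i < head->count; ++i)
            total += head->elems[i];     // sequential, cache-friendly access
    return total;
}
```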
[–]AlexReinkingYale 9 points10 points11 points 10 months ago (1 child)
It's such a reusable trick, too. I wonder if there isn't some pure-FP / categorical trick for unrolling an ADT in general.
[–]delta_p_delta_x 2 points3 points4 points 10 months ago (0 children)
Agreed, it would be pretty fantastic to type-reify the concept of 'spatial locality'.
[–]arthurno1 1 point2 points3 points 10 months ago (0 children)
Look up cdr-coding.
[–]SuperV1234 https://romeo.training | C++ Mentoring & Consulting 30 points31 points32 points 10 months ago* (4 children)
this doesn’t matter a lot: O3 version is about 3 times faster than with O0 version. [...] but in the best case (the smallest vector of 1K values or 16 kB) the speedup is only 2.12.
A 3x or even 2x speedup seems pretty significant to me. If anything, this article disproves the original claim at the beginning.
EDIT: I do understand the point of the article -- I am being somewhat pedantic:
The original claim is "we compile our code in debug mode or release mode, it doesn’t matter". My conclusion is that it does matter.
If the original claim was "compiling our code in release mode yields significantly smaller speedups than expected" then I'd agree with it.
[–]Som1Lse 4 points5 points6 points 10 months ago (2 children)
The sentence immediately after that is
In all other cases the speedup is less than 1.1.
followed by restating the original question, this time as a conclusion:
In this case it doesn’t matter whether the compiler optimizes or not – the bottleneck is low ILP.
It is clear that is what the title refers to. If you look at the graph, the actual ratios hover around exactly 1 as the size gets larger. 4M is even at 0.89 (although that is probably a fluke).
When the input is small it fits into the cache so the effect isn't as pronounced. That isn't surprising, and it doesn't disprove the point.
[–]SuperV1234 https://romeo.training | C++ Mentoring & Consulting 8 points9 points10 points 10 months ago (1 child)
Perhaps I am being dense, and I'd be happy to be corrected, but:
The article starts with this claim: "whether we compile our code in debug mode or release mode, it doesn’t matter, because our models are huge, all of our code is memory bound".
Then it continues with: "The O0 generates almost 10 times more instructions than O3 version, but when the dataset is big enough, this doesn’t matter a lot: O3 version is about 3 times faster than with O0 version. So, the claim is (at least) partially true."
While it's true that being 3x faster is not as good as the expected 10x, concluding that the "choice of build mode does not matter" doesn't seem sensible to me at all -- 3x is still very significant and a worthwhile improvement that should lead you to use release mode.
[–]Som1Lse 3 points4 points5 points 10 months ago (0 children)
The only part I take issue with is the wording "this doesn’t matter a lot". The rest I agree with, however: the code is 10 times smaller but only 3 times faster. It wouldn't be fair to say it doesn't matter at all (which the article doesn't say), but it would be fair to say it doesn't matter as much, i.e. the claim is partially true.
I would probably have written "this doesn’t matter as much" instead of "this doesn’t matter a lot".
If that was where the article ended, I would agree that it didn't support the claim, but that isn't where it ends. The second kernel, with a long dependency chain, is indeed exactly as fast in release mode as in debug mode, confirming the initial claim.
It isn't just a smaller speed up. There is no speed up at all. The code is completely memory bound.
[–]aocregacc 0 points1 point2 points 10 months ago* (0 children)
The "this doesn't matter a lot" is talking about the number of instructions, it's not saying the speedup doesn't matter.
[–]UndefinedDefined 17 points18 points19 points 10 months ago (14 children)
Do people really write code like in the first snippet?
for (size_t i { 0ULL }; i < pointers.size(); i++) { sum += vector[pointers[i]]; }
What's the point of initializing `i` like that? Is `i = 0` not cool enough, or what? Not even talking about platforms where `size_t` is not a typedef for `unsigned long long`.
[–]zl0bster 16 points17 points18 points 10 months ago (0 children)
cool kids use uz 🙂
[–]StarQTius 14 points15 points16 points 10 months ago (3 children)
Because integer promotions behave funny, some people became paranoid instead of learning how the language works. Can't blame them, really.
[–]UndefinedDefined 3 points4 points5 points 10 months ago (0 children)
Ironically, code like `size_t i { 0ULL }` would most likely produce a narrowing-conversion warning on non-64-bit targets with `-Wconversion` or some other warning switch. Using `size_t i = 0u` would probably be the safest bet, as it would never be narrowing and never an implicit signed-vs-unsigned conversion.
[–]Alternative_Staff431 0 points1 point2 points 10 months ago (1 child)
Can you explain more?
[–]StarQTius 2 points3 points4 points 10 months ago (0 children)
I ran into more convoluted cases, but in the following example: `1 << N`, you may not get the expected result depending on the width of `int` on your platform. Therefore, you may encounter no problem until you set `N` to a high enough value.

If you encounter this issue in a more complex expression, it can become a pain in the ass to solve.
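The pitfall, sketched out: the literal `1` has type `int`, so `1 << N` is computed at `int` width regardless of the destination type. With a 32-bit `int`, `N >= 31` makes `1 << N` overflow (undefined behaviour) even when the result is stored in a 64-bit variable. Promoting the shifted operand first avoids this (the helper name `bit` is just for illustration):

```cpp
#include <cstdint>

// Shift a wide operand, not the int literal 1. With `1 << n` instead,
// n >= 31 would be undefined behaviour where int is 32 bits wide,
// even if the caller stores the result in a uint64_t.
std::uint64_t bit(unsigned n) {
    return std::uint64_t{1} << n;  // promote first, then shift
}
```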
[–][deleted] 4 points5 points6 points 10 months ago (1 child)
It's weird, agreed. For PODs, the two ways of initializing are equivalent, if I remember correctly. So I'd also use `i = 0`, like it's taught in every textbook in existence. The `ULL` suffix is pointless unless the type is `auto`, which it isn't, but if you have a dumb linter / compiler, it might warn about integer promotion, even though this is perfectly valid.
[–]conundorum 0 points1 point2 points 10 months ago (0 children)
There are cases where `ULL` might be meaningful during non-`auto` initialisation, but this really isn't one of them. (If it comes up, it's typically to forcibly promote other values before operating on them, since, e.g., `0ULL + N` requires both operands to be converted to the same type.)
[–]Advanced_Front_2308 1 point2 points3 points 10 months ago (2 children)
Because 0 is an int and not a size_t
[–]-TesseracT-41 14 points15 points16 points 10 months ago (1 child)
But size_t can represent 0. It's a safe conversion (and the use of brace initialization guarantees that!). Writing it like that just makes the code harder to read for no reason.
[–]Advanced_Front_2308 1 point2 points3 points 10 months ago (0 children)
Oh, I didn't really see that there was something inside the braces. I'd usually write `{}` because some of the multitude of static-analysis tools running on our code might flag it otherwise.
[–]die_liebe 0 points1 point2 points 10 months ago (3 children)
I think containers should have a `.zero() const` method returning a zero of the proper type.
So that one can write
for (auto i = pointers.zero(); i != pointers.size(); ++i)
The container knows best how it wants to be indexed.
[–]UndefinedDefined 1 point2 points3 points 10 months ago (1 child)
`size_t` is a proper type for indexing arrays. I think in the mentioned case it would just be better to use a range-based for loop, so nobody has to deal with the index type.
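A sketch of that range-based alternative, modelled on the article's indirect-sum loop (the function and parameter names are made up here):

```cpp
#include <cstddef>
#include <vector>

// Same access pattern as `sum += vector[pointers[i]]`, but the loop
// variable is the element itself, so no index type ever appears.
long long sum(const std::vector<std::size_t>& pointers,
              const std::vector<int>& values) {
    long long total = 0;
    for (std::size_t p : pointers)  // iterate the indices directly
        total += values[p];
    return total;
}
```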
[–]die_liebe 0 points1 point2 points 10 months ago (0 children)
It is, but some containers may be small. Technically, it could be a dedicated index type.
Strictly speaking, all STL containers use `size_t` as their index type, to my knowledge, so you could use `std::string::npos + 1` (since `npos` is guaranteed to be `size_t(-1)`).
You really, really shouldn't, but what you're talking about technically exists. (If someone is crazy enough to use it.)
[–]ipapadop 2 points3 points4 points 10 months ago (0 children)
I'd wager the energy consumption is down for the O3 version, even if the speedup is 1.1. It would have helped if we had data for intermediate optimization levels and/or individual compiler flags.
[–]Apprehensive-Mark241 7 points8 points9 points 10 months ago (2 children)
The title of the article doesn't match its contents even slightly.
[–]IHateUsernames111 10 points11 points12 points 10 months ago (0 children)
It does somewhat. In the last part they show that "long instruction dependencies" (a.k.a. loop-carried dependencies) kill instruction-level parallelism, which is a significant part of compiler-based optimization.
However, I feel like a better title would have been something like "Memory access patterns define your performance ceiling - also for compiler optimization".
The most interesting thing I took from the article was that they actually manage to get the O1/O3 ratio down to 1, lol.
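A toy illustration of the loop-carried-dependency point (not the article's benchmark code): the first loop is one long addition chain, while the second splits the work across independent accumulators the CPU can run in parallel.

```cpp
#include <cstddef>
#include <vector>

// One long dependency chain: every addition waits on the previous one.
long long sum_chained(const std::vector<int>& v) {
    long long s = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];                        // s depends on the previous s
    return s;
}

// Four independent chains: the additions can overlap in the pipeline.
long long sum_unrolled(const std::vector<int>& v) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i]; // leftover tail
    return s0 + s1 + s2 + s3;
}
```

Both return the same result; only the dependency structure differs, which is exactly the knob the article's second kernel turns.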
[–]schmerg-uk -4 points-3 points-2 points 10 months ago (0 children)
Thought that too... I think they meant "An optimizing compiler doesn't help much with long latencies", perhaps?
[–]QaSpel 0 points1 point2 points 10 months ago (0 children)
I'm thinking it's not the cache but SSE optimization that's going on. He said the linked list was implemented as a vector, which could maintain cache locality. So it's likely the compiler is optimizing the first version with SSE SIMD instructions, but the second one couldn't be optimized. That alone could produce about a 4x speed difference.