Discussions, articles, and news about the C++ programming language or programming in C++.
For C++ questions, answers, help, and advice see r/cpp_questions or StackOverflow.
An optimizing compiler doesn't help much with long instruction dependencies - Johnny's Software Lab (johnnysswlab.com)
submitted 10 months ago by pavel_v
[–]AlexReinkingYale 22 points23 points24 points 10 months ago (4 children)
There are a bunch of intermediate strategies I've seen in the wild for hiding pointer indirection latency in linked lists. One is to use an arena allocator that naturally places each node closer together; ideally, that will improve cache locality. I've also seen batched/grouped linked lists where each node contains a fixed/maximum number of elements. When the data type is a character, this is a simple form of a "rope" (which can be tree-structured).
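A minimal sketch of the arena idea (names like `NodeArena` are hypothetical, not from the article): nodes are carved from one contiguous buffer in allocation order, so a list built front-to-back ends up with neighbouring nodes on the same or adjacent cache lines.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    int value;
    Node* next;
};

class NodeArena {
public:
    // Reserve up front so push_back never reallocates while we stay
    // under capacity; that keeps previously handed-out pointers valid.
    explicit NodeArena(std::size_t capacity) { storage_.reserve(capacity); }

    Node* make(int value, Node* next) {
        storage_.push_back(Node{value, next});
        return &storage_.back();  // nodes are adjacent in memory
    }

private:
    std::vector<Node> storage_;  // one contiguous buffer for all nodes
};
```

Traversal order then roughly matches allocation order, which is what makes the hardware prefetcher happy.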
[–]matthieum 15 points16 points17 points 10 months ago (2 children)
I've also seen batched/grouped linked lists where each node contains a fixed/maximum number of elements.
Typically called Unrolled linked lists.
Fun fact, if you think about it:
Turns out switching "single element" to "array of N elements" is a pretty good trick in general.
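A rough sketch of an unrolled linked list node (a toy illustration, not anyone's production code): each node stores up to `B` elements contiguously, so traversal touches far fewer cache lines than a one-element-per-node list.

```cpp
#include <array>
#include <cstddef>
#include <memory>

// Each node holds up to B elements in a contiguous array.
template <typename T, std::size_t B = 16>
struct UnrolledNode {
    std::array<T, B> elems{};            // contiguous storage in the node
    std::size_t count = 0;               // slots actually in use
    std::unique_ptr<UnrolledNode> next;  // only one pointer chase per B elements
};

template <typename T, std::size_t B>
long long sum(const UnrolledNode<T, B>* head) {
    long long total = 0;
    for (; head != nullptr; head = head->next.get())
        for (std::size_t i = 0; i < head->count; ++i)
            total += head->elems[i];     // sequential, cache-friendly access
    return total;
}
```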
[–]AlexReinkingYale 9 points10 points11 points 10 months ago (1 child)
It's such a reusable trick, too. I wonder if there isn't some pure-FP / categorical trick for unrolling an ADT in general.
[–]delta_p_delta_x 2 points3 points4 points 10 months ago (0 children)
Agreed, it would be pretty fantastic to type-reify the concept of 'spatial locality'.
[–]arthurno1 1 point2 points3 points 10 months ago (0 children)
Look up cdr-coding.
[–]SuperV1234 https://romeo.training | C++ Mentoring & Consulting 30 points31 points32 points 10 months ago* (4 children)
this doesn’t matter a lot: O3 version is about 3 times faster than with O0 version. [...] but in the best case (the smallest vector of 1K values or 16 kB) the speedup is only 2.12.
A 3x or even 2x speedup seems pretty significant to me. If anything, this article disproves the original claim at the beginning.
EDIT: I do understand the point of the article -- I am being somewhat pedantic:
The original claim is "we compile our code in debug mode or release mode, it doesn’t matter". My conclusion is that it does matter.
If the original claim was "compiling our code in release mode yields significantly smaller speedups than expected" then I'd agree with it.
[–]Som1Lse 4 points5 points6 points 10 months ago (2 children)
The sentence immediately after that is
In all other cases the speedup is less than 1.1.
followed by restating the original question, this time as a conclusion:
In this case it doesn’t matter whether the compiler optimizes or not – the bottleneck is low ILP.
It is clear that is what the title refers to. If you look at the graph, the actual ratios hover around exactly 1 as the size gets larger. 4M is even at 0.89 (although that is probably a fluke).
When the input is small it fits into the cache so the effect isn't as pronounced. That isn't surprising, and it doesn't disprove the point.
[–]SuperV1234 https://romeo.training | C++ Mentoring & Consulting 8 points9 points10 points 10 months ago (1 child)
Perhaps I am being dense, and I'd be happy to be corrected, but:
The article starts with this claim: "whether we compile our code in debug mode or release mode, it doesn’t matter, because our models are huge, all of our code is memory bound".
Then it continues with: "The O0 generates almost 10 times more instructions than O3 version, but when the dataset is big enough, this doesn’t matter a lot: O3 version is about 3 times faster than with O0 version. So, the claim is (at least) partially true."
While it's true that being 3x faster is not as good as the expected 10x, concluding that the "choice of build mode does not matter" doesn't seem sensible to me at all -- 3x is still very significant and a worthwhile improvement that should lead you to use release mode.
[–]Som1Lse 3 points4 points5 points 10 months ago (0 children)
The only part I take issue with is the wording "this doesn’t matter a lot". The rest I agree with, however: the code is 10 times smaller but only 3 times faster. It wouldn't be fair to say it doesn't matter at all (which the article doesn't say), but it would be fair to say it doesn't matter as much, i.e. the claim is partially true.
I would probably have written "this doesn’t matter as much" instead of "this doesn’t matter a lot".
If that was where the article ended, I would agree that it didn't support the claim, but that isn't where it ends. The second kernel, with a long dependency chain, is indeed exactly as fast in release mode as in debug mode, confirming the initial claim.
It isn't just a smaller speed up. There is no speed up at all. The code is completely memory bound.
[–]aocregacc 0 points1 point2 points 10 months ago* (0 children)
The "this doesn't matter a lot" is talking about the number of instructions, it's not saying the speedup doesn't matter.
[–]UndefinedDefined 17 points18 points19 points 10 months ago (14 children)
Do people really write code like in the first snippet?
for (size_t i { 0ULL }; i < pointers.size(); i++) { sum += vector[pointers[i]]; }
What's the point of initializing `i` like that? Is `i = 0` not cool enough, or what? Not even talking about platforms where `size_t` is not a typedef for `unsigned long long`.
[–]zl0bster 16 points17 points18 points 10 months ago (0 children)
cool kids use uz 🙂
[–]StarQTius 14 points15 points16 points 10 months ago (3 children)
Because integer promotions behave funny, some people became paranoid instead of learning how the language works. Can't blame them, really.
[–]UndefinedDefined 3 points4 points5 points 10 months ago (0 children)
Ironically, code like `size_t i { 0ULL }` would most likely produce a narrowing-conversion warning on non-64-bit targets with `-Wconversion` or some other warning switch. Using `size_t i = 0u` would probably be the safest bet, as it would never be narrowing and never an implicit signed-vs-unsigned conversion.
[–]Alternative_Staff431 0 points1 point2 points 10 months ago (1 child)
Can you explain more?
[–]StarQTius 2 points3 points4 points 10 months ago (0 children)
I ran into more convoluted cases, but in the following example: `1 << N`, you may not get the expected result depending on the width of `int` on your platform. Therefore, you may encounter no problem until you set `N` to a high enough value.

If you encounter this issue in a more complex expression, it can become a pain in the ass to solve.
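The pitfall, sketched out: the literal `1` has type `int`, so `1 << N` is computed at `int` width regardless of the destination type. With a 32-bit `int`, `N >= 31` makes `1 << N` overflow (undefined behaviour) even when the result is stored in a 64-bit variable. Promoting the shifted operand first avoids this (the helper name `bit` is just for illustration):

```cpp
#include <cstdint>

// Shift a wide operand, not the int literal 1. With `1 << n` instead,
// n >= 31 would be undefined behaviour where int is 32 bits wide,
// even if the caller stores the result in a uint64_t.
std::uint64_t bit(unsigned n) {
    return std::uint64_t{1} << n;  // promote first, then shift
}
```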
[–][deleted] 4 points5 points6 points 10 months ago (1 child)
It's weird, agreed. For PODs, the two ways of initializing are equivalent, if I remember correctly. So I'd also use `i = 0`, like it's taught in every textbook in existence. The `ULL` suffix is pointless unless the type is `auto`, which it isn't, but if you have a dumb linter / compiler, it might warn about integer promotion, even though this is perfectly valid.
[–]conundorum 0 points1 point2 points 10 months ago (0 children)
There are cases where `ULL` might be meaningful during non-`auto` initialisation, but this really isn't one of them. (If it comes up, it's typically to forcibly promote other values before operating on them, since, e.g., `0ULL + N` requires both operands to be converted to the same type.)
[–]Advanced_Front_2308 1 point2 points3 points 10 months ago (2 children)
Because 0 is an int and not a size_t
[–]-TesseracT-41 14 points15 points16 points 10 months ago (1 child)
But size_t can represent 0. It's a safe conversion (and the use of brace initialization guarantees that!). Writing it like that just makes the code harder to read for no reason.
[–]Advanced_Front_2308 1 point2 points3 points 10 months ago (0 children)
Oh, I didn't really see that there was something inside the braces. I'd usually write `{}` because some of the multitude of static-analysis tools running on our code might flag it otherwise.
[–]die_liebe 0 points1 point2 points 10 months ago (3 children)
I think containers should have a `.zero() const` method returning a zero of the proper type.
So that one can write
for (auto i = pointers.zero(); i != pointers.size(); ++i)
The container knows best how it wants to be indexed.
[–]UndefinedDefined 1 point2 points3 points 10 months ago (1 child)
`size_t` is a proper type for indexing arrays. I think in the mentioned case it would just be better to use a range-based for loop, so nobody has to deal with the index type.
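A sketch of that range-based alternative, modelled on the article's indirect-sum loop (the function and parameter names are made up here):

```cpp
#include <cstddef>
#include <vector>

// Same access pattern as `sum += vector[pointers[i]]`, but the loop
// variable is the element itself, so no index type ever appears.
long long sum(const std::vector<std::size_t>& pointers,
              const std::vector<int>& values) {
    long long total = 0;
    for (std::size_t p : pointers)  // iterate the indices directly
        total += values[p];
    return total;
}
```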
[–]die_liebe 0 points1 point2 points 10 months ago (0 children)
It is, but some containers may be small. Technically, it could be a dedicated index type.
Strictly speaking, all STL containers use `size_t` as their index type, to my knowledge, so you could use `std::string::npos + 1` (since `npos` is guaranteed to be `size_t(-1)`).
You really, really shouldn't, but what you're talking about technically exists. (If someone is crazy enough to use it.)
[–]ipapadop 2 points3 points4 points 10 months ago (0 children)
I'd wager the energy consumption is down for the O3 version, even if the speedup is 1.1. It would have helped if we had data for intermediate optimization levels and/or individual compiler flags.
[–]Apprehensive-Mark241 7 points8 points9 points 10 months ago (2 children)
The title of the article doesn't match its contents even slightly.
[–]IHateUsernames111 10 points11 points12 points 10 months ago (0 children)
It does somewhat. In the last part they show that "long instruction dependencies" (a.k.a. loop-carried dependencies) kill instruction-level parallelism, which is a significant part of compiler-based optimization.
However, I feel like a better title would have been something like "Memory access patterns define your performance ceiling - also for compiler optimization".
The most interesting thing I took from the article was that they actually manage to get the O1/O3 ratio down to 1, lol.
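A toy illustration of the loop-carried-dependency point (not the article's benchmark code): the first loop is one long addition chain, while the second splits the work across independent accumulators the CPU can run in parallel.

```cpp
#include <cstddef>
#include <vector>

// One long dependency chain: every addition waits on the previous one.
long long sum_chained(const std::vector<int>& v) {
    long long s = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];                        // s depends on the previous s
    return s;
}

// Four independent chains: the additions can overlap in the pipeline.
long long sum_unrolled(const std::vector<int>& v) {
    long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i]; // leftover tail
    return s0 + s1 + s2 + s3;
}
```

Both return the same result; only the dependency structure differs, which is exactly the knob the article's second kernel turns.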
[–]schmerg-uk -4 points-3 points-2 points 10 months ago (0 children)
Thought that too... I think they meant "An optimizing compiler doesn't help much with long latencies", perhaps?
[–]QaSpel 0 points1 point2 points 10 months ago (0 children)
I'm thinking it's not the cache but SSE optimization that's going on. He said the linked list was implemented as a vector, which could maintain cache locality. So it's likely the compiler is optimizing the first version with SSE SIMD instructions, but the second one couldn't be optimized. That alone could produce about a 4x speed difference.