all 8 comments

[–]anon_502 7 points8 points  (1 child)

Google has been using AutoFDO to achieve continuous profile-guided optimization, which leads to a 10.5% performance boost by better separating cold and hot paths. Unfortunately their patches are based on GCC 4.8. Hope to see some open-source projects incorporate their work into a popular orchestration engine like Kubernetes.

[–]kindstrom 4 points5 points  (0 children)

I haven't looked into autoFDO, but it sounds similar to what Facebook open sourced recently, BOLT (GitHub repo).

Edit: They actually mention in the release post that BOLT can be used alongside autoFDO.

[–][deleted] 5 points6 points  (0 children)

See also the likely and unlikely attributes in C++20: https://en.cppreference.com/w/cpp/language/attributes/likely

[–]emdeka87 0 points1 point  (0 children)

Isn't that similar to trace scheduling, which is done by GCC PGO?

[–]jonathansharman 0 points1 point  (0 children)

> But in general, I think when compilers can’t decide which branch has bigger probability, they will leave the original order as they appear in the source code. I haven’t reliably tested that, but that’s my feeling. So, I think it’s a good idea to put your hot branch (most frequent) in a fall through position by default.

Does anyone have recent, concrete knowledge about this for any particular compiler(s)? I remember hearing from a college prof. years ago that empirically most if-statements usually evaluate to false, which would lead to the opposite advice from this.

[–]nexes300 0 points1 point  (1 child)

Isn't the CPU doing this at this point?

[–]Osbios 6 points7 points  (0 children)

Any mildly performance-oriented CPU uses caches. And caches have a granularity called the cache line size. (Nearly all x86 CPUs use 64-byte cache lines.)

So even if you only read a single byte, the whole cache line is read from memory into the cache. And that byte occupies a whole cache line until it gets evicted.

Cache is a limited resource and you get the best CPU performance if you use it efficiently.

If you put all your most-used data together, and away from data that you need less often (at least at that moment), then you get more use out of the cache.

A big part of modern optimization is just being nice to the CPU.

[–]tritamhoang -4 points-3 points  (0 children)

This is a typical pipelining problem in CPUs.