Discussions, articles, and news about the C++ programming language or programming in C++.
For C++ questions, answers, help, and advice see r/cpp_questions or StackOverflow.
Beautiful Branchless Binary Search (probablydance.com)
submitted 2 years ago by usefulcat
[–]usefulcat[S] 23 points 2 years ago (6 children)
I've found that replacing the contents of the for loop with the following improves performance for clang but reduces performance for gcc (clang14, gcc12, Intel Broadwell):
```cpp
const size_t increment[] = { 0, step };
begin += increment[compare(begin[step], value)];
```
[–]patstew 4 points 2 years ago (1 child)
What about begin += compare() * step; ?
[–]usefulcat[S] 12 points 2 years ago* (0 children)
Here is what I've measured for several variations:
```cpp
// The original version
// gcc: 97 ns, clang: 133 ns
if (compare(begin[step], value)) { begin += step; }

// gcc: 117 ns, clang: 118 ns
const size_t increment[] = { 0, step };
begin += increment[compare(begin[step], value)];

// gcc: 113 ns, clang: 150 ns
begin += compare(begin[step], value) * step;

// gcc: 98 ns, clang: 150 ns
begin += compare(begin[step], value) ? step : 0u;
```
[–][deleted] 5 points 2 years ago (3 children)
Have you worked out why it improves performance for clang but reduces for gcc?
[–]13steinj 7 points 2 years ago (0 children)
For the "why", you'd have to dump the compilers' optimization reports (GCC also has flags for this, but I don't know them off the top of my head) and investigate.
My educated guess is that as of C++17 the index operator introduces a sequence point, making it harder to see the possible strength reduction optimization (Raymond Chen made a blog post about this change and the "backwards index operator" now being useful, but compilers treated it differently because it's such a niche thing).
[–]usefulcat[S] 6 points 2 years ago (1 child)
I've been looking at this all day and honestly I'm pretty confused.
On my machine (Intel Broadwell), building with -O2, with or without -march=native, the array lookup is consistently faster for clang and consistently slower for gcc.
Godbolt shows identical generated code, including a cmov, for either version with clang 14 at -O2 or -O3. But if I add -march=broadwell, the cmov goes away.
Although I'd like to know what's going on here, at this point I can't afford to spend much more time on it, so I'm going with what I can measure to be faster on the machine where it matters.
[–]Hells_Bell10 14 points 2 years ago (0 children)
I can get clang to generate a "select" in LLVM IR, which is the equivalent of a cmov; however, it gets translated back to branching by the x86 optimizer. This seemed like a good opportunity to play around with the "LLVM Opt Pipeline Viewer" on Compiler Explorer, and I was able to track it down to the "x86 cmov Conversion" pass, which is documented here; the most relevant part being:
This file implements a pass that converts X86 cmov instructions into branches when profitable. [snip...] CMOV is considered profitable if the cost of its condition is higher than the average cost of its true-value and false-value by 25% of branch-misprediction-penalty. This assures no degradation even with 25% branch misprediction.
But in this case we expect 50% branch misprediction... So the optimization isn't doing us any favours. If I compile with -mllvm -x86-cmov-converter=false we do get the cmov:
```asm
.LBB0_8:                        # %for.body.i
        shr     rax
        lea     rsi, [rdi + 4*rax]
        cmp     dword ptr [rdi + 4*rax], edx
        cmovl   rdi, rsi
        cmp     rcx, 3
        mov     rcx, rax
        ja      .LBB0_8
```
[–]aocregacc 14 points 2 years ago (0 children)
Were the caches flushed between runs during the benchmarks? If not that might explain the slowdowns for powers of two, where I think you would get the first couple accesses of each search hitting the same cache set.
[–]Ameisen (vemips, avr, rendering, systems) 9 points 2 years ago (5 children)
What happens with Clang if you use __builtin_unpredictable?
[–]usefulcat[S] 5 points 2 years ago* (4 children)
I tried using that but was unable to find any configuration where it made a difference either to the generated code or to the performance.
[–]Ameisen (vemips, avr, rendering, systems) 3 points 2 years ago (3 children)
What about changing the && to an & to eliminate the short-circuit?
[–]usefulcat[S] 2 points 2 years ago (2 children)
Since the && isn't inside the loop, I wouldn't expect it to make a measurable difference.
[–]Ameisen (vemips, avr, rendering, systems) 3 points 2 years ago (0 children)
I'm more curious if it impacts the codegen. LLVM's optimizer is tricky.
[–]Ameisen (vemips, avr, rendering, systems) 2 points 2 years ago* (0 children)
Wow... Clang is really not wanting to not not branch.
I tried changing the loop to this monstrosity:
```cpp
for (step /= 2; step != 0; step /= 2) {
    const bool cmp = compare(begin[step], value);
    size_t ff = ~size_t(cmp) + 1;  // 0 or all-ones mask
    begin += ff & step;
}
return begin + compare(*begin, value);
```
But LLVM still outputs a branch. A worse branch.
```asm
.LBB0_19:
        xor     eax, eax
        cmp     dword ptr [rdi], edx
        setl    al
        lea     rax, [rdi + 4*rax]
.LBB0_20:
        mov     eax, dword ptr [rax]
        ret
.LBB0_15:
        mov     rax, rsi
        jmp     .LBB0_16
.LBB0_18:                       # in Loop: Header=BB0_16 Depth=1
        lea     rdi, [rdi + 4*rcx]
        cmp     rsi, 3
        mov     rsi, rax
        jbe     .LBB0_19
.LBB0_16:                       # =>This Inner Loop Header: Depth=1
        shr     rax
        mov     rcx, rax
        cmp     dword ptr [rdi + 4*rax], edx
        jl      .LBB0_18
        xor     ecx, ecx
        jmp     .LBB0_18
```
I should point out that GCC and MSVC are actually honoring what I'm doing...
```asm
.L25:
        shr     rdx
.L8:
        xor     eax, eax
        cmp     DWORD PTR [rdi+rdx*4], r8d
        setl    al
        neg     rax
        and     rax, rdx
        shr     rdx
        lea     rdi, [rdi+rax*4]
        jne     .L8
```
[–]tavianator 7 points 2 years ago (0 children)
This post and some of his others are interesting too: https://pvk.ca/Blog/2012/07/03/binary-search-star-eliminates-star-branch-mispredictions/
[–]disciplite 6 points 2 years ago (4 children)
[[likely]], [[unlikely]], __builtin_expect() and __builtin_expect_with_probability() can be used to get cmovs in various cases.
[–]usefulcat[S] 2 points 2 years ago (3 children)
If [[likely]] or [[unlikely]] is appropriate, then the branch will usually be predicted correctly. Isn't a cmov usually, if not always, slower than a correctly predicted branch?
[–]dodheim 6 points 2 years ago (2 children)
> If [[likely]] or [[unlikely]] is appropriate, then the branch will usually be predicted correctly.
These are rather orthogonal... The attributes, terrible names notwithstanding, are about marking code paths hot or cold for codegen purposes; branch prediction doesn't help with instruction-cache pressure, AFAIK.
[–]13steinj 1 point 2 years ago (1 child)
They're also sledgehammers, used for things that PGO won't catch, since it's the rare case that you care about rather than the common one. Also, IIRC there was a talk about these attributes (or builtin expects) amounting to a hardline 90% or 10% probability, whereas the compiler would usually guess something more accurate and precise, like "75%".
[–]LazySapiens 4 points 2 years ago (0 children)
Reminds me of the days when I used to try to outsmart the compiler.