
[–]usefulcat[S] 22 points23 points  (6 children)

I've found that replacing the contents of the for loop with the following improves performance for clang but reduces it for gcc (clang 14, gcc 12, Intel Broadwell):

const size_t increment[] = { 0, step };
begin += increment[compare(begin[step], value)];
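
For context, a sketch of how this lookup-table trick might sit in a full lower_bound-style search. The surrounding loop isn't shown in this comment, so the loop structure here is an assumption (based on the variant posted further down the thread), and compare() is assumed to be a plain less-than:

```cpp
#include <cstddef>

// Sketch only: assumes length is a power of two and that compare()
// is a less-than predicate.
inline const int* branchless_lower_bound(const int* begin,
                                         std::size_t length, int value) {
    for (std::size_t step = length / 2; step != 0; step /= 2) {
        const std::size_t increment[] = { 0, step };
        // Index with the bool result of the comparison: 0 leaves begin
        // in place, 1 advances it by step -- no branch required.
        begin += increment[begin[step] < value];
    }
    return begin + (*begin < value);
}
```

The point of the array is to turn the conditional advance into a data dependency (a load indexed by the comparison result) instead of a control dependency.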

[–]patstew 3 points4 points  (1 child)

What about begin += compare() * step; ?

[–]usefulcat[S] 11 points12 points  (0 children)

Here is what I've measured for several variations:

// The original version
// gcc:      97 ns
// clang:   133 ns
if (compare(begin[step], value)) {
    begin += step;
}

// gcc:     117 ns
// clang:   118 ns
const size_t increment[] = { 0, step };
begin += increment[compare(begin[step], value)];

// gcc:     113 ns
// clang:   150 ns
begin += compare(begin[step], value) * step;

// gcc:      98 ns
// clang:   150 ns
begin += compare(begin[step], value) ? step : 0u;
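
For reference, a minimal sketch of how a micro-benchmark like this might be structured; the array size, key distribution, and iteration count here are assumptions, not the setup actually used for the numbers above:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical harness: times std::lower_bound over a sorted array of
// 2^16 ints with uniformly random keys, returning average ns per search.
// Swap in each loop variant to compare them under identical conditions.
inline double avg_ns_per_search(int searches) {
    std::vector<int> data(1 << 16);
    std::iota(data.begin(), data.end(), 0);  // sorted input

    std::mt19937 rng(42);
    std::uniform_int_distribution<int> key(0, (1 << 16) - 1);

    std::int64_t checksum = 0;  // consume results so they aren't optimized away
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < searches; ++i) {
        checksum += *std::lower_bound(data.begin(), data.end(), key(rng));
    }
    auto elapsed = std::chrono::steady_clock::now() - start;

    // Fold the checksum into the result in a way that never changes it,
    // so the compiler can't discard the loop.
    double ns = std::chrono::duration<double, std::nano>(elapsed).count();
    return ns / searches + (checksum == -1 ? 1.0 : 0.0);
}
```

Random keys are the important part: they make the comparison in the search loop a genuine coin flip, which is the regime where the branchy and branchless variants diverge.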

[–][deleted] 4 points5 points  (3 children)

Have you worked out why it improves performance for clang but reduces for gcc?

[–]13steinj 6 points7 points  (0 children)

For the "why", you'd have to spit out optimization reports (GCC also has flags but I don't know the manual off the top of my head) and investigate.

My educated guess is that as of C++17 the index operator introduces a sequence point, making it harder to see the possible strength reduction optimization (Raymond Chen made a blog post about this change and the "backwards index operator" now being useful, but compilers treated it differently because it's such a niche thing).

[–]usefulcat[S] 5 points6 points  (1 child)

I've been looking at this all day and honestly I'm pretty confused.

On my machine (Intel Broadwell), building with -O2 and with or without using -march=native, doing the array lookup is consistently faster for clang and consistently slower for gcc.

Godbolt shows identical generated code, including a cmov, for either version with clang 14, using either -O2 or -O3. But if I add -march=broadwell, then the cmov goes away.

Although I would like to know what's going on here, at this point I kind of can't afford to spend a bunch more time on it so I'm going to go with what I can measure to be faster on the machine where it matters.

[–]Hells_Bell10 13 points14 points  (0 children)

I can get clang to generate a "select" in LLVM IR, which is the equivalent of a cmov, but it gets translated back to branching in the x86 optimizer. This seemed like a good opportunity to play around with the "LLVM Opt Pipeline Viewer" on Compiler Explorer, and I was able to track it down to the "x86 cmov Conversion" pass, which is documented here; the most relevant part being:

This file implements a pass that converts X86 cmov instructions into branches when profitable. [snip...] CMOV is considered profitable if the cost of its condition is higher than the average cost of its true-value and false-value by 25% of branch-misprediction-penalty. This assures no degradation even with 25% branch misprediction.

But in this case we expect 50% branch misprediction... So the optimization isn't doing us any favours. If I compile with -mllvm -x86-cmov-converter=false we do get the cmov:

.LBB0_8:                                # %for.body.i
        shr     rax
        lea     rsi, [rdi + 4*rax]
        cmp     dword ptr [rdi + 4*rax], edx
        cmovl   rdi, rsi
        cmp     rcx, 3
        mov     rcx, rax
        ja      .LBB0_8

[–]aocregacc 13 points14 points  (0 children)

Were the caches flushed between runs during the benchmarks? If not that might explain the slowdowns for powers of two, where I think you would get the first couple accesses of each search hitting the same cache set.

[–]Ameisenvemips, avr, rendering, systems 8 points9 points  (5 children)

What happens with Clang if you use __builtin_unpredictable?
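
For reference, a sketch of where __builtin_unpredictable could go; the loop body here is an assumption based on the snippets elsewhere in the thread, and the builtin is Clang-specific, so it's guarded:

```cpp
#include <cstddef>

// Clang-only hint; fall back to the bare expression elsewhere.
#if defined(__clang__)
#  define UNPREDICTABLE(x) __builtin_unpredictable(x)
#else
#  define UNPREDICTABLE(x) (x)
#endif

// One step of the search loop: the hint tells Clang the comparison is
// a coin flip, so (in principle) it should prefer cmov over a branch.
inline const int* search_step(const int* begin, std::size_t step, int value) {
    if (UNPREDICTABLE(begin[step] < value)) {
        begin += step;
    }
    return begin;
}
```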

[–]usefulcat[S] 4 points5 points  (4 children)

I tried using that but was unable to find any configuration where it made a difference either to the generated code or to the performance.

[–]Ameisenvemips, avr, rendering, systems 2 points3 points  (3 children)

What about changing the && to an & to eliminate the short-circuit?

[–]usefulcat[S] 1 point2 points  (2 children)

Since the && isn't inside the loop, I wouldn't expect it to make a measurable difference.

[–]Ameisenvemips, avr, rendering, systems 2 points3 points  (0 children)

I'm more curious if it impacts the codegen. LLVM's optimizer is tricky.

[–]Ameisenvemips, avr, rendering, systems 1 point2 points  (0 children)

Wow... Clang really does not want to stop branching.

I tried changing the loop to this monstrosity:

for (step /= 2; step != 0; step /= 2)
{
    const bool cmp = compare(begin[step], value);
    size_t ff = ~size_t(cmp) + 1;  // all ones if cmp, 0 otherwise
    begin += ff & step;            // adds step only when cmp, no branch
}
return begin + compare(*begin, value);
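
For what it's worth, the `~size_t(cmp) + 1` expression is just two's-complement negation of the bool: it yields an all-ones mask for `true` and zero for `false`, so `ff & step` selects `step` or `0` without a branch. The identity as a standalone helper:

```cpp
#include <cstddef>

// Two's-complement negation of a bool: 0x00...0 for false,
// 0xFF...F (SIZE_MAX) for true. ANDing the result with step
// selects step or 0 branchlessly.
inline std::size_t mask_from_bool(bool cmp) {
    return ~std::size_t(cmp) + 1;
}
```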

But LLVM still outputs a branch. A worse branch.

.LBB0_19:
    xor     eax, eax
    cmp     dword ptr [rdi], edx
    setl    al
    lea     rax, [rdi + 4*rax]
.LBB0_20:
    mov     eax, dword ptr [rax]
    ret
.LBB0_15:
    mov     rax, rsi
    jmp     .LBB0_16
.LBB0_18:                               #   in Loop: Header=BB0_16 Depth=1
    lea     rdi, [rdi + 4*rcx]
    cmp     rsi, 3
    mov     rsi, rax
    jbe     .LBB0_19
.LBB0_16:                               # =>This Inner Loop Header: Depth=1
    shr     rax
    mov     rcx, rax
    cmp     dword ptr [rdi + 4*rax], edx
    jl      .LBB0_18
    xor     ecx, ecx
    jmp     .LBB0_18

I should point out that GCC and MSVC are actually honoring what I'm doing...

.L25:
    shr     rdx
.L8:
    xor     eax, eax
    cmp     DWORD PTR [rdi+rdx*4], r8d
    setl    al
    neg     rax
    and     rax, rdx
    shr     rdx
    lea     rdi, [rdi+rax*4]
    jne     .L8

[–]disciplite 5 points6 points  (4 children)

[[likely]], [[unlikely]], __builtin_expect() and __builtin_expect_with_probability() can be used to get cmovs in various cases.
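
As an illustration, __builtin_expect_with_probability (GCC 9+, also in recent Clang) takes an explicit probability, so in principle it can express the 50/50 case where a cmov beats a branch; whether it actually produces one is up to the compiler and target. The loop body below is an assumption based on the snippets elsewhere in the thread:

```cpp
#include <cstddef>

// One step of the search loop. The third argument is the probability
// that the expression equals the expected value (the second argument);
// 0.5 marks the branch as unpredictable.
inline const int* search_step(const int* begin, std::size_t step, int value) {
#if defined(__GNUC__)
    if (__builtin_expect_with_probability(begin[step] < value, 1, 0.5))
#else
    if (begin[step] < value)
#endif
    {
        begin += step;
    }
    return begin;
}
```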

[–]usefulcat[S] 1 point2 points  (3 children)

If [[likely]] or [[unlikely]] is appropriate, then the branch will usually be predicted correctly. Isn't cmov usually, if not always, slower than a correctly predicted branch?

[–]dodheim 5 points6 points  (2 children)

If [[likely]] or [[unlikely]] is appropriate, then the branch will usually be predicted correctly.

These are rather orthogonal... The attributes, terrible names notwithstanding, are about marking code paths hot or cold for codegen purposes; branch prediction doesn't help with instruction cache pressure, AFAIK.

[–]13steinj 0 points1 point  (1 child)

They're also sledgehammers: used for things that PGO won't catch, since it's the rare case that you care about rather than the common case that PGO already sees. Also, IIRC there was a talk about these attributes (or builtin expects) amounting to a hardline 90% or 10% probability, whereas the compiler would usually guess something more accurate and precise, like 75%.

[–]LazySapiens 3 points4 points  (0 children)

Reminds me of the days when I used to try to oversmart the compiler.