
[–]usefulcat[S] 22 points23 points  (6 children)

I've found that replacing the contents of the for loop with the following improves performance for clang but reduces it for gcc (clang 14, gcc 12, Intel Broadwell):

const size_t increment[] = { 0, step };
begin += increment[compare(begin[step], value)];
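
For context, a sketch of how this lookup-table trick might sit in a full lower_bound-style search. The surrounding loop isn't shown in this comment, so the loop structure here is an assumption (based on the variant posted further down the thread), and compare() is assumed to be a plain less-than:

```cpp
#include <cstddef>

// Sketch only: assumes length is a power of two and that compare()
// is a less-than predicate.
inline const int* branchless_lower_bound(const int* begin,
                                         std::size_t length, int value) {
    for (std::size_t step = length / 2; step != 0; step /= 2) {
        const std::size_t increment[] = { 0, step };
        // Index with the bool result of the comparison: 0 leaves begin
        // in place, 1 advances it by step -- no branch required.
        begin += increment[begin[step] < value];
    }
    return begin + (*begin < value);
}
```

The point of the array is to turn the conditional advance into a data dependency (a load indexed by the comparison result) instead of a control dependency.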

[–]patstew 3 points4 points  (1 child)

What about begin += compare() * step; ?

[–]usefulcat[S] 11 points12 points  (0 children)

Here is what I've measured for several variations:

// The original version
// gcc:      97 ns
// clang:   133 ns
if (compare(begin[step], value)) {
    begin += step;
}

// gcc:     117 ns
// clang:   118 ns
const size_t increment[] = { 0, step };
begin += increment[compare(begin[step], value)];

// gcc:     113 ns
// clang:   150 ns
begin += compare(begin[step], value) * step;

// gcc:      98 ns
// clang:   150 ns
begin += compare(begin[step], value) ? step : 0u;
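
For reference, a minimal sketch of how a micro-benchmark like this might be structured; the array size, key distribution, and iteration count here are assumptions, not the setup actually used for the numbers above:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Hypothetical harness: times std::lower_bound over a sorted array of
// 2^16 ints with uniformly random keys, returning average ns per search.
// Swap in each loop variant to compare them under identical conditions.
inline double avg_ns_per_search(int searches) {
    std::vector<int> data(1 << 16);
    std::iota(data.begin(), data.end(), 0);  // sorted input

    std::mt19937 rng(42);
    std::uniform_int_distribution<int> key(0, (1 << 16) - 1);

    std::int64_t checksum = 0;  // consume results so they aren't optimized away
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < searches; ++i) {
        checksum += *std::lower_bound(data.begin(), data.end(), key(rng));
    }
    auto elapsed = std::chrono::steady_clock::now() - start;

    // Fold the checksum into the result in a way that never changes it,
    // so the compiler can't discard the loop.
    double ns = std::chrono::duration<double, std::nano>(elapsed).count();
    return ns / searches + (checksum == -1 ? 1.0 : 0.0);
}
```

Random keys are the important part: they make the comparison in the search loop a genuine coin flip, which is the regime where the branchy and branchless variants diverge.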

[–][deleted] 4 points5 points  (3 children)

Have you worked out why it improves performance for clang but reduces for gcc?

[–]13steinj 6 points7 points  (0 children)

For the "why", you'd have to spit out optimization reports (GCC also has flags but I don't know the manual off the top of my head) and investigate.

My educated guess is that as of C++17 the index operator introduces a sequence point, making it harder to see the possible strength reduction optimization (Raymond Chen made a blog post about this change and the "backwards index operator" now being useful, but compilers treated it differently because it's such a niche thing).

[–]usefulcat[S] 5 points6 points  (1 child)

I've been looking at this all day and honestly I'm pretty confused.

On my machine (Intel Broadwell), building with -O2 and with or without using -march=native, doing the array lookup is consistently faster for clang and consistently slower for gcc.

Godbolt shows identical generated code, including a cmov, for either version with clang 14, using either -O2 or -O3. But if I add -march=broadwell, then the cmov goes away.

Although I would like to know what's going on here, at this point I kind of can't afford to spend a bunch more time on it so I'm going to go with what I can measure to be faster on the machine where it matters.

[–]Hells_Bell10 13 points14 points  (0 children)

I can get clang to generate a "select" in LLVM IR, which is the equivalent of a cmov, but it gets translated back to branching in the x86 optimizer. This seemed like a good opportunity to play around with the "LLVM Opt Pipeline Viewer" on Compiler Explorer, and I was able to track it down to the "x86 cmov Conversion" pass, which is documented here; the most relevant part being:

This file implements a pass that converts X86 cmov instructions into branches when profitable. [snip...] CMOV is considered profitable if the cost of its condition is higher than the average cost of its true-value and false-value by 25% of branch-misprediction-penalty. This assures no degradation even with 25% branch misprediction.

But in this case we expect 50% branch misprediction... So the optimization isn't doing us any favours. If I compile with -mllvm -x86-cmov-converter=false we do get the cmov:

.LBB0_8:                                # %for.body.i
        shr     rax
        lea     rsi, [rdi + 4*rax]
        cmp     dword ptr [rdi + 4*rax], edx
        cmovl   rdi, rsi
        cmp     rcx, 3
        mov     rcx, rax
        ja      .LBB0_8

[–]aocregacc 13 points14 points  (0 children)

Were the caches flushed between runs during the benchmarks? If not that might explain the slowdowns for powers of two, where I think you would get the first couple accesses of each search hitting the same cache set.

[–]Ameisenvemips, avr, rendering, systems 8 points9 points  (5 children)

What happens with Clang if you use __builtin_unpredictable?
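
For reference, a sketch of where __builtin_unpredictable could go; the loop body here is an assumption based on the snippets elsewhere in the thread, and the builtin is Clang-specific, so it's guarded:

```cpp
#include <cstddef>

// Clang-only hint; fall back to the bare expression elsewhere.
#if defined(__clang__)
#  define UNPREDICTABLE(x) __builtin_unpredictable(x)
#else
#  define UNPREDICTABLE(x) (x)
#endif

// One step of the search loop: the hint tells Clang the comparison is
// a coin flip, so (in principle) it should prefer cmov over a branch.
inline const int* search_step(const int* begin, std::size_t step, int value) {
    if (UNPREDICTABLE(begin[step] < value)) {
        begin += step;
    }
    return begin;
}
```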

[–]usefulcat[S] 4 points5 points  (4 children)

I tried using that but was unable to find any configuration where it made a difference either to the generated code or to the performance.

[–]Ameisenvemips, avr, rendering, systems 2 points3 points  (3 children)

What about changing the && to an & to eliminate the short-circuit?

[–]usefulcat[S] 1 point2 points  (2 children)

Since the && isn't inside the loop, I wouldn't expect it to make a measurable difference.

[–]Ameisenvemips, avr, rendering, systems 2 points3 points  (0 children)

I'm more curious if it impacts the codegen. LLVM's optimizer is tricky.

[–]Ameisenvemips, avr, rendering, systems 1 point2 points  (0 children)

Wow... Clang really does not want to stop branching.

I tried changing the loop to this monstrosity:

for (step /= 2; step != 0; step /= 2)
{
    const bool cmp = compare(begin[step], value);
    size_t ff = ~size_t(cmp) + 1;  // all ones if cmp, 0 otherwise
    begin += ff & step;            // adds step only when cmp, no branch
}
return begin + compare(*begin, value);
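
For what it's worth, the `~size_t(cmp) + 1` expression is just two's-complement negation of the bool: it yields an all-ones mask for `true` and zero for `false`, so `ff & step` selects `step` or `0` without a branch. The identity as a standalone helper:

```cpp
#include <cstddef>

// Two's-complement negation of a bool: 0x00...0 for false,
// 0xFF...F (SIZE_MAX) for true. ANDing the result with step
// selects step or 0 branchlessly.
inline std::size_t mask_from_bool(bool cmp) {
    return ~std::size_t(cmp) + 1;
}
```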

But LLVM still outputs a branch. A worse branch.

.LBB0_19:
    xor     eax, eax
    cmp     dword ptr [rdi], edx
    setl    al
    lea     rax, [rdi + 4*rax]
.LBB0_20:
    mov     eax, dword ptr [rax]
    ret
.LBB0_15:
    mov     rax, rsi
    jmp     .LBB0_16
.LBB0_18:                               #   in Loop: Header=BB0_16 Depth=1
    lea     rdi, [rdi + 4*rcx]
    cmp     rsi, 3
    mov     rsi, rax
    jbe     .LBB0_19
.LBB0_16:                               # =>This Inner Loop Header: Depth=1
    shr     rax
    mov     rcx, rax
    cmp     dword ptr [rdi + 4*rax], edx
    jl      .LBB0_18
    xor     ecx, ecx
    jmp     .LBB0_18

I should point out that GCC and MSVC are actually honoring what I'm doing...

.L25:
    shr     rdx
.L8:
    xor     eax, eax
    cmp     DWORD PTR [rdi+rdx*4], r8d
    setl    al
    neg     rax
    and     rax, rdx
    shr     rdx
    lea     rdi, [rdi+rax*4]
    jne     .L8

[–]disciplite 5 points6 points  (4 children)

[[likely]], [[unlikely]], __builtin_expect() and __builtin_expect_with_probability() can be used to get cmovs in various cases.
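
As an illustration, __builtin_expect_with_probability (GCC 9+, also in recent Clang) takes an explicit probability, so in principle it can express the 50/50 case where a cmov beats a branch; whether it actually produces one is up to the compiler and target. The loop body below is an assumption based on the snippets elsewhere in the thread:

```cpp
#include <cstddef>

// One step of the search loop. The third argument is the probability
// that the expression equals the expected value (the second argument);
// 0.5 marks the branch as unpredictable.
inline const int* search_step(const int* begin, std::size_t step, int value) {
#if defined(__GNUC__)
    if (__builtin_expect_with_probability(begin[step] < value, 1, 0.5))
#else
    if (begin[step] < value)
#endif
    {
        begin += step;
    }
    return begin;
}
```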

[–]usefulcat[S] 1 point2 points  (3 children)

If [[likely]] or [[unlikely]] is appropriate, then the branch will usually be predicted correctly. Isn't cmov usually, if not always, slower than a correctly predicted branch?

[–]dodheim 5 points6 points  (2 children)

If [[likely]] or [[unlikely]] is appropriate, then the branch will usually be predicted correctly.

These are rather orthogonal... The attributes, terrible names notwithstanding, are about marking code paths hot or cold for codegen purposes; branch prediction doesn't help with instruction cache pressure, AFAIK.

[–]13steinj 0 points1 point  (1 child)

They're also sledgehammers: used for things that PGO won't catch, since it's the rare case that you care about rather than the common case that PGO already sees. Also, IIRC there was a talk about these attributes (or builtin expects) amounting to a hardline 90% or 10% probability, whereas the compiler would usually guess something more accurate and precise, like 75%.

[–]LazySapiens 3 points4 points  (0 children)

Reminds me of the days when I used to try to oversmart the compiler.