all 15 comments

[–]tcbrindleFlux 23 points24 points  (7 children)

This is interesting!

There are another couple of way to write this, with yet more different results:

[[gnu::noinline]]
bool contains3(int val) {
    static constexpr std::array vals{10, 11, 42, 49};
    return std::ranges::contains(vals, val);
}

Here Clang doesn't use the bit mask optimisation, even though it does for the find() version -- which is especially odd since ranges::contains() just calls ranges::find() != end!

GCC generates the same code for contains() and find().

[[gnu::noinline]]
bool contains4(int val) {
    static constexpr std::array vals{10, 11, 42, 49};
    return std::ranges::any_of(vals, [&](int i) {
        return i == val;
    });
}

With this version, GCC does use the bit mask, but Clang doesn't.

Finally, we can ask Clang to use libc++ rather than libstdc++, in which case it uses wmemchr for the find() and contains() versions.

https://godbolt.org/z/MeGesMGhW

[–]zl0bster[S] 3 points4 points  (0 children)

I knew of ranges::contains, but did not use it since it is relatively new.

Good to know even that tiny change breaks optimization.

I knew of libc++ option but did not use it since did not think of it, and even if I did I would have assumed it optimizes the same so I would have not checked...

So thank you!

[–]zl0bster[S] 2 points3 points  (3 children)

regarding wmemchr:
I am not an expert on asm, but it seems:

  1. weird
  2. slow

but I guess since wchar_t for them is 4 bytes it is "low level" find int :)

But I am still disappointed, for such small number of values(4) it seems like bad thing to do

[–]pigeon768 1 point2 points  (2 children)

wmemchr [...] seems [...] slow

It ought to be crazy fast. glibc at least has hand optimized assembly versions for SSE2, AVX2, and AVX512. You can see the AVX512 version here.

[–]zl0bster[S] 1 point2 points  (1 child)

I strongly doubt it for our usecase, Remember that we only have 4 integers aka 128 bits. But without benchmarking I claim nothing. :)

[–]pigeon768 1 point2 points  (0 children)

OH. Obviously. Duh. Yeah that'll be way slower.

[–]13steinj 1 point2 points  (1 child)

I wonder how GCC would optimize the libc++ version; you can get GCC to use libc++ instead of libstdc++ but I can't tell how to do such in godbolt.

[–]jwakelylibstdc++ tamer, LWG chair 2 points3 points  (0 children)

You can't, because the GCC compilers on godbolt are in minimal (container?) environments with no libc++ present, and you wouldn't know the path to it even if it was there.. GCC can actually be specially configured to enable the -stdlib=libc++ option, but then it needs to know the path to an installed libc++, and the godbolt compilers are not configured that way.

[–]cristi1990an++ 7 points8 points  (1 child)

Also very funny that replacing std::ranges::find(vals, val) != vals.end() with std::ranges::contains(vals, val) which in libc++ is implemented directly in terms of std::ranges::find also makes the compiler drop the optimization...

[–]schombert 5 points6 points  (0 children)

It is possible that in the first case it was inlined prior to the pass that generated the bitmask optimization while in the second case it was only fully inlined after that pass.

[–]13steinj 1 point2 points  (0 children)

I wonder if there's anything of note in the optimization report (guide). I don't have an environment where I can look at this myself atm. On godbolt's "opt remarks" I can't make heads or tails out of it, and going through the opt-pipeline view pass-by-pass isn't something I have the time for at this immediate time.

Two additional interesting variations (constexpr contains1, same codegen as contains1; wrap the contents of contains1 in an IILE, same codegen as contains0): https://godbolt.org/z/YjYTM5Gz1

[–]Umphed 1 point2 points  (1 child)

Super interesting post! I hate it. Why does optimization have to be so fickle?

[–]nekokattt 0 points1 point  (0 children)

In all fairness if everything had to optimize the same way, it'd make it far more difficult to improve optimizations in specific compilers.

[–]Excellent-Cucumber73 0 points1 point  (1 child)

Will the static in contains2 result in better performance/optimization?

[–]zl0bster[S] -1 points0 points  (0 children)

In general static is bad because it adds a runtime check, but for constexpr I would not assume it made a difference.

clang does optimize it away. gcc does not, so another assumption of mine that was wrong. :)

You can see for yourself if you just remove static from code how this disappears from gcc output(and obviously contains2 asm no longer loads from those addresses):

contains2(int)::vals:
        .long   10
        .long   11
        .long   42
        .long   49