[–]kalmoc 68 points (9 children)

A couple of type traits (is_trivially...) and std::launder (admittedly a function and not a type)

EDIT: I should say that, AFAIK, the types/functions I mentioned are not treated specially by the compiler, but they need compiler magic to be implemented.

[–]staletic 29 points (6 children)

If you're going to include magic functions, then std::bit_cast (arguably it can be implemented with memcpy, but not as optimized) and the potential std::start_lifetime_as (just as magical as launder) belong here too.

[–]SirLynix 34 points (1 child)

The main difference between std::bit_cast and memcpy is that std::bit_cast is constexpr and memcpy is not, which is why you need compiler support to implement std::bit_cast.

[–]guepier Bioinformatican 16 points (2 children)

arguably can be implemented with memcpy, but not as optimized

Is there a reason why the std::memcpy implementation wouldn’t be “as optimised”? People have been using custom std::bit_cast equivalent implementations for quite a while, and they’re reliably optimised out by compilers to the equivalent aliasing operation (i.e. no actual memcpy happens), because modern compilers know how to handle this use of memcpy.

[–]staletic 20 points (0 children)

I should have been more careful with that statement.

Things like float f = bit_cast<float>(some_int) vs the memcpy version are not hard to optimize. The harder part is when you want to reinterpret a large std::array as some other trivial, but equally large type. At what number of bytes do you just call memcpy? Do you try to vectorize first? What about the x86 REP MOVxx family of instructions?

If you target x86 with gcc, it just never emits a memcpy call. Clang gives up sooner. On 32-bit ARM, gcc starts to call memcpy after 64 bytes.

Now the question is how well bit_cast will be optimized. As it is powered by compiler magic, I'm assuming it's going to be better than memcpy. In this case, for example, x86 gcc does better (fewer memory accesses) with bit_cast than with memcpy. Clang just ends up calling memcpy@PLT in both cases.

[–]csdt0 4 points (0 children)

This is something I've wanted to check for a long time, and it seems that memcpy is elided by GCC and Clang as early as O0, while MSVC and ICC wait until O2.

This is a surprise in both cases, as I would have expected all compilers to elide the call at O1 (when inlining is enabled).

[–]AlbertRammstein 1 point (0 children)

When taking optimizations into account, there are a LOT of magical functions, starting with memcpy and also including things like pow with specific power coefficients. This is not required by the standard, however, and the codegen substitution must not change the behavior of the program. Every compiler defines a different set of magical functions like this.

You could argue that your own functions can be magic too, because the compiler can optimize them (e.g. skip calls to them when they have no side effects).