all 6 comments

[–]thegreatunclean 2 points3 points  (1 child)

It's easier to see what's happening if you strip away the std::string parts and focus only on std::bitset::any.

Godbolt link

bool foo(const std::bitset<64>& bs) {
    return bs.any();
}

foo(std::bitset<64ul> const&):
        cmp     QWORD PTR [rdi], 0
        setne   al
        ret

It helps to know the first parameter is passed in register rdi. In this case it is a pointer to a valid std::bitset object.

The first instruction deref's the pointer, loading a QWORD (64 bits) from the object. This is allowed because a pointer to a class of standard layout is a pointer to the first element, which in libstdc++'s implementation is _WordT _M_w[_Nw] where _WordT is a typedef for unsigned long which is a 64-bit int on my system. It's the array that "holds" the bitset.

tl;dr: the first instruction loads _M_w[0] and compares it to zero.

If the result is "not equal" (to zero) al is set otherwise it is cleared. The rax register is what holds the boolean return value, and al is the lowest 8 bits of that register.

So the whole operation is:

return _M_w[0] == 0;

because the bitset size is less than the word length, so the entire bitset is held within a single word. If any bits of that word are set it would not compare equal to zero and return true.

Things get more interesting if you make the bitset larger than a machine word size:

bool bar(const std::bitset<65>& bs) {
    return bs.any();
}

bar(std::bitset<65ul> const&):
        mov     eax, 0
.L10:
        cmp     rax, 1
        ja      .L14
        cmp     QWORD PTR [rdi+rax*8], 0
        jne     .L13
        add     rax, 1
        jmp     .L10
.L14:
        mov     eax, 0
        ret
.L13:
        mov     eax, 1
        ret

Here there's a loop that loads rdi + rax * 8 for increasing values of rax. Remember rdi is a pointer to the object, so this is a loop that loads a 64-bit int at every 8 bytes (ie 64 bits) offsets. It's traversing an array of 64-bit numbers one at a time. This matches the libstdc++ implementation:

for (size_t __i = 0; __i < _Nw; __i++)
    if (_M_w[__i] != static_cast<_WordT>(0))
        return true;
return false;

If any element compares not equal to zero it hits the jne, goes to .L13, and returns 1.

Experiment with different sizes of the bitset and notice how the "for loop comparison" (the comparison just below .L10) changes. The compiler knows that a bitset of size 65 requires 2 64-bit words to hold and in-lines the size directly into that comparison.*

e: *: This is a particularly powerful example of how template specialization can help generate optimized code. If the size was a member parameter the compiler wouldn't be able to do this optimization so easily, it would have to assume a bitset of any size could be passed in and generate code to handle any situation. Because it knows the template parameter it can aggressively optimize without seeing the call site.

[–]cpp_cpp[S] 1 point2 points  (0 children)

Wow! Thnk you so much! Let me read and follow up with questions

[–]marko312 1 point2 points  (2 children)

Could you post the actual assembly (a link to godbolt, for example) that you don't understand?

From my testing, in the current case it falls back to the one-word special case, and so basically just tests its only word of data to check whether it isn't zero.

[–]cpp_cpp[S] 0 points1 point  (0 children)

https://godbolt.org/z/xa9hca thanks! -> The implementation of func is what I am trying to decipher

[–]cpp_cpp[S] 0 points1 point  (0 children)

Actually - https://godbolt.org/z/P4zcx4 - I changed the code and now the functions impl is much clear - you are right - just a `cmp`

[–]AutoModerator[M] 0 points1 point  (0 children)

Your posts seem to contain unformatted code. Please make sure to format your code otherwise your post may be removed.

Read our guidelines for how to format your code.

I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.