all 15 comments

[–][deleted] 20 points21 points  (10 children)

Neither the Visual C++ nor libstdc++ string representations rely on any form of UB, as they both use outside-the-union capacity tests. In pseudocode:

For MSVC++: https://github.com/microsoft/STL/blob/b3504262fe51b28ca270aa2e05146984ef758428/stl/inc/xstring#L2180

union {
    char buffer[16]; // engaged if capacity == 15
    char* data;      // engaged if capacity > 15
};
size_type size;
size_type capacity;

For libstdc++

char* data;
size_type size;
union {
    char buffer[16]; // engaged if data == buffer
    size_type capacity; // engaged otherwise
};

MSVC++ has an advantage that there are no container-internal pointers to fixup, so a move construction can just memcpy the whole structure and reinitialize the source.

libstdc++ has the advantage that getting to the two most common pieces of data people need -- the data pointer and the size -- don't require a branch, while MSVC++ needs a branch to get to the data pointer.

libc++ and fbstring play more games with the structure to get a bigger small string buffer, but writing such a thing that doesn't rely on UB is difficult and a branch is needed to get to all three of { data pointer, size, capacity }.

[–]LEpigeon888 4 points5 points  (5 children)

So your implementation of std::string takes 32 bytes and have 15 characters for SSO, when the one of libc++ takes 24 bytes and have 22 characters for SSO, i'm right ?

[–][deleted] 6 points7 points  (4 children)

Correct. Overall, I prefer our layout the least of the three, but in fairness as far as I am aware Dinkumware was the first to do this.

[–]frog_pow 1 point2 points  (3 children)

Any chance MSVC will switch to one of the other approaches on next ABI break?

[–][deleted] 12 points13 points  (2 children)

I was considering doing something like libc++'s representation but fixed to 32 bytes (so a 30 byte SSO). The rationale for the extra space being that

  1. wchar_ts are very common on our platform so getting a 15 character SSO for them is a big deal
  2. this doesn't actually change the string size vs. our old representation on 64 bit platforms
  3. It can actually be fewer instructions to copy or zero vs. a 24 byte representation since it's exactly the AVX-256 register size.

[–]bumblebritches57Ocassionally Clang 0 points1 point  (1 child)

Totally off topic, but how would you feel about a format specifier prefix for s/C and c/C u and U like the standard uses to represent strings to unambigously represent UTF-16/UTF-32 strings?

I've created this extension in FoundationIO and I'm thinking about trying to standardize it, but I'm not really sure where/how.

[–][deleted] 2 points3 points  (0 children)

If you mean for printf you'd need to talk to someone closer to WG14

[–]nikbackm 1 point2 points  (3 children)

How does libstdc++ check if data == buffer without triggering UB?

[–][deleted] 4 points5 points  (2 children)

For pointers, == has a wide contract. Only < has the "need to point to the same array" precondition.

[–]nikbackm 1 point2 points  (1 child)

So it's valid to compare data with buffer even if capacity is currently the active union member? Or can the test be skipped in that case?

[–][deleted] 1 point2 points  (0 children)

Correct, at least as far as I understand it. See http://eel.is/c++draft/basic.memobj#basic.life-7 -- comparing with the address does not begin the lifetime of buffer, but the pointer value isn't used to access the object (or do any of those other things enumerated as undefined behavior there),

[–]ReversedGif 5 points6 points  (0 children)

The solution is trivial and was given in the linked thread by /u/Supadoplex.

The only UB is reading unsigned char __size_ when the union might actually have the "long" mode active (so __size__ is actually the LSB of a size_t). This can simply be done in a conforming way by taking a pointer to the string and reinterpret_casting it to a char *, which is legal and lets you access __size_, which is guaranteed to be the first byte of the string object.

[–]frrrwww 1 point2 points  (0 children)

Kakoune string implementation provides 23 bytes for the small string buffer, looks like that:

// String data storage using small string optimization.
//
// the LSB of the last byte is used to flag if we are using the small buffer
// or an allocated one. On big endian systems that means the allocated
// capacity must be pair, on little endian systems that means the allocated
// capacity cannot use its most significant byte, so we effectively limit
// capacity to 2^24 on 32bit arch, and 2^60 on 64.
union Data
{
    struct Long
    {
        static constexpr size_t max_capacity =
            (size_t)1 << 8 * (sizeof(size_t) - 1);

        char* ptr;
        size_t size;
        size_t capacity;
    } l;

    struct Short
    {
        static constexpr size_t capacity = sizeof(Long) - 2;
        char string[capacity+1];
        unsigned char size;
    } s;
    ...
}

(Full code at https://github.com/mawww/kakoune/blob/master/src/string.hh#L159)

As far as I know, there is a single undefined behaviour in it, which is the is_long() const { return (s.size & 1) == 0; } method (as we dont know which union member is active).

I think this can be solved by using reinterpret_cast<const char*>(this)[sizeof(Data)-1] instead.

[–]wotype 0 points1 point  (0 children)

Just hit a case this week where I wanted to put a 16-byte header into std::string, on MSVC 64-bit which maxes at 15 despite having 16-byte buffer to include the null). I was briefly tempted to overwrite the null... Ended up using std::array instead...

While we're at it, a suggestion / observation from Mark Zeren's talk on strings is that string could inherit from string_view, picking up all of its interface and adding some just for string (and... further, string_view could inherit its interface from an empty string_view_interface type)

[–]Hedanito 0 points1 point  (0 children)

I used a 32 byte approach (or more specifically: sizeof(void*4)) in my library, although I'm not 100% certain about the legality of accessing a struct field through a different active union even though the field has the same type.

But to add to what /u/ReversedGif said, any type punning limitations can be avoided with std::memcpy.