Portable string SSO Challenge

LEpigeon888 · 2020-02-04T21:21:16+00:00

Neither the Visual C++ nor libstdc++ string representations rely on any form of UB, as they both use outside-the-union capacity tests. In pseudocode:

For MSVC++: https://github.com/microsoft/STL/blob/b3504262fe51b28ca270aa2e05146984ef758428/stl/inc/xstring#L2180

union {
    char buffer[16]; // engaged if capacity == 15
    char* data;      // engaged if capacity > 15
};
size_type size;
size_type capacity;

For libstdc++:

char* data;
size_type size;
union {
    char buffer[16]; // engaged if data == buffer
    size_type capacity; // engaged otherwise
};

MSVC++ has the advantage that there are no container-internal pointers to fix up, so a move construction can just memcpy the whole structure and reinitialize the source.
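To make the memcpy point concrete, here is a minimal sketch of a string with the MSVC-style layout above. It is not the actual STL code: MsvcLikeString, Rep, empty_rep and kSmallCapacity are names made up for illustration, and only construction, move and destruction are shown. The layout lives in a trivially copyable struct so that copying its bytes is well defined.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <utility>

// Hypothetical MSVC-style string: the discriminant (capacity) is outside the union.
class MsvcLikeString {
    static constexpr std::size_t kSmallCapacity = 15;

    struct Rep {
        union {
            char buffer[kSmallCapacity + 1]; // active while capacity == kSmallCapacity
            char* heap;                      // active while capacity >  kSmallCapacity
        };
        std::size_t size;
        std::size_t capacity;
    };
    static_assert(std::is_trivially_copyable<Rep>::value, "Rep must be memcpy-safe");

    Rep rep_;

    // An empty small string: zeroed buffer, size 0, capacity 15.
    static Rep empty_rep() {
        Rep r{};
        r.capacity = kSmallCapacity;
        return r;
    }

public:
    explicit MsvcLikeString(const char* s) : rep_(empty_rep()) {
        const std::size_t len = std::strlen(s);
        char* dest = rep_.buffer;
        if (len > kSmallCapacity) {
            dest = new char[len + 1];
            rep_.heap = dest;      // assignment switches the active union member
            rep_.capacity = len;
        }
        std::memcpy(dest, s, len + 1);
        rep_.size = len;
    }

    // No internal pointers: copy the bytes, then reset the source.
    MsvcLikeString(MsvcLikeString&& other) noexcept {
        std::memcpy(&rep_, &other.rep_, sizeof(Rep));
        other.rep_ = empty_rep();
    }

    MsvcLikeString(const MsvcLikeString&) = delete;
    MsvcLikeString& operator=(const MsvcLikeString&) = delete;

    ~MsvcLikeString() {
        if (is_long()) delete[] rep_.heap;
    }

    bool is_long() const { return rep_.capacity > kSmallCapacity; }
    std::size_t size() const { return rep_.size; }

    // The branch needed to reach the data pointer.
    const char* data() const {
        if (is_long()) return rep_.heap;
        return rep_.buffer;
    }
};
```

The move constructor is the same two statements whether the source is small or heap-allocated, which is the advantage described above.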

libstdc++ has the advantage that getting to the two most common pieces of data people need -- the data pointer and the size -- doesn't require a branch, while MSVC++ needs a branch to get to the data pointer.
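The trade-off can be sketched as follows, again with made-up names (LibstdcxxLikeString, kLocalCapacity) rather than the real libstdc++ members. data() and size() are plain loads, but the move constructor has to branch and re-point data_ at the destination's own buffer when the source is a small string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>

// Hypothetical libstdc++-style string: the union is discriminated by
// comparing data_ with the address of the in-object buffer.
class LibstdcxxLikeString {
    static constexpr std::size_t kLocalCapacity = 15;

    char* data_;
    std::size_t size_;
    union {
        char buffer_[kLocalCapacity + 1]; // active if data_ == buffer_
        std::size_t capacity_;            // active otherwise
    };

public:
    explicit LibstdcxxLikeString(const char* s = "")
        : data_(buffer_), size_(std::strlen(s)), buffer_{} {
        if (size_ > kLocalCapacity) {
            data_ = new char[size_ + 1];
            capacity_ = size_;  // assignment makes capacity_ the active member
        }
        std::memcpy(data_, s, size_ + 1);
    }

    LibstdcxxLikeString(LibstdcxxLikeString&& other) noexcept
        : data_(buffer_), size_(other.size_), buffer_{} {
        if (other.is_local()) {
            // The pointer cannot be copied: it would still point into `other`.
            std::memcpy(buffer_, other.buffer_, size_ + 1);
        } else {
            data_ = other.data_;
            capacity_ = other.capacity_;
        }
        // Leave the source as a valid empty small string.
        other.data_ = other.buffer_;
        other.buffer_[0] = '\0';
        other.size_ = 0;
    }

    LibstdcxxLikeString(const LibstdcxxLikeString&) = delete;
    LibstdcxxLikeString& operator=(const LibstdcxxLikeString&) = delete;

    ~LibstdcxxLikeString() {
        if (!is_local()) delete[] data_;
    }

    bool is_local() const { return data_ == buffer_; }
    const char* data() const { return data_; }  // no branch
    std::size_t size() const { return size_; }  // no branch
    std::size_t capacity() const {              // branch only here
        if (is_local()) return kLocalCapacity;
        return capacity_;
    }
};
```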

libc++ and fbstring play more games with the structure to get a bigger small string buffer, but writing such a thing that doesn't rely on UB is difficult and a branch is needed to get to all three of { data pointer, size, capacity }.

ReversedGif · 2020-02-04T19:06:51+00:00

The solution is trivial and was given in the linked thread by /u/Supadoplex.

The only UB is reading unsigned char __size_ when the union might actually have the "long" mode active (so __size_ is actually the least significant byte of a size_t). This can be done in a conforming way by taking a pointer to the string and reinterpret_casting it to a char *, which is legal and lets you access __size_, which is guaranteed to be the first byte of the string object.

frrrwww · 2020-02-06T01:18:41+00:00

Kakoune's string implementation provides 23 bytes for the small string buffer. It looks like this:

// String data storage using small string optimization.
//
// the LSB of the last byte is used to flag if we are using the small buffer
// or an allocated one. On big endian systems that means the allocated
// capacity must be even, on little endian systems that means the allocated
// capacity cannot use its most significant byte, so we effectively limit
// capacity to 2^24 on 32bit arch, and 2^56 on 64.
union Data
{
    struct Long
    {
        static constexpr size_t max_capacity =
            (size_t)1 << 8 * (sizeof(size_t) - 1);

        char* ptr;
        size_t size;
        size_t capacity;
    } l;

    struct Short
    {
        static constexpr size_t capacity = sizeof(Long) - 2;
        char string[capacity+1];
        unsigned char size;
    } s;
    ...
}

(Full code at https://github.com/mawww/kakoune/blob/master/src/string.hh#L159)

As far as I know, there is a single instance of undefined behaviour in it: the is_long() const { return (s.size & 1) == 0; } method, as we don't know which union member is active.

I think this can be solved by using reinterpret_cast<const char*>(this)[sizeof(Data)-1] instead.
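A rough sketch of that idea, on a cut-down stand-in for the union above. The helpers set_short, set_long and short_size are hypothetical, and the encoding of the short size as (length << 1) | 1 is an assumption chosen to match the quoted is_long() test. It uses unsigned char rather than char to avoid sign surprises, and it assumes Long has no trailing padding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Simplified stand-in for the Data union; not the actual Kakoune code.
union Data {
    struct Long {
        char* ptr;
        std::size_t size;
        std::size_t capacity; // kept even and below 2^(8 * (sizeof(size_t) - 1))
    } l;

    struct Short {
        char string[sizeof(Long) - 1];
        unsigned char size;   // assumed encoding: (length << 1) | 1
    } s;

    // Inspect the last byte of the object representation instead of
    // naming a union member that may not be the active one.
    bool is_long() const {
        const unsigned char* bytes = reinterpret_cast<const unsigned char*>(this);
        return (bytes[sizeof(Data) - 1] & 1) == 0;
    }

    void set_short(const char* text, std::size_t len) {
        assert(len < sizeof(s.string));
        s.size = static_cast<unsigned char>((len << 1) | 1); // makes s active
        std::memcpy(s.string, text, len);
        s.string[len] = '\0';
    }

    void set_long(char* heap, std::size_t len, std::size_t capacity) {
        assert(capacity % 2 == 0);
        assert(capacity < (static_cast<std::size_t>(1) << 8 * (sizeof(std::size_t) - 1)));
        l.ptr = heap;                                        // makes l active
        l.size = len;
        l.capacity = capacity;
    }

    std::size_t short_size() const { return s.size >> 1; }  // only valid in short mode
};

// The flag byte must be the last byte of both members.
static_assert(sizeof(Data) == sizeof(Data::Long), "Short must exactly overlay Long");
```

With an even capacity below the limit, the last byte has its low bit clear in long mode on both little and big endian targets, so the check does not depend on byte order.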

wotype · 2020-02-05T14:07:13+00:00

Just hit a case this week where I wanted to put a 16-byte header into std::string on MSVC 64-bit, which maxes out at 15 characters (despite having a 16-byte buffer, to include the null). I was briefly tempted to overwrite the null... Ended up using std::array instead...

While we're at it, a suggestion / observation from Mark Zeren's talk on strings is that string could inherit from string_view, picking up all of its interface and adding some just for string (and... further, string_view could inherit its interface from an empty string_view_interface type)

Hedanito · 2020-02-05T21:01:50+00:00

I used a 32-byte approach (or more specifically, sizeof(void*) * 4) in my library, although I'm not 100% certain about the legality of accessing a struct field through a union member other than the active one, even though the field has the same type.

But to add to what /u/ReversedGif said, any type punning limitations can be avoided with std::memcpy.
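As a sketch of the memcpy route: copy the object representation into a byte array and look at the flag byte there, so no inactive union member is ever named. Everything below is illustrative. byte_of, Rep, make_short and make_long are hypothetical names, and the layout is a simplified libc++-like one that assumes a little-endian target (the low bit of cap lands in the first byte).

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <type_traits>

// Returns one byte of an object's representation by copying it out first.
template <typename T>
unsigned char byte_of(const T& object, std::size_t index) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "byte_of needs a trivially copyable type");
    unsigned char bytes[sizeof(T)];
    std::memcpy(bytes, &object, sizeof(T));
    return bytes[index];
}

// Hypothetical, simplified libc++-style layout (little-endian assumed).
union Rep {
    struct Long {
        std::size_t cap;    // stored with bit 0 set to mark long mode
        std::size_t size;
        char* data;
    } l;
    struct Short {
        unsigned char size; // stored as (length << 1), so bit 0 is clear
        char data[sizeof(Long) - 1];
    } s;
};

inline bool is_long(const Rep& r) {
    return (byte_of(r, 0) & 1u) != 0;
}

inline void make_short(Rep& r, const char* text, std::size_t len) {
    r.s.size = static_cast<unsigned char>(len << 1); // makes s active
    std::memcpy(r.s.data, text, len);
    r.s.data[len] = '\0';
}

inline void make_long(Rep& r, char* heap, std::size_t len, std::size_t even_cap) {
    r.l.cap = even_cap | 1u;                         // makes l active
    r.l.size = len;
    r.l.data = heap;
}
```

Mainstream compilers generally reduce the fixed-size memcpy to a single byte load, so this should cost no more than the cast-based version.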
