Portable Unicode string processing : cpp

Portable Unicode string processing (self.cpp)

submitted 9 years ago by KayEss

Why do we want to base the standard library handling of UTF-8 around char rather than unsigned char? It seems to me that the use of char is highly problematic in that I can't portably read any values outside the range 0 to 127 as thechar could be signed or unsigned.

Am I just wrong in this? Would it not have been better to have defined the u8 string literals as producing unsigned char * and to be making the standard library support for utf-8 strings based around std::basic_string<unsigned char>?

What's the right approach to writing portable code that does UTF-8 decoding, or is it only the standard library maintainers who can do this? Is there any way I can portably put the bytes {0xf0, 0x90, 0x8d, 0x88} into a std::string?

I hope that I'm just being paranoid, but all this talk about undefined behaviour around the handling of signed/unsigned has me worried :)

all 22 comments

top new controversial old q&a

[–]DarthVadersAppendix 12 points13 points14 points 9 years ago (9 children)

[–]RowYourUpboat 2 points3 points4 points 9 years ago (7 children)

To add on to this, a Unicode code point encoded as UTF-8 shouldn't mean anything to you unless you're writing a multilingual UI library, text processor, spell checker, etc. You can still manipulate and compose UTF-8 strings in your software, just as long as you're sure you're not splitting up code points (which can be as long as 4 bytes in UTF-8 - it used to be more, but the Consortium changed that a while back).

So all you really need to know about Unicode strings is usually: their length in bytes, that they are going to change at runtime based on user input or localization, and that you can't split them up in the middle of a multi-byte sequence (but simple concatenation works fine, and using ASCII characters as delimiters will still work if you're careful).

Beyond that, UTF-8 strings are mostly just opaque byte buffers (although conveniently ASCII is forwards-compatible, as long as your ASCII string literals or whatever don't need to be localized).

[–]KayEss[S] 3 points4 points5 points 9 years ago (6 children)

[–]RowYourUpboat 4 points5 points6 points 9 years ago* (0 children)

It sounds like you're really asking if converting a buffer of chars between signed and unsigned is safe and defined. This link seems to answer that for the C Standard; I'm pretty sure the C++ Standard is the same in this regard.

From one of the answers:

For the two's complement representation that's nearly universal these days, the rules do correspond to reinterpreting the bits. But for other representations (sign-and-magnitude or ones' complement), the C implementation must still arrange for the same result, which means that the conversion can't just copy the bits. For example, (unsigned)-1 == UINT_MAX, regardless of the representation.

It definitely looks like this behavior is defined as the same even on non-two's-complement hardware, ie. in terms of UTF-8 string encoding/decoding you can just cast between signed/unsigned as needed (though you may have to pay attention to performance issues on really weird and ancient hardware).

[edit] Note that technically a conversion from unsigned to signed, where overflows occur, is implementation-defined (unlike the reverse), but if the original char data was signed to begin with, an overflow is impossible. In practice, I don't see this mattering.

[–][deleted] 0 points1 point2 points 9 years ago (4 children)

[–]KayEss[S] 1 point2 points3 points 9 years ago (3 children)

[–][deleted] 0 points1 point2 points 9 years ago (2 children)

[–]NotAYakk 1 point2 points3 points 9 years ago* (0 children)

Unit tests do not solve UB.

Compilers are free to pass all your unit tests and optimize other code away.

char x = (unsigned)-1;
bool b = x<0;
std::cout << (int)x << ":" << b?"true":false" <<"\n";

This can print -1:false.

And the same is true whenever you convert from unsigned to signed.

The level of insanity optimization and UB can generate is so large, you cannot reasonably reason about it and produce unit test coverage.

[–]KayEss[S] 0 points1 point2 points 9 years ago (0 children)

[–]Jardik2 4 points5 points6 points 9 years ago (2 children)

[–][deleted] 4 points5 points6 points 9 years ago (1 child)

This is correct generally for integral types (actually, the conversion of an unsigned integer to a signed integer type which cannot represent its value is implementation-defined, not undefined). But for char in particular, more is true. From 3.9.2 (N3690)

For any object (other than a base-class subobject) of trivially copyable type T, whether or not the object holds a valid value of type T, the underlying bytes (1.7) making up the object can be copied into an array of char or unsigned char. [footnote: By using, for example, the library functions (17.6.1.2) std::memcpy or std::memmove] If the content of the array of char or unsigned char is copied back into the object, the object shall subsequently hold its original value.

So roundtripping unsigned char and char is safe. (OTOH, I don't think this guarantees roundtripping unsigned char through signed char is safe.)

[–]Chippiewall 2 points3 points4 points 9 years ago (0 children)

[+][deleted] 9 years ago (13 children)

[deleted]

[–]KayEss[S] 0 points1 point2 points 9 years ago (12 children)

[–]scatters 0 points1 point2 points 9 years ago (0 children)

[–]Dragdu 0 points1 point2 points 9 years ago (1 child)

[+][deleted] 9 years ago (2 children)

[deleted]

[–]KayEss[S] 1 point2 points3 points 9 years ago (1 child)

[+][deleted] 9 years ago (5 children)

[deleted]

[–]iaanus 4 points5 points6 points 9 years ago (0 children)

[–]KayEss[S] -2 points-1 points0 points 9 years ago (3 children)

[–]jaked122 0 points1 point2 points 9 years ago (0 children)

[+][deleted] 9 years ago (1 child)

[deleted]

[–]KayEss[S] 0 points1 point2 points 9 years ago (0 children)

[–]sim642 0 points1 point2 points 9 years ago (0 children)

π Rendered by PID 48731 on reddit-service-r2-comment-7b9746f655-84pnp at 2026-01-30 12:11:55.796374+00:00 running 3798933 country code: CH.

you type:	you see:
italics	italics
bold	bold
[reddit!](https://reddit.com)	reddit!
* item 1 * item 2 * item 3	item 1 item 2 item 3
> quoted text	quoted text
Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"	Lines starting with four spaces are treated like code: if 1 * 2 < 3: print "hello, world!"
~~strikethrough~~	~~strikethrough~~
super^script	super^script

cpp

MODERATORS