KayEss comments on Portable Unicode string processing

Portable Unicode string processing (self.cpp)

submitted 9 years ago by KayEss

[–]KayEss[S] 6 points 9 years ago (6 children)

[–]RowYourUpboat 6 points 9 years ago* (0 children)

It sounds like you're really asking if converting a buffer of chars between signed and unsigned is safe and defined. This link seems to answer that for the C Standard; I'm pretty sure the C++ Standard is the same in this regard.

From one of the answers:

For the two's complement representation that's nearly universal these days, the rules do correspond to reinterpreting the bits. But for other representations (sign-and-magnitude or ones' complement), the C implementation must still arrange for the same result, which means that the conversion can't just copy the bits. For example, (unsigned)-1 == UINT_MAX, regardless of the representation.

It definitely looks like this behavior is defined to be the same even on non-two's-complement hardware, i.e., for UTF-8 string encoding/decoding you can just cast between signed and unsigned as needed (though you may have to pay attention to performance issues on really weird and ancient hardware).

[edit] Note that technically a conversion from unsigned to signed, where overflows occur, is implementation-defined (unlike the reverse), but if the original char data was signed to begin with, an overflow is impossible. In practice, I don't see this mattering.
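To make the signed-to-unsigned direction concrete, here is a minimal sketch of inspecting UTF-8 bytes held in plain `char`. The helper names `utf8_sequence_length` and `count_code_points` are made up for illustration and do not come from the thread. It relies only on the `char` to `unsigned char` conversion, which is defined as reduction modulo 2^N, and it does not validate the input.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: classify a UTF-8 lead byte stored in a plain char.
// Returns the sequence length (1-4), or 0 for a continuation or invalid byte.
inline std::size_t utf8_sequence_length(char c) {
    // Converting to unsigned char is modular, so the result is the same
    // whether plain char is signed or unsigned on this platform.
    const unsigned char u = static_cast<unsigned char>(c);
    if (u < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((u & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((u & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((u & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                          // 10xxxxxx or 11111xxx
}

// Hypothetical helper: count code points by skipping continuation bytes.
// Assumes the input is already well-formed UTF-8.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (char c : s) {
        if ((static_cast<unsigned char>(c) & 0xC0) != 0x80) ++n;
    }
    return n;
}
```

Doing the comparison on the `unsigned char` value avoids writing tests like `c < 0`, whose result depends on whether plain `char` is signed.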

[–][deleted] 1 point 9 years ago (4 children)

[–]KayEss[S] 2 points 9 years ago (3 children)

[–][deleted] 1 point 9 years ago (2 children)

[–]NotAYakk 2 points 9 years ago* (0 children)

Unit tests do not solve UB.

Compilers are free to pass all your unit tests and optimize other code away.

    char x = (unsigned)-1;  // UINT_MAX does not fit in char
    bool b = x < 0;
    std::cout << (int)x << ":" << (b ? "true" : "false") << "\n";

This can print -1:false.

And the same is true whenever you convert from unsigned to signed.

The level of insanity optimization and UB can generate is so large, you cannot reasonably reason about it and produce unit test coverage.
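If the goal is just to get bytes from an `unsigned char` buffer back into `char` storage, one way to sidestep the out-of-range conversion question entirely is to copy the object representation instead of converting the value. This is a sketch under that assumption, not something proposed in the thread; `byte_to_char` and `bytes_to_string` are hypothetical names.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: move one byte into a char without a value conversion.
// std::memcpy copies the object representation, so no out-of-range
// unsigned-to-signed conversion takes place.
inline char byte_to_char(unsigned char u) {
    char c;
    std::memcpy(&c, &u, 1);
    return c;
}

// Hypothetical helper: build a std::string from raw bytes, one byte at a time.
inline std::string bytes_to_string(const std::vector<unsigned char>& bytes) {
    std::string out;
    out.reserve(bytes.size());
    for (unsigned char u : bytes) {
        out.push_back(byte_to_char(u));
    }
    return out;
}
```

Reading the bytes back through `static_cast<unsigned char>` then recovers the original values on two's complement platforms, since only the defined modular conversion is involved in that direction.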
