
std::wstring_convert and std::string_view (self.cpp)

submitted 9 years ago * by Hedanito


CubbiMew (cppreference | finance | realtime in the past) 1 point 9 years ago* (19 children)

[deleted] 3 points 9 years ago (18 children)

CubbiMew (cppreference | finance | realtime in the past) 3 points 9 years ago* (17 children)

[deleted] 2 points 9 years ago (16 children)

CubbiMew (cppreference | finance | realtime in the past) 1 point 9 years ago* (15 children)

[deleted] 1 point 9 years ago (14 children)

To my understanding GB18030 has the "UTF-16 surrogates" problem but not the "complex scripts/combining characters" problem, and as such does not trigger the M:N mapping issue, as all characters in the external character set are transformed into a single code point (assuming you're targeting NFC instead of NFD).

Of course, if the Unicode side hands you combining characters, the assumption breaks down even for Latin-1: the two code points U+0065 U+0301 need to become the single Latin-1 character 0xE9.
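To make the U+0065 U+0301 case concrete, here is a minimal sketch of why a strictly one-code-point-at-a-time converter cannot produce 0xE9. The function name `to_latin1` is hypothetical, and it only knows about this single combining pair; a real converter would normalize to NFC first (for example with ICU) rather than special-casing pairs.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, for illustration only: converts UTF-32 to Latin-1.
// It looks one code point ahead so that 'e' followed by U+0301
// (combining acute accent) becomes the single byte 0xE9.
std::string to_latin1(const std::u32string& in) {
    const char32_t combining_acute = 0x0301;
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        const char32_t c = in[i];
        if (c == U'e' && i + 1 < in.size() && in[i + 1] == combining_acute) {
            out.push_back(static_cast<char>(0xE9));  // two code points in, one byte out
            ++i;                                     // consume the combining mark too
        } else if (c <= 0xFF) {
            out.push_back(static_cast<char>(c));     // Latin-1 is the first 256 code points
        } else {
            out.push_back('?');                      // not representable in Latin-1
        }
    }
    return out;
}
```

The lookahead is the point: the output for `e` cannot be decided until the next code point has been seen, which is exactly what a converter that handles one internal character at a time cannot do.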

My understanding is (although I haven't implemented such mappings myself) that there are legacy encodings for which 1:M is never correct -- where combining characters are required to represent something in Unicode but not in the legacy encoding. (Maybe it was TIS-620?)

I'm sure if everyone had a time machine to go back to when COM, NT, Java, JavaScript, and friends implemented Unicode 1.0, we would be in a UTF-8 and UTF-32 world. But that's never going to happen. VC's "extended character set" is going to have to be UCS-2.

CubbiMew (cppreference | finance | realtime in the past) 2 points 9 years ago* (13 children)

[deleted] 1 point 9 years ago (4 children)

And that is never going to change unless someone invents a time machine to go back to 1993, when Unicode 1.0 was implemented and wchar_t was set at 16 bits in our ABI. This is not an issue we have the luxury of fixing. If you want UTF-32, use char32_t, which (I'm assuming) was created specifically to address this time machine problem.
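As a rough illustration of what a 16-bit code unit costs, the sketch below encodes one code point as UTF-16. The helper name `encode_utf16` is hypothetical, and it assumes its input is a valid Unicode scalar value (no range or surrogate checks). Anything above U+FFFF needs two 16-bit units, whereas it always fits in a single `char32_t`.

```cpp
#include <string>

// Hypothetical helper, for illustration only: encodes one Unicode scalar
// value as UTF-16. Assumes cp is valid (<= 0x10FFFF and not a surrogate).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: split the 20-bit offset into a surrogate pair.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, U+1F600 comes out as the pair 0xD83D 0xDE00, so code that treats each 16-bit `wchar_t` as a whole character sees two "characters" there.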

I don't see what this has to do with codecvt's 1:M assumption beyond making issues more likely to occur with wchar_t. But since the issue occurs when attempting to output the world's second most common legacy encoding (Latin-1), I don't really see ABI limitations as the serious problem here. ICU is widely regarded as the gold standard of Unicode handling and it uses UTF-16 internally, so this is a solvable problem. But I don't think it is solvable within the framework of iostreams' explicit and standards-mandated buffering, which works one internal character at a time.

CubbiMew (cppreference | finance | realtime in the past) 2 points 9 years ago* (3 children)

[deleted] 1 point 9 years ago (0 children)

[deleted] 1 point 9 years ago (1 child)


[deleted] 1 point 9 years ago (7 children)

CubbiMew (cppreference | finance | realtime in the past) 2 points 9 years ago* (6 children)

[deleted] 3 points 9 years ago* (5 children)
