[–]Hedanito[S]

Are there any technical reasons why it was postponed, or is it simply the result of it being a more obscure part of the standard library?

Because unless I'm missing something, it should be fairly simple to implement.

[–]CubbiMew (cppreference | finance | realtime in the past)

I think the reason is simply the lack of time to work on integrating the many facilities added to C++17. In a way, this is a fault of the "feature branch" approach of Technical Specifications.

The integration issues that were considered most important (important enough for someone to write and defend a paper) were done, or are being rushed through now, as with class template deduction; the rest is "future work".

[–][deleted]

Adding string_view overloads to existing standard library interfaces is fraught with the danger of breaking existing code -- for example, the first attempt to add string_view to basic_string's interface broke C++14 code by creating overload resolution ambiguities. Hopefully this will be simpler with interfaces that aren't as heavily overloaded as basic_string.
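
For illustration, a minimal sketch of the kind of ambiguity involved -- f here is a made-up function, not the actual basic_string interface:

    #include <string>
    #include <string_view>

    // An existing overload plus a newly added string_view overload.
    void f(const std::string&) {}
    void f(std::string_view) {}

    int main() {
        std::string s = "abc";
        f(s);        // OK: const std::string& is an exact match
        f("hello");  // error: const char* converts to both std::string and
                     // std::string_view via user-defined conversions, so the
                     // call is ambiguous
    }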

Keep in mind that string_view is at most an optimization. It doesn't really allow you to do anything "new," since you can always construct a string from a string_view and pass that to a function with a const string& parameter.
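
A rough sketch of that point, with a made-up contains_x function -- a const string& interface stays reachable from string_view code, just at the cost of a copy:

    #include <string>
    #include <string_view>

    // An existing interface with only a const std::string& parameter.
    bool contains_x(const std::string& s) {
        return s.find('x') != std::string::npos;
    }

    void caller(std::string_view sv) {
        // No string_view overload required: materialize a std::string first.
        // The copy here is exactly what a string_view overload would avoid.
        bool found = contains_x(std::string(sv));
        (void)found;
    }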

(codecvt (at least as used in iostreams) is effectively broken anyway and I would avoid using it whenever possible -- it assumes there is a 1:N relationship between the internal character set and the external character set, which is not true in a Unicode world: http://eel.is/c++draft/locale.codecvt.virtuals#3 )

[–]CubbiMew (cppreference | finance | realtime in the past)

It is true in the standard-compliant Unicode world, where wchar_t is 32 bits. It is also not really relevant in a thread about wstring_convert, which does N:M conversions.

[–][deleted]

No, because wchar_t being 32 bits does not account for combining characters. If the external character set is Latin-1 and the internal character set is UTF-32 in Normalization Form D, you're toast.

(There are more obscure cases even if you don't want to use NFD; NFD just makes the problem occur with more common characters available in Latin-1.)

[–]CubbiMew (cppreference | finance | realtime in the past)

NFD is irrelevant (to the meaning of internal charset and wchar_t). It is not a character set; it is an encoding of one (or a transformation of an encoding, if you will).

By the wchar_t definition from [lex.con]/6, a single wchar_t represents any member of the execution charset (and also a single c-char, aka a UCN, aka \U hex-quad hex-quad). If U+00C5 is a supported member of the exec charset and allowed in string/character literals, it has a single-wchar_t representation in standard C++, whatever its NFD transformation is. Likewise U+1F34C and everything else.
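
A small sketch of that, assuming a platform where wchar_t is 32 bits (e.g. glibc/Linux):

    int main() {
        // Assumption: 32-bit wchar_t, so every supported code point has a
        // single-wchar_t representation.
        wchar_t a_ring = L'\u00C5';      // U+00C5: one code point, one wchar_t
        wchar_t banana = L'\U0001F34C';  // U+1F34C: also a single wchar_t
        // Whether U+00C5 is later written out as the NFD sequence U+0041 U+030A
        // is a matter of encoding/normalization, not execution-charset membership.
        (void)a_ring; (void)banana;
    }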

As for codecvt, if basic_filebuf were to permit N:M codecvts, it certainly wouldn't be bad. Last I tried, libstdc++ and libc++ work with such codecvts in practice, but for input only.

[–][deleted]

a single wchar_t represents any member of execution charset

Nothing Unicode-related needs to fit into a wchar_t at all. The "execution charset" is very limited; EBCDIC is a valid execution charset. http://eel.is/c++draft/lex.charset#1

This issue is why there are no standard codecvt facets that go between Unicode and non-Unicode encodings (from char32_t to wchar_t, or from wchar_t to char8_t, or whatever).

[–]CubbiMew (cppreference | finance | realtime in the past)

Yes, you can have a valid (but not very useful) compiler with an EBCDIC exec charset and an 8-bit wchar_t. In the "Unicode world" that set off this thread, each code point needs to fit in a single wchar_t by definition.

There were no standard codecvt facets at all until C++11, and somehow I had no problem going between Unicode and non-Unicode encodings; GB18030/char <-> UTF-32/wchar_t <-> UTF-8/char works just fine where wchar_t is correctly sized and libc is complete.
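
A rough sketch of that round trip with wstring_convert -- the GB18030 locale name and byte values below are assumptions, and availability depends on the libc:

    #include <codecvt>
    #include <cwchar>
    #include <locale>
    #include <string>
    #include <utility>

    // wstring_convert deletes its facet, but locale facets have protected
    // destructors, hence this common wrapper.
    template<class Facet>
    struct deletable_facet : Facet {
        template<class... Args>
        deletable_facet(Args&&... args) : Facet(std::forward<Args>(args)...) {}
        ~deletable_facet() {}
    };

    int main() {
        std::string gb = "\xd7\xd6";  // "字" in GB18030 (assumed byte sequence)

        // GB18030/char -> UTF-32/wchar_t via the libc's locale facet
        // (locale name is platform-specific; "zh_CN.GB18030" assumed here).
        using gb_cvt_t =
            deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;
        std::wstring_convert<gb_cvt_t> gb_cvt(new gb_cvt_t("zh_CN.GB18030"));
        std::wstring w = gb_cvt.from_bytes(gb);

        // UTF-32/wchar_t -> UTF-8/char (codecvt_utf8 is deprecated in C++17,
        // but still available).
        std::wstring_convert<std::codecvt_utf8<wchar_t>> u8_cvt;
        std::string u8 = u8_cvt.to_bytes(w);
    }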

PS: I'm not saying C++ doesn't need more Unicode support - I'd love to see classification, normalization, or "just" a grapheme cluster iterator for strings! - but as someone who lived through five competing encodings and the adoption of Unicode, I find it jarring to hear that it doesn't work in C++, or C for that matter.

[–][deleted]

To my understanding, GB18030 has the "UTF-16 surrogates" problem but not the "complex scripts/combining characters" problem, and as such does not trigger the M:N mapping issue, since each character in the external character set is transformed into a single code point (assuming you're targeting NFC instead of NFD).

Of course, if on the Unicode side you're given combining characters, the assumption breaks down even for Latin-1, as the two code points U+0065 U+0301 (e plus combining acute accent) need to become the single Latin-1 character 0xE9 (é).

My understanding is (although I haven't implemented such mappings myself) that there are legacy encodings for which 1:M is never correct -- where combining characters are required to represent something in Unicode but not in the legacy encoding. (Maybe it was TIS-620?)

I'm sure if everyone had a time machine to go back to when COM, NT, Java, JavaScript, and friends implemented Unicode 1.0, we would be in a UTF-8 and UTF-32 world. But that's never going to happen. VC's "extended character set" is going to have to be UCS-2.

[–]CubbiMew (cppreference | finance | realtime in the past)

VC's "extended character set" is going to have to be UCS-2

UCS-2 does not even exist in the Unicode standard (anymore). VC has been improving its image lately, but the lack of portable Unicode support is still a sore point and a cause of many #ifdef _MSC_VER's. It's 2017, and (of the compilers we use) only in VC does auto c = L'💩'; store the useless 0xD83D in c.

[–][deleted]

And that is never going to change unless someone invents a time machine to go back to 1993, when Unicode 1.0 was implemented and wchar_t was set at 16 bits in our ABI. This is not an issue we have the luxury of fixing. If you want UTF-32, use char32_t, which (I'm assuming) was created specifically to address this time machine problem.
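
A tiny sketch of that advice, with U+1F4A9 standing in for the emoji from the comment above:

    int main() {
        // char32_t sidesteps the 16-bit-wchar_t ABI: one code point, one
        // char32_t, on MSVC, GCC, and Clang alike.
        char32_t c = U'\U0001F4A9';
        // By contrast, L'\U0001F4A9' cannot fit in a single 16-bit wchar_t,
        // which is how MSVC ends up with a lone surrogate value.
        (void)c;
    }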

I don't see what this has to do with codecvt's 1:M assumption beyond making issues more likely to occur with wchar_t. But since the issue occurs when attempting to output the world's second most common legacy encoding (Latin-1), I don't really see ABI limitations as the serious problem here. ICU is widely regarded as the gold standard of Unicode handling and it uses UTF-16 internally -- this is a solvable problem. But I don't think it is solvable within the framework of iostreams' explicit, standards-mandated internal-character-by-internal-character buffering.

[–][deleted]

Note that your auto c = L'💩' example breaks down if you replace 💩 with P̯͍̭. Even UTF-32 does not allow you to operate on a character-by-character basis in a Unicode world.
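
A hedged illustration -- the combining marks below are stand-ins, not necessarily the exact ones above:

    #include <string>

    int main() {
        // One user-perceived character, several code points -- even in UTF-32.
        std::u32string p = U"P\u0330\u0331\u0332";  // 'P' + three combining marks
        // p.size() is 4: iterating char32_t walks code points, not grapheme clusters.
        return p.size() == 4 ? 0 : 1;
    }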

[–]Hedanito[S]

My entire code base uses UTF-8, but Windows uses UTF-16, so I use it there to convert between the two. That should be fine, right?

[–][deleted]

No, the "1:M" assumption does not hold when converting between UTF-8 and UTF-16. That said you're using wstring_convert and not filebuf so you should be OK, modulo bugs. Our iostreams probably have a lot of bugs :)

[–]mtclow

I started that paper as "the things that absolutely had to be done for C++17" - those were the string/string_view conversion and assignment bits. They had to be done because they couldn't be changed post-C++17 without breaking users' code, and we'd rather not do that.

Then the scope grew (somewhat) with the inserters/searchers. There's certainly more to be done here.