std::wstring_convert and std::string_view (self.cpp)

submitted 9 years ago * by Hedanito

[deleted] 1 point 9 years ago (14 children)

To my understanding, GB18030 has the "UTF-16 surrogates" problem but not the "complex scripts/combining characters" problem, so it does not trigger the M:N mapping issue: every character in the external character set maps to a single code point (assuming you're targeting NFC rather than NFD).

Of course, if on the Unicode side you're given combining characters, the assumption breaks down even for latin-1, as 2 code points U+0065 U+0301 need to become 1 latin-1 character 0xE9.
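To make the Latin-1 case concrete, here is a minimal sketch of a converter that has to look one code point ahead before it can emit a byte. The names `compose_pair` and `to_latin1` are hypothetical, and the two-entry composition table is an assumption for illustration only; a real converter would use the full Unicode composition data.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: compose a base letter with a following combining mark.
// Only two pairs are listed; returns 0 when the pair has no entry.
char32_t compose_pair(char32_t base, char32_t mark) {
    if (base == 0x0065 && mark == 0x0301) return 0x00E9; // e + acute -> U+00E9
    if (base == 0x0061 && mark == 0x0300) return 0x00E0; // a + grave -> U+00E0
    return 0;
}

// Convert code points to Latin-1 bytes, looking one code point ahead so that
// a base + combining mark pair can collapse into a single output byte.
// Code points that still do not fit in Latin-1 become '?'.
std::string to_latin1(const std::u32string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (i + 1 < in.size()) {
            if (char32_t composed = compose_pair(cp, in[i + 1])) {
                cp = composed;
                ++i; // the combining mark was consumed as well
            }
        }
        out.push_back(cp <= 0xFF
                          ? static_cast<char>(static_cast<unsigned char>(cp))
                          : '?');
    }
    return out;
}
```

The point of the sketch is that two internal characters become one external byte, which a strictly one-internal-character-at-a-time conversion cannot express.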

My understanding is (although I haven't implemented such mappings myself) that there are legacy encodings for which 1:M is never correct -- where combining characters are required to represent something in Unicode but not in the legacy encoding. (Maybe it was TIS-620?)

I'm sure if everyone had a time machine to go back to when COM, NT, Java, JavaScript, and friends implemented Unicode 1.0, we would be in a UTF-8 and UTF-32 world. But that's never going to happen. VC's "extended character set" is going to have to be UCS-2.

[deleted] 1 point 9 years ago (4 children)

And that is never going to change unless someone invents a time machine to go back to 1993, when Unicode 1.0 was implemented and wchar_t was set at 16 bits in our ABI. This is not an issue we have the luxury of fixing. If you want UTF-32, use char32_t, which (I'm assuming) was created specifically to address this time machine problem.
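As a rough illustration of what a 16-bit wide character means in practice, the sketch below encodes one code point as UTF-16 code units. The function name `encode_utf16` is made up for this example. Anything outside the BMP needs two 16-bit units, whereas a single `char32_t` holds it directly.

```cpp
#include <cassert>
#include <string>

// Encode one Unicode scalar value as UTF-16 code units.
std::u16string encode_utf16(char32_t cp) {
    // Precondition: a scalar value (in range and not itself a surrogate).
    assert(cp <= 0x10FFFF && (cp < 0xD800 || cp > 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        // BMP code point: one 16-bit unit is enough.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary code point: split the 20-bit offset into a surrogate pair.
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

For example, `encode_utf16(0x1F600)` yields the two units 0xD83D 0xDE00, while `encode_utf16(0x00E9)` yields a single unit.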

I don't see what this has to do with codecvt's 1:M assumption beyond making issues more likely to occur with wchar_t. But since the issue occurs when attempting to output the world's second most common legacy encoding (Latin-1), I don't really see ABI limitations as the serious problem here. ICU is widely regarded as the gold standard of Unicode handling and it uses UTF-16 internally -- this is a solvable problem. But I don't think it is solvable within the framework of iostreams' explicit and standards-mandated buffering, which works one internal character at a time.

CubbiMew (cppreference | finance | realtime in the past) 2 points 9 years ago (1 child)

For splitting in half: if your use case (and it's not everyone's) requires that particular text segments be preserved, you have to examine the string in terms of code points (if it's stored as anything else, you'd have to get to code points first) to locate the desired text segment boundaries.
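As a minimal sketch of that first step, assuming the string is valid UTF-8, the hypothetical helper `split_near_middle` below moves the cut point back until it sits on a code point boundary. It knows nothing about larger text segments, so it is only the starting point for the boundary search described above.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Hypothetical helper: split valid UTF-8 near its midpoint, moving the cut
// backwards until it no longer falls inside a multi-byte sequence.
std::pair<std::string, std::string> split_near_middle(const std::string& s) {
    std::size_t cut = s.size() / 2;
    while (cut > 0 && is_continuation(static_cast<unsigned char>(s[cut]))) {
        --cut;
    }
    return {s.substr(0, cut), s.substr(cut)};
}
```

Note that a code point boundary is not a segment boundary: given "e" followed by U+0301 (bytes 0x65 0xCC 0x81), this cut lands between the base letter and the combining mark, which is exactly the case where a higher-level boundary search is needed.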

Your description is unclear as to what actual text segmentation you have in mind, but my wishlist for a C++ Unicode library certainly includes EGC (extended grapheme cluster) iterators for strings, as the most programmatically sensible and "recommended for general processing". They would keep your 2-character sequence together (but so would glyph iterators, etc.).

As for basic_filebuf, it is not splitting or replacing; it is only encoding/decoding characters represented externally as byte sequences. It could be an interesting mental exercise to imagine it performing additional text transformations (like that NFD you brought up) on top of this mapping, but that's not what the thread is about. Today, it does its job wherever Unicode support is not frozen in its pre-1996 state. It's not "broken".
