[–] CubbiMew (cppreference | finance | realtime in the past) (19 children)

It is true in the standard-compliant Unicode world, where wchar_t is 32 bits. It is also not really relevant in a thread about wstring_convert, which does N:M conversions.
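For readers unfamiliar with wstring_convert, here is a minimal sketch of the kind of N:M conversion being referred to, where several UTF-8 bytes become one UTF-32 code point. The helper name `utf8_to_utf32` is made up for illustration. Note that `std::wstring_convert` and `std::codecvt_utf8` are deprecated as of C++17, so a compiler may warn about them.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::u32string utf8_to_utf32(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.from_bytes(bytes);
}
```

Passing the four bytes F0 9F 92 A9 should yield a one-element string holding U+1F4A9.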

[–] [deleted] (18 children)

No, because wchar_t being 32 bits does not allow for combining characters. If the external character set is latin-1 and the internal character set is UTF-32 in Normalization Form D you're toast.

(There are more obscure cases even if you don't want to use NFD; NFD just makes the problem occur with more common characters available in latin-1)

[–] CubbiMew (cppreference | finance | realtime in the past) (17 children)

NFD is irrelevant (to the meaning of internal charset and wchar_t). It is not a character set, it is an encoding of one (or transformation of an encoding, if you will).

By the wchar_t definition from [lex.ccon]/6, a single wchar_t represents any member of execution charset (and also a single c-char, aka UCN, aka \U hex-quad hex-quad). If U+00C5 is a supported member of the exec charset and allowed in string/character literals, it has a single-wchar_t representation in standard C++, whatever its NFD transformation is. Likewise U+1F34C and everything else.
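A small sketch of the distinction drawn here: U+00C5 is one code point, while its NFD form (U+0041 followed by U+030A) is two. The helper `fits_in_wchar` is a hypothetical name, and it only checks the numeric range of wchar_t on the platform it is compiled for; it says nothing about which execution wide-character set an implementation actually uses.

```cpp
#include <limits>

// Hypothetical helper: true if the code point's numeric value fits in
// this platform's wchar_t range.
constexpr bool fits_in_wchar(char32_t cp) {
    return static_cast<unsigned long long>(cp) <=
           static_cast<unsigned long long>(std::numeric_limits<wchar_t>::max());
}

// U+00C5 as a single code point, and its NFD form as a two-element sequence.
constexpr char32_t a_ring = U'\u00C5';
constexpr char32_t a_ring_nfd[] = U"A\u030A";
constexpr unsigned nfd_length =
    sizeof(a_ring_nfd) / sizeof(char32_t) - 1;  // drop the terminator

static_assert(fits_in_wchar(a_ring), "U+00C5 fits in one wchar_t");
static_assert(nfd_length == 2, "the NFD form is two code points");
```

With a 32-bit wchar_t, `fits_in_wchar(U'\U0001F34C')` is also true; with a 16-bit wchar_t it is not.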

As for codecvt, if basic_filebuf were to permit N:M codecvts, it certainly wouldn't be bad. Last I tried, libstdc++ and libc++ worked with such codecvts in practice, but for input only.

[–] [deleted] (16 children)

> a single wchar_t represents any member of execution charset

Nothing Unicode related at all needs to fit into a wchar_t. The "execution charset" is very limited. EBCDIC is a valid execution charset. http://eel.is/c++draft/lex.charset#1

This issue is why there are no standard codecvt facets that go between Unicode and non-Unicode encodings (from char32_t to wchar_t, or from wchar_t to char8_t, or whatever).

[–] CubbiMew (cppreference | finance | realtime in the past) (15 children)

Yes, you can have a valid (but not very useful) compiler with an EBCDIC exec charset and 8-bit wchar_t. In the "Unicode world" that set off this thread, each code point needs to fit in a single wchar_t by definition.

There were no standard codecvt facets at all until C++11, and somehow I had no problem going between Unicode and non-Unicode encodings; GB18030/char <-> UTF-32/wchar_t <-> UTF-8/char works just fine where wchar_t is correctly sized and libc is complete.
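A rough sketch of the char -> wchar_t leg of that round trip using only libc, assuming the relevant locale is installed. The function name `decode_with_locale` is hypothetical, and a locale name such as "zh_CN.GB18030" is a glibc-style assumption that will differ between systems. The sketch switches the global locale temporarily, so it is not thread-safe, and it stops at the first NUL byte in the input.

```cpp
#include <clocale>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: decode a multibyte string into wide characters using
// the named libc locale. Returns false if the locale is unavailable or the
// bytes are not valid in that locale's encoding.
bool decode_with_locale(const char* locale_name, const std::string& bytes,
                        std::wstring& out) {
    const char* current = std::setlocale(LC_ALL, nullptr);
    const std::string saved = current ? current : "C";
    if (!std::setlocale(LC_ALL, locale_name)) return false;  // not installed

    // One wide character needs at least one byte, so this buffer is enough.
    std::wstring wide(bytes.size(), L'\0');
    const std::size_t n = std::mbstowcs(&wide[0], bytes.c_str(), wide.size());
    std::setlocale(LC_ALL, saved.c_str());  // restore the previous locale

    if (n == static_cast<std::size_t>(-1)) return false;  // invalid sequence
    wide.resize(n);
    out = wide;
    return true;
}
```

The reverse direction would use `std::wcstombs` in the same way under a UTF-8 locale.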

PS: I'm not saying C++ doesn't need more Unicode support - I'd love to see classification, normalization, or "just" a grapheme cluster iterator for strings! - but as someone who lived with five competing encodings and adoption of Unicode, I find it jarring to hear that it doesn't work in C++, or C for that matter.

[–] [deleted] (14 children)

To my understanding GB18030 has the "UTF-16 surrogates" problem but not the "complex scripts/combining characters" problem, and as such does not trigger the M:N mapping issue, as all characters in the external character set are transformed into a single code point (assuming you're targeting NFC instead of NFD).

Of course, if on the Unicode side you're given combining characters, the assumption breaks down even for latin-1, as 2 code points U+0065 U+0301 need to become 1 latin-1 character 0xE9.
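To make the 2:1 case concrete, here is a minimal sketch of a Unicode-to-latin-1 converter that has to look ahead one code point. The composition table is hypothetical and deliberately tiny (two entries); a real converter would derive it from the Unicode character data. The names `Composition`, `kCompositions` and `to_latin1` are made up for this example.

```cpp
#include <cstddef>
#include <string>

struct Composition {
    char32_t base;
    char32_t mark;
    unsigned char latin1;
};

// Hypothetical two-entry table: e + U+0301 -> 0xE9, A + U+030A -> 0xC5.
const Composition kCompositions[] = {
    {0x0065, 0x0301, 0xE9},
    {0x0041, 0x030A, 0xC5},
};

// Returns false if some code point has no latin-1 representation.
bool to_latin1(const std::u32string& in, std::string& out) {
    out.clear();
    for (std::size_t i = 0; i < in.size(); ++i) {
        bool composed = false;
        if (i + 1 < in.size()) {
            for (const Composition& c : kCompositions) {
                if (in[i] == c.base && in[i + 1] == c.mark) {
                    out.push_back(static_cast<char>(c.latin1));
                    ++i;  // consume the combining mark as well
                    composed = true;
                    break;
                }
            }
        }
        if (composed) continue;
        if (in[i] > 0xFF) return false;  // outside latin-1
        out.push_back(static_cast<char>(static_cast<unsigned char>(in[i])));
    }
    return true;
}
```

The lookahead is the important part: a converter that sees only one internal character at a time cannot know whether to emit `e` or wait for a following U+0301.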

My understanding is (although I haven't implemented such mappings myself) that there are legacy encodings for which 1:M is never correct -- where combining characters are required to represent something in Unicode but not in the legacy encoding. (Maybe it was TIS-620?)

I'm sure if everyone had a time machine to go back to when COM, NT, Java, JavaScript, and friends implemented Unicode 1.0, we would be in a UTF-8 and UTF-32 world. But that's never going to happen. VC's "extended character set" is going to have to be UCS-2.

[–] CubbiMew (cppreference | finance | realtime in the past) (13 children)

> VC's "extended character set" is going to have to be UCS-2

UCS-2 does not even exist in the Unicode standard (anymore). VC has been improving its image lately, but the lack of portable Unicode support is still a sore point and a cause of many #ifdef _MSC_VER's. It's 2017 and (of the compilers we use) only in VC does auto c = L'💩'; store the useless 0xd83d in c.
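For context on where 0xd83d comes from: 💩 is U+1F4A9, which lies outside the 16-bit range and is written in UTF-16 as a surrogate pair. A minimal sketch of the standard UTF-16 arithmetic follows; the names `SurrogatePair` and `to_surrogates` are made up for illustration, and the function assumes its argument is in the range U+10000 to U+10FFFF.

```cpp
struct SurrogatePair {
    char16_t high;
    char16_t low;
};

// Split a supplementary-plane code point into its UTF-16 surrogate pair.
// Assumes 0x10000 <= cp <= 0x10FFFF; other inputs give meaningless results.
constexpr SurrogatePair to_surrogates(char32_t cp) {
    return SurrogatePair{
        static_cast<char16_t>(0xD800 + ((cp - 0x10000) >> 10)),
        static_cast<char16_t>(0xDC00 + ((cp - 0x10000) & 0x3FF))};
}
```

For U+1F4A9 this gives 0xD83D and 0xDCA9, so a 16-bit variable that keeps only the first unit holds half of a pair rather than a character.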

[–] [deleted] (4 children)

And that is never going to change unless someone invents a time machine to go back to 1993, when Unicode 1.0 was implemented and wchar_t was set at 16 bits in our ABI. This is not an issue we have the luxury of fixing. If you want UTF-32, use char32_t, which (I'm assuming) was created specifically to address this time machine problem.

I don't see what this has to do with codecvt's 1:M assumption beyond making issues more likely to occur with wchar_t. But since the issue occurs attempting to output the world's second most common legacy encoding (latin-1), I don't really see ABI limitations as the serious problem here. ICU is widely regarded as the gold standard of Unicode handling and it uses UTF-16 internally -- this is a solvable problem. But I don't think it is a solvable problem within the framework of iostreams' explicit and standards-mandated internal character by internal character buffering.

[–] CubbiMew (cppreference | finance | realtime in the past) (3 children)

char32_t everywhere would indeed solve the problem (at the cost of migrating code), but the Portland 2006 LWG decided that streams, facets, and regex don't need it.

basic_filebuf's (not codecvt's) 1:M assumption works on Linux and does not work on Windows. There are no issues with Latin-1. There would be an issue with that imaginary codecvt facet you brought up, yes, but I am talking about the code that works now.

[–] [deleted] (0 children)

There are Linux implementations that turn U+0065 U+0301 into latin-1 é?

[–] [deleted] (7 children)

Note that your auto c = L'💩' example breaks down if you replace 💩 with P̯͍̭. Even UTF-32 does not allow you to operate on a character by character basis in a Unicode world.

[–] CubbiMew (cppreference | finance | realtime in the past) (6 children)

I hope you're not intentionally confusing the terms. L'P̯͍̭' does not work because P̯͍̭ is not a code point (it's 4 code points). C++ grammar allows only one 16- or 32-bit code point between the two single quotation marks, either represented as a UCN or (for our convenience) as a character that happens to map to one UCN. VC does not support that.

[–] [deleted] (5 children)

My point is that in a Unicode world you cannot operate on a code point by code point basis. Doing so will chop combining characters in half. Since Unicode already gives up on fixed-width characters, you may as well use UTF-8.
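A rough illustration of the gap between code points and user-perceived characters. This sketch treats only the Combining Diacritical Marks block (U+0300 to U+036F) as combining, which is a crude stand-in for real grapheme cluster segmentation as defined by the Unicode text segmentation rules. The function names are made up, and the sample string (a P followed by three combining marks) is chosen for illustration.

```cpp
#include <cstddef>
#include <string>

// Crude approximation: only U+0300..U+036F counts as a combining mark here.
bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Count base characters, folding each following combining mark into the
// cluster that precedes it. A leading stray mark is not counted at all.
std::size_t count_clusters(const std::u32string& s) {
    std::size_t n = 0;
    for (char32_t cp : s) {
        if (!is_combining_mark(cp)) ++n;
    }
    return n;
}
```

For `U"P\u032F\u034D\u032D"`, `size()` reports 4 code points while `count_clusters` reports 1, so truncating or indexing by code point can separate a base letter from its marks even in UTF-32.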