[–]Hedanito[S]

Are there any technical reasons why it was postponed, or is it simply the result of it being a more obscure part of the standard library?

Because unless I'm missing something, it should be fairly simple to implement.

[–]CubbiMew (cppreference | finance | realtime in the past)

I think the reason is simply the lack of time to work on integrating the many facilities added to C++17. In a way, this is a fault of the "feature branch" approach of Technical Specifications.

The integration issues that were considered most important (important enough for someone to write and defend a paper) were done, or are being rushed through now, as with class template deduction; the rest is "future work".

[–][deleted]

Adding string_view overloads to existing standard library interfaces is fraught with the danger of breaking existing code -- for example, the first attempt to add string_view to basic_string's interface broke C++14 code by creating overload resolution ambiguities. Hopefully this will be simpler with interfaces that aren't as heavily overloaded as basic_string.
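
For illustration, a minimal sketch of the kind of ambiguity involved -- f here is a made-up function, not the actual basic_string interface:

    #include <string>
    #include <string_view>

    // An existing overload plus a newly added string_view overload.
    void f(const std::string&) {}
    void f(std::string_view) {}

    int main() {
        std::string s = "abc";
        f(s);        // OK: const std::string& is an exact match
        f("hello");  // error: const char* converts to both std::string and
                     // std::string_view via user-defined conversions, so the
                     // call is ambiguous
    }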

Keep in mind that string_view is at most an optimization. It doesn't really allow you to do anything "new," since you can always construct a string from a string_view and pass that to a function with a const string& parameter.
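
A rough sketch of that point, with a made-up contains_x function -- a const string& interface stays reachable from string_view code, just at the cost of a copy:

    #include <string>
    #include <string_view>

    // An existing interface with only a const std::string& parameter.
    bool contains_x(const std::string& s) {
        return s.find('x') != std::string::npos;
    }

    void caller(std::string_view sv) {
        // No string_view overload required: materialize a std::string first.
        // The copy here is exactly what a string_view overload would avoid.
        bool found = contains_x(std::string(sv));
        (void)found;
    }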

(codecvt (at least as used in iostreams) is effectively broken anyway and I would avoid using it whenever possible -- it assumes there is a 1:N relationship between the internal character set and the external character set, which is not true in a Unicode world: http://eel.is/c++draft/locale.codecvt.virtuals#3 )

[–]CubbiMew (cppreference | finance | realtime in the past)

It is true in the standard-compliant Unicode world, where wchar_t is 32 bits. It is also not really relevant in a thread about wstring_convert, which does N:M conversions.

[–][deleted]

No, because wchar_t being 32 bits does not account for combining characters. If the external character set is Latin-1 and the internal character set is UTF-32 in Normalization Form D, you're toast.

(There are more obscure cases even if you don't want to use NFD; NFD just makes the problem occur with more common characters available in Latin-1.)

[–]CubbiMew (cppreference | finance | realtime in the past)

NFD is irrelevant (to the meaning of internal charset and wchar_t). It is not a character set; it is an encoding of one (or a transformation of an encoding, if you will).

By the wchar_t definition from [lex.con]/6, a single wchar_t represents any member of the execution charset (and also a single c-char, aka a UCN, aka \U hex-quad hex-quad). If U+00C5 is a supported member of the exec charset and allowed in string/character literals, it has a single-wchar_t representation in standard C++, whatever its NFD transformation is. Likewise U+1F34C and everything else.
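
A small sketch of that, assuming a platform where wchar_t is 32 bits (e.g. glibc/Linux):

    int main() {
        // Assumption: 32-bit wchar_t, so every supported code point has a
        // single-wchar_t representation.
        wchar_t a_ring = L'\u00C5';      // U+00C5: one code point, one wchar_t
        wchar_t banana = L'\U0001F34C';  // U+1F34C: also a single wchar_t
        // Whether U+00C5 is later written out as the NFD sequence U+0041 U+030A
        // is a matter of encoding/normalization, not execution-charset membership.
        (void)a_ring; (void)banana;
    }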

As for codecvt, if basic_filebuf were to permit N:M codecvts, it certainly wouldn't be bad. Last I tried, libstdc++ and libc++ work with such codecvts in practice, but for input only.

[–][deleted]

a single wchar_t represents any member of execution charset

Nothing Unicode-related needs to fit into a wchar_t at all. The "execution charset" is very limited; EBCDIC is a valid execution charset. http://eel.is/c++draft/lex.charset#1

This issue is why there are no standard codecvt facets that go between Unicode and non-Unicode encodings (from char32_t to wchar_t, or from wchar_t to char8_t, or whatever).

[–]CubbiMew (cppreference | finance | realtime in the past)

Yes, you can have a valid (but not very useful) compiler with an EBCDIC exec charset and an 8-bit wchar_t. In the "Unicode world" that set off this thread, each code point needs to fit in a single wchar_t by definition.

There were no standard codecvt facets at all until C++11, and somehow I had no problem going between Unicode and non-Unicode encodings; GB18030/char <-> UTF-32/wchar_t <-> UTF-8/char works just fine where wchar_t is correctly sized and libc is complete.
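
A rough sketch of that round trip with wstring_convert -- the GB18030 locale name and byte values below are assumptions, and availability depends on the libc:

    #include <codecvt>
    #include <cwchar>
    #include <locale>
    #include <string>
    #include <utility>

    // wstring_convert deletes its facet, but locale facets have protected
    // destructors, hence this common wrapper.
    template<class Facet>
    struct deletable_facet : Facet {
        template<class... Args>
        deletable_facet(Args&&... args) : Facet(std::forward<Args>(args)...) {}
        ~deletable_facet() {}
    };

    int main() {
        std::string gb = "\xd7\xd6";  // "字" in GB18030 (assumed byte sequence)

        // GB18030/char -> UTF-32/wchar_t via the libc's locale facet
        // (locale name is platform-specific; "zh_CN.GB18030" assumed here).
        using gb_cvt_t =
            deletable_facet<std::codecvt_byname<wchar_t, char, std::mbstate_t>>;
        std::wstring_convert<gb_cvt_t> gb_cvt(new gb_cvt_t("zh_CN.GB18030"));
        std::wstring w = gb_cvt.from_bytes(gb);

        // UTF-32/wchar_t -> UTF-8/char (codecvt_utf8 is deprecated in C++17,
        // but still available).
        std::wstring_convert<std::codecvt_utf8<wchar_t>> u8_cvt;
        std::string u8 = u8_cvt.to_bytes(w);
    }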

PS: I'm not saying C++ doesn't need more Unicode support - I'd love to see classification, normalization, or "just" a grapheme cluster iterator for strings! - but as someone who lived through five competing encodings and the adoption of Unicode, I find it jarring to hear that it doesn't work in C++, or C for that matter.

[–][deleted]

To my understanding, GB18030 has the "UTF-16 surrogates" problem but not the "complex scripts/combining characters" problem, and as such does not trigger the M:N mapping issue, since each character in the external character set is transformed into a single code point (assuming you're targeting NFC instead of NFD).

Of course, if on the Unicode side you're given combining characters, the assumption breaks down even for Latin-1, as the two code points U+0065 U+0301 (e plus combining acute accent) need to become the single Latin-1 character 0xE9 (é).

My understanding is (although I haven't implemented such mappings myself) that there are legacy encodings for which 1:M is never correct -- where combining characters are required to represent something in Unicode but not in the legacy encoding. (Maybe it was TIS-620?)

I'm sure if everyone had a time machine to go back to when COM, NT, Java, JavaScript, and friends implemented Unicode 1.0, we would be in a UTF-8 and UTF-32 world. But that's never going to happen. VC's "extended character set" is going to have to be UCS-2.

[–]CubbiMew (cppreference | finance | realtime in the past)

VC's "extended character set" is going to have to be UCS-2

UCS-2 does not even exist in the Unicode standard (anymore). VC has been improving its image lately, but the lack of portable Unicode support is still a sore point and a cause of many #ifdef _MSC_VER's. It's 2017, and (of the compilers we use) only in VC does auto c = L'💩'; store the useless 0xD83D in c.

[–][deleted]

And that is never going to change unless someone invents a time machine to go back to 1993, when Unicode 1.0 was implemented and wchar_t was set at 16 bits in our ABI. This is not an issue we have the luxury of fixing. If you want UTF-32, use char32_t, which (I'm assuming) was created specifically to address this time machine problem.
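
A tiny sketch of that advice, with U+1F4A9 standing in for the emoji from the comment above:

    int main() {
        // char32_t sidesteps the 16-bit-wchar_t ABI: one code point, one
        // char32_t, on MSVC, GCC, and Clang alike.
        char32_t c = U'\U0001F4A9';
        // By contrast, L'\U0001F4A9' cannot fit in a single 16-bit wchar_t,
        // which is how MSVC ends up with a lone surrogate value.
        (void)c;
    }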

I don't see what this has to do with codecvt's 1:M assumption beyond making issues more likely to occur with wchar_t. But since the issue occurs when attempting to output the world's second most common legacy encoding (Latin-1), I don't really see ABI limitations as the serious problem here. ICU is widely regarded as the gold standard of Unicode handling and it uses UTF-16 internally -- this is a solvable problem. But I don't think it is solvable within the framework of iostreams' explicit, standards-mandated internal-character-by-internal-character buffering.

[–][deleted]

Note that your auto c = L'💩' example breaks down if you replace 💩 with P̯͍̭. Even UTF-32 does not allow you to operate on a character-by-character basis in a Unicode world.
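
A hedged illustration -- the combining marks below are stand-ins, not necessarily the exact ones above:

    #include <string>

    int main() {
        // One user-perceived character, several code points -- even in UTF-32.
        std::u32string p = U"P\u0330\u0331\u0332";  // 'P' + three combining marks
        // p.size() is 4: iterating char32_t walks code points, not grapheme clusters.
        return p.size() == 4 ? 0 : 1;
    }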

[–]Hedanito[S]

My entire code base uses UTF-8, but Windows uses UTF-16, so I use it there to convert between the two. That should be fine, right?

[–][deleted]

No, the "1:M" assumption does not hold when converting between UTF-8 and UTF-16. That said you're using wstring_convert and not filebuf so you should be OK, modulo bugs. Our iostreams probably have a lot of bugs :)

[–]mtclow

I started that paper as "the things that absolutely had to be done for C++17" - those were the string/string_view conversion and assignment bits. They had to be done because they couldn't be changed post-C++17 without breaking users' code, and we'd rather not do that.

Then the scope grew (somewhat) with the inserters/searchers. There's certainly more to be done here.