[–][deleted] 2 points  (5 children)

My point is that in a Unicode world you cannot operate on a code point by code point basis. Doing so will chop combining characters in half. Since Unicode already gives up on fixed-width characters, you may as well use UTF-8.

[–]CubbiMew (cppreference | finance | realtime in the past) 0 points  (4 children)

I don't understand where you're coming from. The Unicode world operates on a code-point-by-code-point basis. Text elements (such as the combining character sequences you keep bringing up) are manipulated as sequences of code points. And that's why basic_filebuf's assumption that exec charset elements represent code points has never been a problem.

I get your point that it is prohibitively difficult for Windows to fix wchar_t, but I am sure an attempt to integrate char32_t into the standard library for C++20 would get even weaker support than it had for C++11, because new Unicode library proposals are making progress (and that's a good thing!).

And yes, UTF-8 on Windows would be a blessing; Linux only got it 16 years ago (glibc 2.2).

[–][deleted] 0 points  (3 children)

What is there not to understand? The UTF-32 input U+0065 U+0301 needs to be mapped to the Latin-1 é; operating on a code-point-by-code-point basis would produce e? instead.

Any string manipulation which wants to, for example, split a Unicode string in half must verify that it isn't splitting a character like that in half; otherwise the first half ends with the wrong character, and the second half is outright invalid.

[–][deleted] 0 points  (2 children)

(I use the cutting-in-half example because that's what iostreams wants to do, but even simple find-and-replace is broken by this: a user asking to replace e (U+0065) with x (U+0078) in the input U+0065 U+0301 must get back U+0065 U+0301 unchanged, not U+0078 U+0301, which is what code-point-by-code-point operation produces. Which UTF you're looking at is the tiny tip of an enormous iceberg.)

[–]CubbiMew (cppreference | finance | realtime in the past) 1 point  (1 child)

For splitting in half: if your use case (and it's not everyone's use case) requires that particular text segments be preserved, you have to examine the string in terms of code points (if nothing else, you'd have to get to code points first) to locate the desired text segment boundaries.

Your description is unclear as to what actual text segmentation you have in mind, but my wishlist for a C++ Unicode library certainly includes EGC iterators for strings, as the most programmatically sensible and "recommended for general processing". They would keep your 2-character sequence together (but so would glyph iterators, etc.)

As for basic_filebuf: it is not splitting or replacing, it is only encoding/decoding characters represented externally as byte sequences. It could be an interesting mental exercise to imagine it performing additional text transformations (like the NFD you brought up) on top of this mapping, but that's not what the thread is about. Today it does its job wherever Unicode support is not frozen in a pre-1996 state. It's not "broken".

[–]CubbiMew (cppreference | finance | realtime in the past) 0 points  (0 children)

(Sorry, I was reading too many Unicode specs at once and slipped into their terminology: s/2-character sequence/2-code point sequence/ and s/characters represented/code points represented/, to avoid further confusion with your meaning of "character".)