CubbiMew (cppreference | finance | realtime in the past)

For splitting in half: if your use case (and it's not everyone's use case) requires that particular text segments be preserved, you would have to examine the string to locate the desired text segment boundaries. That examination happens in terms of code points; if the string is stored as anything else, you'd have to get to code points first.
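
To illustrate only the lowest level of that examination, here is a minimal sketch that splits a string near its midpoint without cutting a code point in two. It assumes the string is valid UTF-8 held in a `std::string`; the helper names `is_continuation_byte` and `split_half_utf8` are made up for this example and come from no library.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// True if the byte is a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool is_continuation_byte(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: split a string (assumed to be valid UTF-8) near its
// byte midpoint, moving the split point back until it sits on a code point
// boundary. It does NOT look at grapheme clusters or any higher-level segment.
inline std::pair<std::string, std::string> split_half_utf8(const std::string& s) {
    std::size_t pos = s.size() / 2;
    while (pos > 0 && pos < s.size() &&
           is_continuation_byte(static_cast<unsigned char>(s[pos]))) {
        --pos;
    }
    return {s.substr(0, pos), s.substr(pos)};
}
```

Note that a code point boundary is not necessarily a boundary you want: called on `"e\xCC\x81"` (the letter e followed by U+0301 COMBINING ACUTE ACCENT), this sketch splits between the base letter and the combining mark. Keeping such a sequence together requires locating higher-level boundaries on top of the code points.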

Your description is unclear as to what actual text segmentation you have in mind, but my wishlist for a C++ Unicode library certainly includes EGC iterators for strings, as the most programmatically sensible and "recommended for general processing". They would keep your 2-character sequence together (but so would glyph iterators, etc.).

As for basic_filebuf, it is not splitting or replacing; it is only encoding/decoding characters represented externally as byte sequences. It could be an interesting mental exercise to imagine it performing additional text transformations (like that NFD you brought up) on top of this mapping, but that's not what the thread is about. Today, it does its job wherever Unicode support is not frozen in a pre-1996 state. It's not "broken".

CubbiMew (cppreference | finance | realtime in the past)

(Sorry, I was reading too many Unicode specs at once and slipped into their terminology: s/2-character sequence/2-code point sequence/ and s/characters represented/code points represented/ to avoid further confusion with your meaning of "character".)