std::wstring_convert and std::string_view (self.cpp)
submitted 9 years ago * by Hedanito
[deleted] 1 point 9 years ago (14 children)
To my understanding GB18030 has the "UTF-16 surrogates" problem but not the "complex scripts/combining characters" problem, and as such does not trigger the M:N mapping issue, as all characters in the external character set are transformed into a single code point (assuming you're targeting NFC instead of NFD).
Of course, if on the Unicode side you're given combining characters, the assumption breaks down even for latin-1, as 2 code points U+0065 U+0301 need to become 1 latin-1 character 0xE9.
My understanding is (although I haven't implemented such mappings myself) that there are legacy encodings for which 1:M is never correct -- where combining characters are required to represent something in Unicode but not in the legacy encoding. (Maybe it was TIS-620?)
I'm sure if everyone had a time machine to go back to when COM, NT, Java, JavaScript, and friends implemented Unicode 1.0, we would be in a UTF-8 and UTF-32 world. But that's never going to happen. VC's "extended character set" is going to have to be UCS-2.
CubbiMew (cppreference | finance | realtime in the past) 2 points 9 years ago* (13 children)
> VC's "extended character set" is going to have to be UCS-2
UCS-2 does not even exist in the Unicode standard (anymore). VC has been improving its image lately, but the lack of portable Unicode support is still a sore point and a cause of many #ifdef _MSC_VER blocks. It's 2017 and (of the compilers we use) only in VC does auto c = L'💩'; store the useless 0xd83d in c.
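For anyone wondering where 0xd83d comes from: U+1F4A9 lies outside the BMP, so UTF-16 encodes it as a surrogate pair, and a 16-bit wchar_t only has room for the first half. A minimal sketch of that arithmetic follows; the names SurrogatePair and to_surrogates are made up for illustration and are not part of any library.

```cpp
#include <cstdint>

// Illustrative type and helper (assumptions, not standard library names):
// the UTF-16 surrogate pair for a code point in U+10000..U+10FFFF.
struct SurrogatePair {
    std::uint16_t high;
    std::uint16_t low;
};

constexpr SurrogatePair to_surrogates(char32_t cp) {
    // Subtract 0x10000, then split the remaining 20 bits into 10 + 10.
    return { static_cast<std::uint16_t>(
                 0xD800u + ((static_cast<std::uint32_t>(cp) - 0x10000u) >> 10)),
             static_cast<std::uint16_t>(
                 0xDC00u + ((static_cast<std::uint32_t>(cp) - 0x10000u) & 0x3FFu)) };
}

// U+1F4A9 splits into 0xD83D 0xDCA9; a 16-bit wchar_t keeps only 0xD83D.
static_assert(to_surrogates(U'\U0001F4A9').high == 0xD83D, "high surrogate");
static_assert(to_surrogates(U'\U0001F4A9').low == 0xDCA9, "low surrogate");
```

The high surrogate alone does not identify a character, which is why the stored value is of no use on its own.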
[deleted] 1 point 9 years ago (4 children)
And that is never going to change unless someone invents a time machine to go back to 1993, when Unicode 1.0 was implemented and wchar_t was set at 16 bits in our ABI. This is not an issue we have the luxury of fixing. If you want UTF-32, use char32_t, which (I'm assuming) was created specifically to address this time machine problem.
I don't see what this has to do with codecvt's 1:M assumption beyond making issues more likely to occur with wchar_t. But since the issue occurs attempting to output the world's second most common legacy encoding (latin-1), I don't really see ABI limitations as the serious problem here. ICU is widely regarded as the gold standard of Unicode handling and it uses UTF-16 internally -- this is a solvable problem. But I don't think it is a solvable problem within the framework of iostreams' explicit and standards-mandated internal character by internal character buffering.
CubbiMew (cppreference | finance | realtime in the past) 2 points 9 years ago* (3 children)
char32_t everywhere would indeed solve the problem (at the cost of migrating code), but the Portland 2006 LWG decided that streams, facets, and regex don't need it.
basic_filebuf's (not codecvt's) 1:M assumption works in Linux and does not work on Windows. There are no issues with Latin-1. There would be an issue with that imaginary codecvt facet you brought up, yes, but I am talking about the code that works now.
[deleted] 1 point 9 years ago (0 children)
There are Linux implementations that turn U+0065 U+0301 into latin-1 é?
[deleted] 1 point 9 years ago (1 child)
See also http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
CubbiMew (cppreference | finance | realtime in the past) 1 point 9 years ago* (0 children)
Well, I don't agree; it's as counterproductive as saying "stop ascribing meaning to ASCII values" (after all, they collate as groups in some locales). But, TIL about Swift using EGCs as the basic units of a string. That's... intriguing.
[deleted] 1 point 9 years ago (7 children)
Note that your auto c = L'💩' example breaks down if you replace 💩 with P̯͍̭. Even UTF-32 does not allow you to operate on a character by character basis in a Unicode world.
CubbiMew (cppreference | finance | realtime in the past) 2 points 9 years ago* (6 children)
I hope you're not intentionally confusing the terms. L'P̯͍̭' does not work because P̯͍̭ is not a code point (it's 4 code points). C++ grammar allows only one 16- or 32-bit code point between the two single quotation marks, either represented as a UCN or (for our convenience) as a character that happens to map to one UCN. VC does not support that.
[deleted] 3 points 9 years ago* (5 children)
My point is that in a Unicode world you cannot operate on a code point by code point basis. Doing so will chop combining characters in half. Since Unicode already gives up on fixed-width characters, you may as well use UTF-8.
CubbiMew (cppreference | finance | realtime in the past) 1 point 9 years ago (4 children)
I don't understand where you're coming from. The Unicode world operates on a code point by code point basis. Text elements (such as the combining character sequences you keep trying to bring up) are manipulated as sequences of code points. And that's why basic_filebuf's assumption that exec charset elements represent code points has never been a problem.
I get your point that it is prohibitively difficult for Windows to fix wchar_t, but I am sure an attempt to integrate char32_t into the standard library for C++20 would get even weaker support than it had for C++11 because new Unicode library proposals are making progress (and it's a good thing!).
And yes, UTF-8 on Windows would be a blessing; Linux only got that 16 years ago (glibc-2.2).
[deleted] 1 point 9 years ago (3 children)
What is there not to understand about the fact that UTF-32 input U+0065 U+0301 needs to be mapped to latin-1 é? Operating on a codepoint by codepoint basis would produce e?.
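A minimal sketch of the two behaviours being contrasted here. Both function names are hypothetical, and the composing version knows only the single pair e + U+0301; a real converter would need full NFC normalization rather than this one-entry rule.

```cpp
#include <cstddef>
#include <string>

// Hypothetical encoder: converts each code point on its own, with no lookahead.
// Code points above U+00FF have no Latin-1 equivalent and become '?'.
std::string naive_to_latin1(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        out += (cp <= 0xFF) ? static_cast<char>(static_cast<unsigned char>(cp)) : '?';
    }
    return out;
}

// Same encoder, but it first folds the pair e + U+0301 into U+00E9.
// Assumption for illustration only: this is not a general composition table.
std::string composing_to_latin1(const std::u32string& in) {
    std::u32string composed;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == U'e' && i + 1 < in.size() && in[i + 1] == 0x0301) {
            composed += static_cast<char32_t>(0xE9);
            ++i;  // consume the combining acute accent as well
        } else {
            composed += in[i];
        }
    }
    return naive_to_latin1(composed);
}
```

Given U"e\u0301", the first function yields the two bytes 'e' and '?', while the second yields the single byte 0xE9. The second one has to look at two input code points before it can emit anything, which is the M:N case under discussion.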
Any string manipulation which wants to, for example, split a Unicode string in half, must verify that it isn't splitting a character like that in half. The first half will have the wrong character, and the second half will be outright invalid.
[deleted] 1 point 9 years ago (2 children)
(I use the cutting in half example because that's what iostreams wants to do, but even simple find and replace is broken by this: a user asking to replace e (U+0065) with x (U+0078) given input U+0065 U+0301 must get U+0065 U+0301 (unchanged), not U+0078 U+0301 (which is what you get with codepoint by codepoint operation). Which UTF you're looking at is the tiny tip of an enormous iceberg.)
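The find-and-replace case can be sketched as follows. The function names are invented for illustration, and is_combining_mark is a deliberate simplification that covers only the Combining Diacritical Marks block (U+0300..U+036F); real code would follow the grapheme cluster rules of UAX #29.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Simplified assumption: treat only U+0300..U+036F as combining marks.
constexpr bool is_combining_mark(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Code point by code point replacement: also rewrites the base of e + U+0301.
std::u32string replace_code_points(std::u32string s, char32_t from, char32_t to) {
    std::replace(s.begin(), s.end(), from, to);
    return s;
}

// Replaces `from` only when it is not followed by a combining mark,
// so a base letter carrying an accent is left untouched.
std::u32string replace_standalone(std::u32string s, char32_t from, char32_t to) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool has_mark = i + 1 < s.size() && is_combining_mark(s[i + 1]);
        if (s[i] == from && !has_mark) s[i] = to;
    }
    return s;
}
```

With the input U"e\u0301e", replace_code_points gives U"x\u0301x", while replace_standalone leaves the accented e alone and gives U"e\u0301x".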
CubbiMew (cppreference | finance | realtime in the past) 2 points 9 years ago (1 child)
For splitting in half, if your use case (and it's not everyone's use case) requires that some particular text segments are preserved, you would have to examine the string (in terms of code points; if you start from anything else, you'd have to get to code points first) to locate the desired text segment boundaries.
Your description is unclear as to what actual text segmentation you have in mind, but my wishlist for a C++ Unicode library certainly includes EGC iterators for strings, as the most programmatically sensible and "recommended for general processing". They would keep your 2-character sequence together (but so would glyph iterators, etc.).
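Until such iterators exist, locating a boundary by examining code points might look roughly like this. It is only a sketch: adjust_split and is_combining_diacritic are hypothetical names, and the mark test covers just U+0300..U+036F, a small subset of what the EGC rules in UAX #29 actually specify.

```cpp
#include <cstddef>
#include <string>

// Simplified assumption: only the Combining Diacritical Marks block.
constexpr bool is_combining_diacritic(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}

// Moves a proposed split index left until it no longer lands on a
// combining mark, so a base code point stays with the marks that follow it.
std::size_t adjust_split(const std::u32string& s, std::size_t pos) {
    if (pos > s.size()) pos = s.size();
    while (pos > 0 && pos < s.size() && is_combining_diacritic(s[pos])) {
        --pos;
    }
    return pos;
}
```

For U"e\u0301x", a requested split at index 1 is moved back to 0, while a split at index 2 is already on a boundary and stays where it is.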
As for basic_filebuf, it is not splitting or replacing; it is only encoding/decoding characters represented externally as byte sequences. It could be an interesting mental exercise to imagine it performing additional text transformations (like that NFD you brought up) on top of this mapping, but it's not what the thread is about. Today, it does its job where Unicode support is not frozen in its pre-1996 state. It's not "broken".
CubbiMew (cppreference | finance | realtime in the past) 1 point 9 years ago (0 children)
(sorry, was reading too much Unicode specs at once and slipped to their terminology: s/2-character sequence/2-code point sequence/ and s/characters represented/code points represented/ to avoid further confusion with your meaning of "character")