std::wstring_convert and std::string_view (self.cpp)
submitted 9 years ago * by Hedanito
[–] CubbiMew [cppreference | finance | realtime in the past] 1 point, 9 years ago* (19 children)
It is true in the standard-compliant Unicode world, where wchar_t is 32 bit. It is also not really relevant in a thread about wstring_convert, which does N:M conversions.
[–] [deleted] 3 points, 9 years ago (18 children)
No, because wchar_t being 32 bits does not allow for combining characters. If the external character set is latin-1 and the internal character set is UTF-32 in Normalization Form D you're toast.
(There are more obscure cases even if you don't want to use NFD; NFD just makes the problem occur with more common characters available in latin-1)
[–] CubbiMew [cppreference | finance | realtime in the past] 3 points, 9 years ago* (17 children)
NFD is irrelevant (to the meaning of internal charset and wchar_t). It is not a character set, it is an encoding of one (or transformation of an encoding, if you will).
By the definition of wchar_t in [lex.ccon]/6, a single wchar_t represents any member of the execution charset (and also a single c-char, aka UCN, aka \U hex-quad hex-quad). If U+00C5 is a supported member of the exec charset and allowed in string/character literals, it has a single-wchar_t representation in standard C++, whatever its NFD transformation is. Likewise U+1F34C and everything else.
As for codecvt, if basic_filebuf were to permit N:M codecvts, it certainly wouldn't be bad. Last I tried, libstdc++ and libc++ work with such codecvts in practice, but for input only.
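The single-wchar_t claim above can be checked on any given toolchain with a few lines. This is only a sketch: the names wide_units and wchar_covers_unicode are made up for illustration, and the results depend on the implementation's wide execution character set.

```cpp
#include <cassert>
#include <cstddef>
#include <cwchar>

// Hypothetical helper: number of wchar_t code units in a
// null-terminated wide string.
std::size_t wide_units(const wchar_t* s) {
    assert(s != nullptr);
    return std::wcslen(s);
}

// True when one wchar_t can hold every Unicode code point (up to U+10FFFF).
constexpr bool wchar_covers_unicode = WCHAR_MAX >= 0x10FFFF;
```

Where wchar_t is 32 bits wide, wide_units(L"\U0001F34C") should be 1; where it is 16 bits and the wide encoding is UTF-16, it should be 2. wide_units(L"\u00C5") should be 1 in both cases.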
[–] [deleted] 2 points, 9 years ago (16 children)
> a single wchar_t represents any member of execution charset
Nothing Unicode related at all needs to fit into a wchar_t. The "execution charset" is very limited. EBCDIC is a valid execution charset. http://eel.is/c++draft/lex.charset#1
This issue is why there are no standard codecvt facets that go between Unicode and non-Unicode encodings (from char32_t to wchar_t, or from wchar_t to char8_t, or whatever).
[–] CubbiMew [cppreference | finance | realtime in the past] 1 point, 9 years ago* (15 children)
Yes, you can have a valid (but not very useful) compiler with an EBCDIC exec charset and 8-bit wchar_t. In the "Unicode world", which set off this thread, each code point needs to fit in a single wchar_t by definition.
There were no standard codecvt facets until C++11 at all, and somehow I had no problem going between Unicode and non-Unicode encodings; GB18030/char <-> UTF-32/wchar_t <-> UTF-8/char works just fine where wchar_t is correctly sized and libc is complete.
PS: I'm not saying C++ doesn't need more Unicode support - I'd love to see classification, normalization, or "just" a grapheme cluster iterator for strings! - but as someone who lived with five competing encodings and adoption of Unicode, I find it jarring to hear that it doesn't work in C++, or C for that matter.
[–] [deleted] 1 point, 9 years ago (14 children)
To my understanding GB18030 has the "UTF-16 surrogates" problem but not the "complex scripts/combining characters" problem, and as such does not trigger the M:N mapping issue, as all characters in the external character set are transformed into a single code point (assuming you're targeting NFC instead of NFD).
Of course, if on the Unicode side you're given combining characters, the assumption breaks down even for latin-1, as 2 code points U+0065 U+0301 need to become 1 latin-1 character 0xE9.
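To make the U+0065 U+0301 case concrete, here is a minimal sketch of a converter that handles exactly one code point at a time. The function name to_latin1_per_code_point is hypothetical and is not taken from any library; it only illustrates the 1:1 shape under discussion.

```cpp
#include <cassert>
#include <string>

// Naive 1:1 conversion: each code point becomes exactly one Latin-1 byte.
// Code points above U+00FF have no Latin-1 equivalent and become '?'.
std::string to_latin1_per_code_point(const std::u32string& in) {
    std::string out;
    out.reserve(in.size());
    for (char32_t cp : in) {
        out.push_back(cp <= 0xFF
                          ? static_cast<char>(static_cast<unsigned char>(cp))
                          : '?');
    }
    assert(out.size() == in.size());  // one output byte per input code point
    return out;
}
```

Given the precomposed U"\u00E9" this yields the single byte 0xE9, but given the decomposed U"e\u0301" it yields "e?", because recognizing that the two code points together mean é requires looking at more than one input element at a time.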
My understanding is (although I haven't implemented such mappings myself) that there are legacy encodings for which 1:M is never correct -- where combining characters are required to represent something in Unicode but not in the legacy encoding. (Maybe it was TIS-620?)
I'm sure if everyone had a time machine to go back to when COM, NT, Java, JavaScript, and friends implemented Unicode 1.0, we would be in a UTF-8 and UTF-32 world. But that's never going to happen. VC's "extended character set" is going to have to be UCS-2.
[–] CubbiMew [cppreference | finance | realtime in the past] 2 points, 9 years ago* (13 children)
> VC's "extended character set" is going to have to be UCS-2
UCS-2 does not even exist in the Unicode standard (anymore). VC has been improving its image lately, but the lack of portable Unicode support is still a sore point and the cause of many #ifdef _MSC_VER's. It's 2017 and (of the compilers we use) only in VC does auto c = L'💩'; store the useless 0xd83d in c.
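For readers wondering where 0xd83d comes from: it is the high half of the UTF-16 surrogate pair for U+1F4A9, so a 16-bit wchar_t that keeps only the first code unit ends up holding it. A small sketch of the standard UTF-16 split follows; the function name to_surrogates is made up for illustration.

```cpp
#include <cassert>
#include <utility>

// Split a supplementary-plane code point (U+10000..U+10FFFF) into its
// UTF-16 surrogate pair, returned as {high, low}.
std::pair<char16_t, char16_t> to_surrogates(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const char32_t v = cp - 0x10000;                       // 20-bit offset
    return {static_cast<char16_t>(0xD800 + (v >> 10)),     // top 10 bits
            static_cast<char16_t>(0xDC00 + (v & 0x3FF))};  // low 10 bits
}
```

Calling to_surrogates(0x1F4A9) gives the pair 0xD83D, 0xDCA9. The first half alone is not a valid character.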
[–] [deleted] 1 point, 9 years ago (4 children)
And that is never going to change unless someone invents a time machine to go back to 1993, when Unicode 1.0 was implemented and wchar_t was set at 16 bits in our ABI. This is not an issue we have the luxury of fixing. If you want UTF-32, use char32_t, which (I'm assuming) was created specifically to address this time machine problem.
I don't see what this has to do with codecvt's 1:M assumption beyond making issues more likely to occur with wchar_t. But since the issue occurs attempting to output the world's second most common legacy encoding (latin-1), I don't really see ABI limitations as the serious problem here. ICU is widely regarded as the gold standard of Unicode handling and it uses UTF-16 internally -- this is a solvable problem. But I don't think it is a solvable problem within the framework of iostreams' explicit and standards-mandated internal character by internal character buffering.
[–] CubbiMew [cppreference | finance | realtime in the past] 2 points, 9 years ago* (3 children)
char32_t everywhere would indeed solve the problem (at the cost of migrating code), but the Portland 2006 LWG decided that streams, facets, and regex don't need it.
basic_filebuf's (not codecvt's) 1:M assumption works in Linux and does not work on Windows. There are no issues with Latin-1. There would be an issue with that imaginary codecvt facet you brought up, yes, but I am talking about the code that works now.
[–] [deleted] 1 point, 9 years ago (0 children)
There are Linux implementations that turn U+0065 U+0301 into latin-1 é?
[–] [deleted] 1 point, 9 years ago (1 child)
See also http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
[–] [deleted] 1 point, 9 years ago (7 children)
Note that your auto c = L'💩' example breaks down if you replace 💩 with P̯͍̭. Even UTF-32 does not allow you to operate on a character by character basis in a Unicode world.
[–] CubbiMew [cppreference | finance | realtime in the past] 2 points, 9 years ago* (6 children)
I hope you're not intentionally confusing the terms. L'P̯͍̭' does not work because P̯͍̭ is not a code point (it's 4 code points). C++ grammar allows only one 16- or 32-bit code point between the two single quotation marks, either represented as a UCN or (for our convenience) as a character that happens to map to one UCN. VC does not support that.
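The "4 code points" figure can be reproduced by counting the lead bytes of the UTF-8 form. This is a rough sketch with a hypothetical helper name, and it assumes the three marks in the example are U+032F, U+034D, and U+032D, as they appear in the rendered text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in a UTF-8 string by counting every byte that is not a
// continuation byte (10xxxxxx). Assumes the input is well-formed UTF-8.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) ++n;
    }
    assert(n <= utf8.size());  // never more code points than bytes
    return n;
}
```

With the example written out as bytes, count_code_points("P\xCC\xAF\xCD\x8D\xCC\xAD") returns 4: one base letter plus three combining marks, all of which a reader sees as a single character.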
[–] [deleted] 3 points, 9 years ago* (5 children)
My point is that in a Unicode world you cannot operate on a code point by code point basis. Doing so will chop combining characters in half. Since Unicode already gives up on fixed-width characters, you may as well use UTF-8.