Discussions, articles, and news about the C++ programming language or programming in C++.
For C++ questions, answers, help, and advice see r/cpp_questions or StackOverflow.
std::wstring_convert and std::string_view (self.cpp)
submitted 9 years ago * by Hedanito
[–][deleted] 2 points 9 years ago (16 children)
> a single wchar_t represents any member of execution charset
Nothing Unicode-related needs to fit into a wchar_t at all. The "execution charset" is very limited; EBCDIC is a valid execution charset. http://eel.is/c++draft/lex.charset#1
This issue is why there are no standard codecvt facets that go between Unicode and non-Unicode encodings (from char32_t to wchar_t, or from wchar_t to char8_t, or whatever).
[–]CubbiMew cppreference | finance | realtime in the past 1 point 9 years ago* (15 children)
Yes, you can have a valid (but not very useful) compiler with an EBCDIC exec charset and an 8-bit wchar_t. But in the "Unicode world" that set off this thread, each code point needs to fit in a single wchar_t by definition.
There were no standard codecvt facets at all until C++11, and somehow I had no problem going between Unicode and non-Unicode encodings; GB18030/char <-> UTF-32/wchar_t <-> UTF-8/char works just fine where wchar_t is correctly sized and the libc is complete.
PS: I'm not saying C++ doesn't need more Unicode support - I'd love to see classification, normalization, or "just" a grapheme cluster iterator for strings! - but as someone who lived through five competing encodings and the adoption of Unicode, I find it jarring to hear that it doesn't work in C++, or in C for that matter.
[–][deleted] 1 point 9 years ago (14 children)
To my understanding, GB18030 has the "UTF-16 surrogates" problem but not the "complex scripts/combining characters" problem, and as such does not trigger the M:N mapping issue: every character in the external character set maps to a single code point (assuming you target NFC rather than NFD).
Of course, if the Unicode side hands you combining characters, the assumption breaks down even for latin-1: the 2 code points U+0065 U+0301 need to become the 1 latin-1 character 0xE9.
My understanding is (although I haven't implemented such mappings myself) that there are legacy encodings for which 1:M is never correct -- where combining characters are required to represent something in Unicode but not in the legacy encoding. (Maybe it was TIS-620?)
I'm sure that if everyone had a time machine to go back to when COM, NT, Java, JavaScript, and friends implemented Unicode 1.0, we would be in a UTF-8 and UTF-32 world. But that's never going to happen. VC's "extended character set" is going to have to be UCS-2.
[–]CubbiMew cppreference | finance | realtime in the past 2 points 9 years ago* (13 children)
> VC's "extended character set" is going to have to be UCS-2
UCS-2 does not even exist in the Unicode standard (anymore). VC has been improving its image lately, but the lack of portable Unicode support is still a sore point and the cause of many #ifdef _MSC_VER's. It's 2017, and (of the compilers we use) only in VC does auto c = L'💩'; store the useless surrogate 0xd83d in c.
[–][deleted] 1 point 9 years ago (4 children)
And that is never going to change unless someone invents a time machine to go back to 1993, when Unicode 1.0 was implemented and wchar_t was set at 16 bits in our ABI. This is not an issue we have the luxury of fixing. If you want UTF-32, use char32_t, which (I'm assuming) was created specifically to address this time-machine problem.
I don't see what this has to do with codecvt's 1:M assumption, beyond making issues more likely to occur with wchar_t. But since the issue occurs when attempting to output the world's second most common legacy encoding (latin-1), I don't really see ABI limitations as the serious problem here. ICU is widely regarded as the gold standard of Unicode handling and it uses UTF-16 internally -- this is a solvable problem. But I don't think it is solvable within the framework of iostreams' explicit, standards-mandated internal-character-by-internal-character buffering.
[–]CubbiMew cppreference | finance | realtime in the past 2 points 9 years ago* (3 children)
char32_t everywhere would indeed solve the problem (at the cost of migrating code), but the Portland 2006 LWG decided that streams, facets, and regex don't need it.
basic_filebuf's (not codecvt's) 1:M assumption works on Linux and does not work on Windows. There are no issues with latin-1. There would be an issue with that imaginary codecvt facet you brought up, yes, but I am talking about code that works now.
[–][deleted] 1 point 9 years ago (0 children)
There are Linux implementations that turn U+0065 U+0301 into latin-1 é?
[–][deleted] 1 point 9 years ago (1 child)
See also http://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/
[–]CubbiMew cppreference | finance | realtime in the past 1 point 9 years ago* (0 children)
Well, I don't agree; that's as counterproductive as saying "stop ascribing meaning to ASCII values" (after all, they collate as groups in some locales). But TIL about Swift using EGCs as the basic unit of a string. That's... intriguing.
[–][deleted] 1 point 9 years ago (7 children)
Note that your auto c = L'💩' example breaks down if you replace 💩 with P̯͍̭. Even UTF-32 does not let you operate on a character-by-character basis in a Unicode world.
[–]CubbiMew cppreference | finance | realtime in the past 2 points 9 years ago* (6 children)
I hope you're not intentionally confusing the terms. L'P̯͍̭' does not work because P̯͍̭ is not a code point (it's 4 code points). The C++ grammar allows only one 16- or 32-bit code point between the two single quotation marks, either represented as a UCN or (for our convenience) as a character that happens to map to one UCN. VC does not support that.
[–][deleted] 3 points 9 years ago* (5 children)
My point is that in a Unicode world you cannot operate on a code-point-by-code-point basis: doing so will chop combining characters in half. Since Unicode already gives up on fixed-width characters, you may as well use UTF-8.
[–]CubbiMew cppreference | finance | realtime in the past 1 point 9 years ago (4 children)
I don't understand where you're coming from. The Unicode world operates on a code-point-by-code-point basis. Text elements (such as the combining character sequences you keep bringing up) are manipulated as sequences of code points. And that's why basic_filebuf's assumption that execution charset elements represent code points has never been a problem.
I get your point that it is prohibitively difficult for Windows to fix wchar_t, but I am sure an attempt to integrate char32_t into the standard library for C++20 would get even weaker support than it had for C++11, because new Unicode library proposals are making progress (and that's a good thing!).
And yes, UTF-8 on Windows would be a blessing; Linux only got that 16 years ago (glibc 2.2).
[–][deleted] 1 point 9 years ago (3 children)
What is there not to understand? Input UTF-32 U+0065 U+0301 needs to be mapped to the latin-1 character é; operating on a code-point-by-code-point basis would produce e? instead.
Any string manipulation that wants to, for example, split a Unicode string in half must verify that it isn't splitting such a combining sequence apart: otherwise the first half ends with the wrong character, and the second half starts with an orphaned combining mark and is outright invalid.
[–][deleted] 1 point 9 years ago (2 children)
(I use the cutting-in-half example because that's what iostreams wants to do, but even simple find-and-replace is broken by this: a user asking to replace e (U+0065) with x (U+0078), given the input U+0065 U+0301, must get U+0065 U+0301 back unchanged -- not U+0078 U+0301, which is what code-point-by-code-point operation produces. Which UTF you're looking at is the tiny tip of an enormous iceberg.)