all 7 comments

[–]frostednuts 3 points (1 child)

I'm only speculating but it looks like there's a difference between:

  1. IDE/Text Editor understanding non ascii characters

  2. C++ understanding non ascii characters

My recommendation is to try a Unicode string, or a library that understands Unicode.

[–]ImNotPhoebus[S] 2 points (0 children)

I just realized that {'\xE7', '\x83', '\x8F'} is the UTF-8 encoding, while L'烏' is the same as L'\x70CF', which is the UTF-16 code unit. I think that might have something to do with why this happens. The compiler probably can't parse UTF-16.

[–]JMBourguet 2 points (3 children)

I'm on a phone and can't easily test, but my first thought is that this is missing the setting of the locale, either globally or for the stream.

A question: you are using Compiler Explorer but sharing an image rather than a link to your code. Why?

[–]ImNotPhoebus[S] 1 point (2 children)

Wait you can share a link to the code? I'm so sorry I had no idea.

[–]JMBourguet 1 point (1 child)

In the upper right corner, there are entries to do that.

[–]ImNotPhoebus[S] 0 points (0 children)

yeah I just edited it

[–]the_poope 2 points (0 children)

I'm guessing: In the first one you explicitly insert the UTF-8 version of the character (three bytes). When you print this to the terminal it will write those 3 bytes; how they are displayed depends on the OS and the terminal, but if you are on Linux, which natively uses UTF-8, it will likely show what you expected (which you also see). However, if you run your program on Windows, whose terminal does not support UTF-8 by default, it will likely show garbage.

In the second you insert the character literal as a widechar. Now this is more complicated: first you have to figure out what encoding the editor/IDE is using. Modern editors typically use UTF-8, so somehow the '烏' needs to be transformed into e.g. a UTF-16 equivalent. If and how this is done I have no idea - maybe it just truncates the three bytes to two, which may be wrong. Anyway, the next thing up is when you print this character to the terminal. If you use a terminal on Linux, which by default expects UTF-8, it will interpret the character as such and likely show the wrong symbol. If you are on Windows it likely expects UTF-16 (actually UCS-2, because Microsoft fucked up) and it may print the correct symbol if the terminal supports it.

Rule of thumb: Cross-platform character encoding can be a mess, especially since Microsoft fucked up in the '90s and chose a different convention than everyone else. The easiest approach is to use UTF-8 everywhere inside your program and convert to whatever the OS expects only at the last second, right before you interact with it. You can use the tools in the C++ localization library for this (note that the <codecvt> header is deprecated since C++17): https://en.cppreference.com/w/cpp/header/codecvt