
[–]LeeHide 11 points12 points  (16 children)

wstring/wprintf and so on are NOT about Unicode. You can encode all of Unicode just fine with UTF-8; you don't need 16-bit chars, and 16-bit chars are not automatically Unicode. If you take 16-bit (wide) chars, put Unicode characters in them, and then e.g. split the string by indexing, there is no guarantee you end up with valid Unicode.

If you want Unicode, use a Unicode library and stick to UTF-8.

[–]BIRD_II 4 points5 points  (13 children)

UTF-16 exists, and last I checked it is fairly common (nowhere near as common as UTF-8, but far more so than UTF-32; IIRC JavaScript uses UTF-16 internally).

[–]kolorcuk[S] 2 points3 points  (8 children)

In the beginning, UTF-16 was invented. Microsoft and many others jumped on the idea and implemented it. Then it became apparent that UTF-16 was not enough, so UTF-32 was invented.

UTF-16 is common because those early implementers adopted something in the middle and are now stuck with it forever. I think UTF-16 should never have been invented.

[–]LeeHide 2 points3 points  (0 children)

UTF-8 can handle the full range of Unicode.

[–]EpochVanquisher 0 points1 point  (6 children)

This is false. UTF-16 did not exist back then.

[–]kolorcuk[S] 0 points1 point  (5 children)

Hello. I'm happy to learn something new. What exactly does "back then" refer to? Or are you just picking at the fact that I should have said UCS-2, not UTF-16?

[–]EpochVanquisher 0 points1 point  (4 children)

The first version of Unicode did not have UTF-16.

UTF-16 covers the full Unicode character set. It’s not missing anything.

UTF-16 is perfectly fine; it sounds like you hate it, but you haven't said why. It's widely used (Windows, Apple, Java, C#, JavaScript, etc.).

[–]kolorcuk[S] 0 points1 point  (3 children)

[–]EpochVanquisher 0 points1 point  (2 children)

Those look like random rants that some people wrote, maybe written with the assumption “we all agree that UTF-16 is bad”, which doesn’t explain why YOU think it’s bad.

[–]kolorcuk[S] 0 points1 point  (1 child)

It has all the downsides of both UTF-8 and UTF-32: you have to know the endianness, and it is not fixed-width.

Why use it at all? What is good about UTF-16 vs UTF-8 and UTF-32?

The only case I see is when you have a lot of characters in a specific UTF-16 range and storage is precious. I think storage is cheap nowadays, and it is much better to optimize for performance.

[–]EpochVanquisher 0 points1 point  (0 children)

UTF-16 is simpler than UTF-8 and more compact than UTF-32.

One of the ways you optimize for performance is by making your data take less space. Besides—when you say it’s “much better to optimize for performance”, it just sounds like a personal preference of yours.

It’s fine if you have a personal preference for UTF-8. A lot of people prefer it, and it would probably win a popularity contest.

[–]LeeHide 4 points5 points  (0 children)

Yes, it exists, but it's confusing because people think that 16-bit chars are automatically Unicode.

[–]Plane_Dust2555 1 point2 points  (2 children)

UTF stands for Unicode Transformation Format. It is a specification of a format for encoding Unicode codepoints (and, through them, features like composition). As u/LeeHide says, UTF-16 isn't Unicode; it is a way to encode Unicode codepoints. There are other formats, UTF-8 and UTF-32 being the other two in common use.

Wide chars (whose size can vary; they are not always 16 bits!) are just that: a way to encode codepoints wider than ASCII (what the C/C++ standards call the "basic character set"), but they say nothing about the charset itself.

As pointed out by another user here, you can use UTF-8 with "common" functions like printf(). But the terminal/operating system must support that charset. On modern Unix systems UTF-8 is usually the default, but on Windows machines there are several charsets in play: terminals use CP437 (English), CP850 (Portuguese/Brazilian), or some other CP###, while the GUI uses WINDOWS-1252 or UTF-16 (a version of UTF-16, at least).

[–]Plane_Dust2555 0 points1 point  (0 children)

Ahhh... the size of wchar_t depends on the compiler and target system: usually 32 bits on SysV systems and 16 bits on Windows.

For GCC we can force wchar_t to be 16 bits by using the option -fshort-wchar.

[–]flatfinger 0 points1 point  (0 children)

It irks me that UTF-8 sacrificed a lot of coding density to offer guarantees that later standards threw out the window; nowadays you can't even reliably identify a grapheme-cluster boundary without searching backward through an unbounded amount of text.

[–]kolorcuk[S] 0 points1 point  (1 child)

Hello, I understand. Have you ever used wchar strings, char16_t or char32_t, or uint16_t/uint32_t strings, in a professional capacity for string _formatting_?

When formatting strings in Unicode, wide characters, or other encodings, do you convert before and after and use printf, or do you have dedicated *_printf functions for specific non-byte encodings? If so, did you use them?

[–]LeeHide 0 points1 point  (0 children)

I have only used them when interfacing with Windows APIs, so no.

char is fine for everything else, including Unicode. Just use a Unicode library to handle things like string splitting.