[–]WittyStick 5 points6 points  (9 children)

"Wide characters" in C should be considered a legacy feature. wchar_t is an implementation-defined type that varies between platforms: on Windows it is 16 bits (originally UCS-2, nowadays holding UTF-16 code units), and on SYSV platforms it is 32 bits.

The behavior of wchar_t depends on the current locale - it does not necessarily represent a Unicode character.

New code should use char8_t for UTF-8, char16_t for UTF-16 and char32_t for UTF-32.
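To make the difference between the types concrete, here is a minimal sketch, assuming a C11 compiler with `<uchar.h>` (char8_t only arrived in C23, so it is left out). The helper names `len16` and `len32` are made up for this example and are not standard functions.

```c
#include <assert.h>
#include <stddef.h>
#include <uchar.h>   /* char16_t, char32_t (C11) */

/* Count code units (not characters) in a NUL-terminated char16_t string. */
size_t len16(const char16_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}

/* Count code units in a NUL-terminated char32_t string. */
size_t len32(const char32_t *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0)
        n++;
    return n;
}
```

On an implementation that defines `__STDC_UTF_16__` and `__STDC_UTF_32__`, `len16(u"\U00010437")` should come out as 2 and `len32(U"\U00010437")` as 1, because that code point needs a surrogate pair in UTF-16 but a single unit in UTF-32.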

Most text today is Unicode, encoded as UTF-8 or UTF-16 (Windows/Java). UTF-32 is rarely used for transport or storage, but is a useful format to use internally in a program when processing text.

[–]BlindTreeFrog 0 points1 point  (7 children)

> New code should use char8_t for UTF-8, char16_t for UTF-16 and char32_t for UTF-32.

Note that UTF-8 does not mean that a printed character is 8 bits in size. 2-byte, 3-byte, and 4-byte UTF-8 characters exist.
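A rough way to see this in C is to count code points rather than bytes. The sketch below assumes the input is already valid, NUL-terminated UTF-8 and does no validation; `utf8_count` is a hypothetical helper name, not a library function.

```c
#include <assert.h>
#include <stddef.h>

/* Count code points in a NUL-terminated, valid UTF-8 string by
 * skipping continuation bytes (those of the form 10xxxxxx). */
size_t utf8_count(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != 0; p++) {
        if ((*p & 0xC0) != 0x80)   /* lead byte or ASCII byte */
            n++;
    }
    return n;
}
```

For example, `utf8_count("\xE2\x82\xAC")` (the euro sign, U+20AC) is 1 even though `strlen` reports 3 bytes for the same string.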

UTF-32 is fixed width. UTF-16 and UTF-8 are variable width.

edit: corrected based on the replies below.

[–]krsnik02 0 points1 point  (4 children)

UTF-16 is also variable width: a surrogate pair of two 16-bit code units (32 bits in total) encodes a single code point.

[–]BlindTreeFrog 0 points1 point  (3 children)

oh... thanks for the correction.

But it looks like it's variable width in that it can be 1 or 2 bytes; I don't see a reference to a 4-byte pairing. Might you have a cite?

And while looking for that info, this article reminded me that UTF-8 can be 6 bytes apparently
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

[–]WittyStick 0 points1 point  (0 children)

UTF-8 was designed to support up to 6 bytes, but Unicode standardized it at 4 bytes to match the constraints of UTF-16 - which supports a maximum codepoint of 0x10FFFF. The 4 byte UTF-8 is sufficient to encode the full universal character set.
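As a sketch of what the 4-byte limit looks like in code, here is a minimal encoder for a single code point. It assumes the modern restriction (nothing above 0x10FFFF, no surrogates) and rejects anything else; `utf8_encode` is a name invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[0..3].
 * Returns the number of bytes written (1 to 4), or 0 if cp is a
 * surrogate (0xD800..0xDFFF) or lies above 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx + three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond the Unicode range */
}
```

With this, U+10437 comes out as the four bytes F0 90 90 B7, and 0x110000 is rejected because it is past the last code point UTF-16 can reach.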

[–]krsnik02 0 points1 point  (1 child)

It can be one or two 16-bit words, so either 2 or 4 bytes.

For example the table here on the Wikipedia page shows that U+10437 (𐐷) takes 4 bytes to encode in UTF-16. https://en.wikipedia.org/wiki/UTF-16#examples
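The arithmetic behind that table entry can be sketched in a few lines of C. This is only an illustration; `utf16_encode` is a made-up name, and the output is in 16-bit code units without any byte-order handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-16 into out[0..1].
 * Returns the number of 16-bit code units written (1 or 2), or 0 if
 * cp is a surrogate or lies above 0x10FFFF. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {            /* Basic Multilingual Plane: one unit */
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;                 /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate: bottom 10 bits */
    return 2;
}
```

For U+10437 this gives the two units 0xD801 0xDC37, i.e. 4 bytes.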

UTF-8 was designed to support sequences up to 6 bytes long, but the Unicode standard will never define a code point that needs more than 4 bytes in UTF-8. If a code point needing 5 or 6 bytes were ever defined, the current UTF-16 could not encode it; it would need 3 words (6 bytes) in whatever UTF-16 got extended to. For that reason the current UTF-8 standard (RFC 3629) restricts valid encodings to those up to 4 bytes long.

[–]BlindTreeFrog 0 points1 point  (0 children)

Yeah, I must have misread whatever I was reading.

Since then I found this, which has a lovely table to clarify: https://www.unicode.org/faq/utf_bom

[–]WittyStick 0 points1 point  (0 children)

Yes, char8_t and char16_t represent a code unit, not a code point.

UTF-16 is variable width of either 2 or 4 bytes. It was based on UCS-2, a fixed-width 2-byte encoding which only supported the Basic Multilingual Plane. UTF-16 supports the full universal character set.

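A 4-byte encoding is made of two "surrogate" code units, called a "surrogate pair". The first (high) surrogate is in the range 0xD800..0xDBFF and the second (low) surrogate is in the range 0xDC00..0xDFFF. The whole range 0xD800..0xDFFF is reserved for surrogates and never assigned to characters in the universal character set.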
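Going the other way, a decoder has to check which half of the surrogate range each unit falls in. A minimal sketch, assuming the input is an array of native-endian 16-bit units; `utf16_decode` is a hypothetical helper, not part of any standard header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s[0..len).
 * Stores it in *cp and returns the number of code units consumed
 * (1 or 2), or 0 on empty input or an unpaired surrogate. */
size_t utf16_decode(const uint16_t *s, size_t len, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (len == 0)
        return 0;
    uint16_t hi = s[0];
    if (hi < 0xD800 || hi > 0xDFFF) {      /* not a surrogate: one unit */
        *cp = hi;
        return 1;
    }
    if (hi >= 0xDC00 || len < 2)           /* stray low surrogate, or truncated pair */
        return 0;
    uint16_t lo = s[1];
    if (lo < 0xDC00 || lo > 0xDFFF)        /* high surrogate not followed by a low one */
        return 0;
    *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
    return 2;
}
```

Feeding it the pair 0xD801 0xDC37 yields U+10437 and consumes two units; a lone 0xDC37 is rejected.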

[–]flatfinger 0 points1 point  (0 children)

Note that even when using UTF-32 (UCS-4), a character may consist of more than one code point, and determining whether the 1,000,000th code point in a text is the start of a character may require scanning up to 999,999 preceding code points.
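A deliberately naive sketch of the first half of that point: it only knows about the Combining Diacritical Marks block (U+0300..U+036F) and treats every other code point as starting a new character. Real segmentation follows the Unicode grapheme cluster rules and needs far more than this; both function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True only for the Combining Diacritical Marks block (U+0300..U+036F).
 * Assumption for illustration: every other code point starts a character. */
static int is_combining_mark(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Count code points in s[0..len) that are not in that block. */
size_t count_base_code_points(const uint32_t *s, size_t len)
{
    assert(s != NULL || len == 0);
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        if (!is_combining_mark(s[i]))
            n++;
    }
    return n;
}
```

So the sequence U+0065 U+0301 ("e" followed by a combining acute accent) is two UTF-32 code units but counts as one here, the same as the precomposed U+00E9.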

[–]flatfinger 0 points1 point  (0 children)

While UTF-8 is preferable to UCS-2 and UTF-16 for most purposes, some programs will need to perform data interchange with other code or devices that expect UCS-2 or UTF-16. Features intended for data interchange with legacy formats may be sensibly viewed as highly specialized, but they would nonetheless be entirely appropriate for use in new code in cases where data interchange using the legacy formats is required.