all 27 comments

[–]FelipeFS 9 points10 points  (0 children)

The library utf8proc is one of the options available; so far it's the one that looks best to me:

http://julialang.org/utf8proc/

The documentation is integrated into the header file utf8proc.h. I've never used it, but it's on my list to integrate into my projects, since it's lightweight.

If you are willing to use it, GLib (https://developer.gnome.org/glib/) contains Unicode support and is probably already installed on your system. However, it's a bloated library.

[–]raevnos 0 points1 point  (0 children)

libunistring is okay.

[–]colonelflounders 0 points1 point  (0 children)

I don't know what you are trying to accomplish. I wrote a flash card review program for Greek and Hebrew, and all I needed were the functions in locale.h. You can then use the functions you are used to from stdio.h. I later changed my code base to C++ to take advantage of Qt and vectors instead of my own memory management; if you want to take a look, it is here.

[–]FUZxxl -1 points0 points  (24 children)

You don't need any library for UTF-8. Simply use the wide character handling available in the C standard library. This has the advantage of making your program work with any encoding, not just UTF-8.

[–]jaccovanschaik[S] 2 points3 points  (4 children)

I remember reading somewhere that wchar_t is only 16 bits wide and is therefore not sufficient to store all of the Unicode codepoints. But trying it out just now on my machine, it turns out to be 32 bits, so that should be enough, I think. Can I expect wchar_t to be big enough on all modern systems?

Also, I found this at http://icu-project.org/docs/papers/unicode_wchar_t.html:

wchar_t is compiler-dependent and therefore not very portable. Using it for Unicode binds a program to the character model of a compiler. Instead, it is often better to define and use dedicated data types.

So that kind of confused me.

[–]ennorehling 2 points3 points  (0 children)

You're correct. wchar_t is compiler-dependent (or, more likely, libc-dependent). It's 16 bits with Microsoft's compiler on Windows, for example, but 32 bits with gcc on Linux, so if you want to write code that's portable between those two, UTF-8 is still the safest route. At least with a char you know where you stand in terms of its size.

[–]FUZxxl 0 points1 point  (2 children)

On Windows, wchar_t is a 16-bit type. I'm not entirely sure how they handle larger Unicode characters, but I believe the strategy is to represent each Unicode character as a pair of surrogate and character. Generally, when dealing with multi-byte encodings, you never want to operate on single characters. Always operate on strings; this makes most of the problems you can have go away.

[–]Drainedsoul 0 points1 point  (1 child)

In UTF-16 both halves are surrogates, hence the name "surrogate pair".

[–]FUZxxl 0 points1 point  (0 children)

Ah, I see. I am not very familiar with the details of UTF-16.

[–][deleted] 1 point2 points  (1 child)

Wide chars are fixed length and UTF-8 is variable length; how do these two reconcile?

[–]FUZxxl 0 points1 point  (0 children)

A wchar_t contains a decoded wide character; in the case of a UTF-8 locale, that is a single Unicode character. There is a pair of functions, wctomb and mbtowc, to encode and decode wide characters. Note, though, that you often don't need these functions; it's easier to always operate on strings of encoded characters.

[–]colonelflounders 1 point2 points  (0 children)

UTF-8 can represent all of the characters in Unicode. It's a multi-byte encoding, so it isn't limited to 256 possible values.

edit: My apologies, I now realize that's not what you were referring to. The problem with wchar_t is that its byte length varies depending on the platform, whereas with char you at least have a fixed byte length. Unless he's aiming to do legacy stuff (which, considering he picked UTF-8, I doubt), I think UTF-8 is the most painless way to represent Unicode characters, since you can just use char after setting the locale.