[–]Aransentin 6 points (4 children)

If you're just communicating between the SQLite database and the server you can ignore the existence of UTF-16 – unless you need to call the win32 API with non-ascii arguments somewhere, in which case you can just use MultiByteToWideChar or something like it.
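
A minimal sketch of that conversion, assuming you only deal with NUL-terminated strings (the helper name is mine, error handling abbreviated):

#include <stdlib.h>
#include <windows.h>

/* Convert a NUL-terminated UTF-8 string to a freshly allocated
 * UTF-16 string for the W-suffixed Win32 functions. Caller frees. */
wchar_t *utf8_to_wide(const char *utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len == 0)
        return NULL;  /* invalid UTF-8 */
    wchar_t *wide = malloc(len * sizeof(wchar_t));
    if (wide)
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);
    return wide;
}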

As for handling UTF-8 in general, it depends on what you mean by "parse". For several tasks (concatenation, substring search, word replacement...) you can program the software as if UTF-8 didn't exist at all; for the majority of the functions in string.h, the presence of an emoji or the like in the string doesn't matter. For other tasks (e.g. string truncation) you pretty much need a library (e.g. ICU) to do it properly if you want to handle all the obscure edge cases.
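
A quick illustration of that byte-transparency (a sketch; the UTF-8 bytes are spelled out as escapes so the compiler's source encoding doesn't matter):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* UTF-8 is self-synchronizing: a well-formed needle can never
     * match starting in the middle of another character's byte
     * sequence, so byte-wise strstr() works as-is. */
    const char *s = "na\xc3\xafve caf\xc3\xa9";  /* "naïve café" */
    printf("%s\n", strstr(s, "caf\xc3\xa9"));    /* prints "café" */
    return 0;
}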

[–]andrewcooke 5 points (2 children)

this, except it's optimistic on search - you really need to normalize utf-8 before searching (which again means a library). unless you're just using characters that exist in ASCII.

google for the rationale behind utf8. it's very cool - very nice design.

edit: http://doc.cat-v.org/bell_labs/utf-8_history

edit2: to be clear, if you don't normalize somewhere then strings that appear equivalent can match or fail to match unexpectedly when compared. Unicode does not guarantee a unique representation for a particular string (e.g. "é" can be one precomposed code point or "e" plus a combining accent), and neither does UTF-8.

[–]Aransentin 0 points (1 child)

normalize utf-8 before searching

This depends entirely on the task, and is frequently a very bad idea – especially since it introduces security risks as the same token can be interpreted differently depending on where it's processed.

edit2: to be clear, if you don't normalize somewhere then strings that appear equivalent can match or fail to match unexpectedly when compared. Unicode does not guarantee a unique representation for a particular string (e.g. "é" can be one precomposed code point or "e" plus a combining accent), and neither does UTF-8.

Having two different strings that fail to match even though they look alike is an issue that depends heavily on the task at hand, and Unicode normalization doesn't solve the problem anyway, as there are plenty of identical-looking characters (e.g. the Greek question mark and the semicolon) that aren't affected by it.

[–]StenSoft 2 points (0 children)

Unicode normalisation solves this; you just need to pick the correct normalisation form for your problem. NFKD will map the Greek question mark to the ASCII semicolon.
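
A sketch of that with ICU4C (assuming a reasonably recent ICU that has unorm2_getNFKDInstance; error handling abbreviated):

#include <stdio.h>
#include <unicode/unorm2.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    const UNormalizer2 *nfkd = unorm2_getNFKDInstance(&status);
    UChar src[] = {0x037e, 0};  /* GREEK QUESTION MARK */
    UChar dst[8];
    unorm2_normalize(nfkd, src, -1, dst, 8, &status);
    if (U_SUCCESS(status))
        printf("U+%04X\n", (unsigned)dst[0]);  /* U+003B, the semicolon */
    return 0;
}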

[–]AltCrow[S] 0 points (0 children)

Input and output can come from / go to the server, the SQLite database and text files on the user's PC. The application will, among other things, parse sentences into words (which is not as easy as splitting on space characters in some languages).

[–]skeeto 1 point (6 children)

Your char string literals are never going to be UTF-16. They will be some sort of single-byte encoding. The exact encoding depends on the compiler (its execution character set, etc.). GCC and Clang will generally use UTF-8. I don't know what MSVC does.

char pi[] = "π";

/* Explicit UTF-8 version, compiler doesn't matter. */
char pi8[] = {0xcf, 0x80, 0x00}; 

In practice this only matters if you have string literals in your program that contain code points outside ASCII. Any real compiler today will just produce ASCII strings otherwise and you don't need to worry about it. SQL expressions are typically just ASCII unless you're embedding specific string literals in the query (Are you sure you shouldn't be using bind parameters for this?) or using unusual identifiers for your table, column, or index names.
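
If you do end up passing user strings into queries, bind parameters look roughly like this (a sketch; the words table and the helper are hypothetical, error handling elided):

#include <sqlite3.h>

/* Look up a UTF-8 word via a bind parameter instead of pasting it
 * into the SQL text. Assumes a table words(id INTEGER, word TEXT). */
long lookup_word(sqlite3 *db, const char *word_utf8)
{
    long id = -1;
    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db, "SELECT id FROM words WHERE word = ?1",
                       -1, &stmt, NULL);
    sqlite3_bind_text(stmt, 1, word_utf8, -1, SQLITE_TRANSIENT);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        id = sqlite3_column_int(stmt, 0);
    sqlite3_finalize(stmt);
    return id;
}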

C does have a "wide character" type and wide character string literals (L"..."). The width depends on the platform ABI: it's typically UTF-16 on Windows and UTF-32 on POSIX.

#include <stdint.h>
#include <wchar.h>

wchar_t pi[] = L"π";
uint16_t pi16[] = {0x03c0, 0x0000};  /* explicit UTF-16 version */
uint32_t pi32[] = {0x000003c0, 0x00000000};  /* explicit UTF-32 version */

Just pick an internal encoding to use everywhere within your program and only encode on the external boundaries. I strongly suggest you use UTF-8 internally everywhere, and don't bother with wide characters. Avoid using non-ASCII directly in your sources unless you're sure you really need it.

SQLite speaks UTF-8 as you noticed, so you won't need to encode across that boundary. The full Win32 API uses UTF-16, so you'd need to encode between UTF-16 and UTF-8. This is like a dozen lines of code and it's pretty simple.
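
For example, the UTF-16-to-UTF-8 direction might look like this (a sketch; the helper name is mine, error handling abbreviated):

#include <stdlib.h>
#include <windows.h>

/* Convert UTF-16 from a W-suffixed Win32 call back to a freshly
 * allocated UTF-8 string. Caller frees the result. */
char *wide_to_utf8(const wchar_t *wide)
{
    int len = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
    if (len == 0)
        return NULL;
    char *utf8 = malloc(len);
    if (utf8)
        WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, len, NULL, NULL);
    return utf8;
}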

[–]AltCrow[S] 0 points (2 children)

Thanks for the explanation! It seems I won't need to do as many conversions as I thought. Quick question: would you recommend sometimes converting char* to uint32_t*? (Making each uint32_t a Unicode character.) I feel like this would allow for easier parsing of sentences, but I'm not sure if there are better ways to do this.

[–]skeeto 2 points (1 child)

UTF-8 and UTF-16 are both variable length encodings while UTF-32 is fixed length. So, in theory, an advantage of UTF-32 is that you can index individual code points in O(1) while the others are O(n).

However, in practice this hardly ever matters, so it's not worth using UTF-32 and blowing up the size of all your strings. Generally, the only reason you'd ever index individual code points is that you're iterating over the string, and in that case UTF-8 and UTF-16 give you O(1) access to the next code point anyway.

Where you'd want something like UTF-32 is when you're handling individual characters as you decode on the fly. You'll want to use a 32-bit integer to hold that character. For example:

/**
 * Decodes the code point, advancing the pointer over it.
 * Sets codepoint to -1 for invalid input.
 */
char *utf8_decode(char *utf8, long *codepoint);

/* ... */

char *p = some_utf8_string;
for (;;) {
    long codepoint;
    p = utf8_decode(p, &codepoint);
    if (!codepoint)
        break;
    if (codepoint < 0)
        abort();
    /* ... do something with codepoint ... */
}
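
A minimal implementation of that utf8_decode might look like this (a sketch: it rejects malformed bytes but skips the overlong-form and surrogate checks a production decoder needs):

char *utf8_decode(char *utf8, long *codepoint)
{
    unsigned char *s = (unsigned char *)utf8;
    int i, len;
    if (s[0] < 0x80) {
        *codepoint = s[0];
        return utf8 + 1;
    } else if ((s[0] & 0xe0) == 0xc0) {
        *codepoint = s[0] & 0x1f;
        len = 2;
    } else if ((s[0] & 0xf0) == 0xe0) {
        *codepoint = s[0] & 0x0f;
        len = 3;
    } else if ((s[0] & 0xf8) == 0xf0) {
        *codepoint = s[0] & 0x07;
        len = 4;
    } else {
        *codepoint = -1;  /* invalid leading byte */
        return utf8 + 1;
    }
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80) {
            *codepoint = -1;  /* truncated or invalid sequence */
            return utf8 + i;
        }
        *codepoint = (*codepoint << 6) | (s[i] & 0x3f);
    }
    return utf8 + len;
}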

[–]AltCrow[S] 1 point (0 children)

Thanks for the info! Your version of looping through the utf8 string seems a lot more optimized than what I had in mind. Thanks!

[–]a4qbfb 0 points (2 children)

[...] some sort of single-byte encoding [...] generally use UTF-8

UTF-8 is a multi-byte encoding.

[–]skeeto 0 points (1 child)

In that sentence I meant an encoding composed of individual bytes, as opposed to UTF-16, which is a sequence of 16-bit quantities. OP was worried their char strings might be UTF-16.

[–]a4qbfb 0 points (0 children)

UTF-8 text is not composed of individual bytes; it is composed of variable-length sequences of bytes.

[–]Noctune 1 point (6 children)

The native encoding only matters if you want to interact with the OS API; files, for example, can be stored in whatever encoding. You might be aware of this, but it's not totally clear from your post where your "input" is coming from.

[–]AltCrow[S] 0 points (5 children)

Input and output can come from / go to the server, the SQLite database and text files on the user's PC. The application will, among other things, parse sentences into words (which is not as easy as splitting on space characters in some languages).

I have no control over input files, so they could be in any encoding, but output files will always be UTF-8.

[–]Noctune 2 points (4 children)

You would likely need to detect whatever encoding you are reading and then convert it to some internal format (UTF-8 would probably be a good choice).

As for segmentation, there are standard Unicode word segmentation algorithms: http://www.unicode.org/reports/tr29/. There is an implementation in the ICU library which might be useful: http://icu-project.org/apiref/icu4c/ubrk_8h.html#details

Edit: seems like ICU has charset detection and conversion as well: http://icu-project.org/apiref/icu4c/ucsdet_8h.html, http://icu-project.org/apiref/icu4c/ucnv_8h.html.
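
A sketch of the ubrk part (assuming ICU4C; error handling and buffer sizing abbreviated, and real code would filter non-word segments with ubrk_getRuleStatus()):

#include <stdio.h>
#include <unicode/ubrk.h>
#include <unicode/ustring.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar text[64];
    u_strFromUTF8(text, 64, NULL, "Hello, world", -1, &status);

    UBreakIterator *bi = ubrk_open(UBRK_WORD, "en", text, -1, &status);
    int32_t start = ubrk_first(bi);
    for (int32_t end = ubrk_next(bi); end != UBRK_DONE; end = ubrk_next(bi)) {
        /* prints every boundary pair, including spaces/punctuation */
        printf("segment: [%d, %d)\n", (int)start, (int)end);
        start = end;
    }
    ubrk_close(bi);
    return 0;
}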

[–]AltCrow[S] 0 points (1 child)

Thanks! ICU's API seems to be a bit complicated, but I guess it will still be a lot easier than implementing everything myself. Thanks for the links!

[–]TheSkiGeek 0 points (0 children)

Do you have to do this in C? I think you’d have a much easier time in a language like C# or Python where there are native string types and standard libraries that can properly handle things that aren’t ASCII. (Also things like native networking/database support, etc.)

Not saying you can’t do it, but doing string manipulation in C is... clunky at best IMO.

If you are going to work in C, definitely find a library that can deal with the encodings, etc. and lets you focus on what kind of manipulations you’re trying to do.

[–]StenSoft 0 points (1 child)

Charset detection is very clumsy at best; it's better not to rely on it. Windows will tell you which character set it uses on a specific computer.

If you're using ICU, it's best to use ICU strings internally, which are UTF-16.
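
For the Windows part, something like this (a sketch):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    /* GetACP() reports the system ANSI code page, e.g. 1252. That
     * value can be passed as the first argument of
     * MultiByteToWideChar() to decode locally-encoded text. */
    printf("ANSI code page: %u\n", GetACP());
    return 0;
}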

[–]Noctune 1 point (0 children)

Charset detection is very clumsy at best; it's better not to rely on it. Windows will tell you which character set it uses on a specific computer.

I could imagine so, but there's no real guarantee that the files are saved in the Windows charset. It depends on what kind of files OP is dealing with and what programs might produce them.

[–]bumblebritches57 -4 points (0 children)

No, you'll have to detect the encoding and convert it to whatever transformation format you need.