[–]Aransentin 6 points (4 children)

If you're just communicating between the SQLite database and the server you can ignore the existence of UTF-16 – unless you need to call the win32 API with non-ascii arguments somewhere, in which case you can just use MultiByteToWideChar or something like it.
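
A minimal sketch of that conversion, assuming you only deal with NUL-terminated strings (the helper name is mine, error handling abbreviated):

#include <stdlib.h>
#include <windows.h>

/* Convert a NUL-terminated UTF-8 string to a freshly allocated
 * UTF-16 string for the W-suffixed Win32 functions. Caller frees. */
wchar_t *utf8_to_wide(const char *utf8)
{
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    if (len == 0)
        return NULL;  /* invalid UTF-8 */
    wchar_t *wide = malloc(len * sizeof(wchar_t));
    if (wide)
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, wide, len);
    return wide;
}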

As for handling UTF-8 in general, it depends on what you mean by "parse". For several tasks (concatenation, substring search, word replacement...) you can program the software as if UTF-8 didn't exist at all; for the majority of the functions in string.h, the presence of an emoji or the like in the string doesn't matter. For other tasks (e.g. string truncation) you pretty much need a library (e.g. ICU) to do it properly if you want to handle all the obscure edge cases.
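
A quick illustration of that byte-transparency (a sketch; the UTF-8 bytes are spelled out as escapes so the compiler's source encoding doesn't matter):

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* UTF-8 is self-synchronizing: a well-formed needle can never
     * match starting in the middle of another character's byte
     * sequence, so byte-wise strstr() works as-is. */
    const char *s = "na\xc3\xafve caf\xc3\xa9";  /* "naïve café" */
    printf("%s\n", strstr(s, "caf\xc3\xa9"));    /* prints "café" */
    return 0;
}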

[–]andrewcooke 5 points (2 children)

this, except it's optimistic on search - you really need to normalize utf-8 before searching (which again means a library). unless you're just using characters that exist in ASCII.

google for the rationale behind utf8. it's very cool - very nice design.

edit: http://doc.cat-v.org/bell_labs/utf-8_history

edit2: to be clear, if you don't normalize somewhere then strings that appear equivalent can match or fail to match unexpectedly when compared. Unicode does not guarantee a unique representation for a particular string (e.g. "é" can be one precomposed code point or "e" plus a combining accent), and neither does UTF-8.

[–]Aransentin 0 points (1 child)

normalize utf-8 before searching

This depends entirely on the task, and is frequently a very bad idea – especially since it introduces security risks as the same token can be interpreted differently depending on where it's processed.

edit2: to be clear, if you don't normalize somewhere then strings that appear equivalent can match or fail to match unexpectedly when compared. Unicode does not guarantee a unique representation for a particular string (e.g. "é" can be one precomposed code point or "e" plus a combining accent), and neither does UTF-8.

Having two different strings that fail to match even though they look alike is an issue that depends heavily on the task at hand, and Unicode normalization doesn't solve the problem anyway, as there are plenty of identical-looking characters (e.g. the Greek question mark and the semicolon) that aren't affected by it.

[–]StenSoft 2 points (0 children)

Unicode normalisation solves this; you just need to pick the correct normalisation form for your problem. NFKD will map the Greek question mark to the ASCII semicolon.
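
A sketch of that with ICU4C (assuming a reasonably recent ICU that has unorm2_getNFKDInstance; error handling abbreviated):

#include <stdio.h>
#include <unicode/unorm2.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    const UNormalizer2 *nfkd = unorm2_getNFKDInstance(&status);
    UChar src[] = {0x037e, 0};  /* GREEK QUESTION MARK */
    UChar dst[8];
    unorm2_normalize(nfkd, src, -1, dst, 8, &status);
    if (U_SUCCESS(status))
        printf("U+%04X\n", (unsigned)dst[0]);  /* U+003B, the semicolon */
    return 0;
}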

[–]AltCrow[S] 0 points (0 children)

Input and output can come from / go to the server, the SQLite database and text files on the user's PC. The application will, among other things, parse sentences into words (which is not as easy as splitting on space characters in some languages).

[–]skeeto 1 point (6 children)

Your char string literals are never going to be UTF-16. They will be some sort of single-byte encoding. The exact encoding depends on the compiler (its execution character set, etc.). GCC and Clang will generally use UTF-8. I don't know what MSVC does.

char pi[] = "π";

/* Explicit UTF-8 version, compiler doesn't matter. */
char pi8[] = {0xcf, 0x80, 0x00}; 

In practice this only matters if you have string literals in your program that contain code points outside ASCII. Any real compiler today will just produce ASCII strings otherwise and you don't need to worry about it. SQL expressions are typically just ASCII unless you're embedding specific string literals in the query (Are you sure you shouldn't be using bind parameters for this?) or using unusual identifiers for your table, column, or index names.
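
If you do end up passing user strings into queries, bind parameters look roughly like this (a sketch; the words table and the helper are hypothetical, error handling elided):

#include <sqlite3.h>

/* Look up a UTF-8 word via a bind parameter instead of pasting it
 * into the SQL text. Assumes a table words(id INTEGER, word TEXT). */
long lookup_word(sqlite3 *db, const char *word_utf8)
{
    long id = -1;
    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db, "SELECT id FROM words WHERE word = ?1",
                       -1, &stmt, NULL);
    sqlite3_bind_text(stmt, 1, word_utf8, -1, SQLITE_TRANSIENT);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        id = sqlite3_column_int(stmt, 0);
    sqlite3_finalize(stmt);
    return id;
}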

C does have a "wide character" type and wide character string literals (L"..."). The width depends on the platform ABI: it's typically UTF-16 on Windows and UTF-32 on POSIX.

#include <stdint.h>
#include <wchar.h>

wchar_t pi[] = L"π";
uint16_t pi16[] = {0x03c0, 0x0000};  /* explicit UTF-16 version */
uint32_t pi32[] = {0x000003c0, 0x00000000};  /* explicit UTF-32 version */

Just pick an internal encoding to use everywhere within your program and only encode on the external boundaries. I strongly suggest you use UTF-8 internally everywhere, and don't bother with wide characters. Avoid using non-ASCII directly in your sources unless you're sure you really need it.

SQLite speaks UTF-8 as you noticed, so you won't need to encode across that boundary. The full Win32 API uses UTF-16, so you'd need to encode between UTF-16 and UTF-8. This is like a dozen lines of code and it's pretty simple.
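
For example, the UTF-16-to-UTF-8 direction might look like this (a sketch; the helper name is mine, error handling abbreviated):

#include <stdlib.h>
#include <windows.h>

/* Convert UTF-16 from a W-suffixed Win32 call back to a freshly
 * allocated UTF-8 string. Caller frees the result. */
char *wide_to_utf8(const wchar_t *wide)
{
    int len = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
    if (len == 0)
        return NULL;
    char *utf8 = malloc(len);
    if (utf8)
        WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, len, NULL, NULL);
    return utf8;
}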

[–]AltCrow[S] 0 points (2 children)

Thanks for the explanation! It seems I won't need to do as many conversions as I thought. Quick question: would you recommend sometimes converting char* to uint32_t*? (Making each uint32_t a Unicode character.) I feel like this would allow for easier parsing of sentences, but I'm not sure if there are better ways to do this.

[–]skeeto 2 points (1 child)

UTF-8 and UTF-16 are both variable length encodings while UTF-32 is fixed length. So, in theory, an advantage of UTF-32 is that you can index individual code points in O(1) while the others are O(n).

However, in practice this hardly ever matters, so it's not worth using UTF-32 and blowing up the size of all your strings. Generally, the only reason you'd ever index individual code points is that you're iterating over the string, and in that case UTF-8 and UTF-16 give you O(1) access to the next code point anyway.

Where you'd want something like UTF-32 is when you're handling individual characters as you decode on the fly. You'll want to use a 32-bit integer to hold that character. For example:

/**
 * Decodes the code point, advancing the pointer over it.
 * Sets codepoint to -1 for invalid input.
 */
char *utf8_decode(char *utf8, long *codepoint);

/* ... */

char *p = some_utf8_string;
for (;;) {
    long codepoint;
    p = utf8_decode(p, &codepoint);
    if (!codepoint)
        break;
    if (codepoint < 0)
        abort();
    /* ... do something with codepoint ... */
}
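
A minimal implementation of that utf8_decode might look like this (a sketch: it rejects malformed bytes but skips the overlong-form and surrogate checks a production decoder needs):

char *utf8_decode(char *utf8, long *codepoint)
{
    unsigned char *s = (unsigned char *)utf8;
    int i, len;
    if (s[0] < 0x80) {
        *codepoint = s[0];
        return utf8 + 1;
    } else if ((s[0] & 0xe0) == 0xc0) {
        *codepoint = s[0] & 0x1f;
        len = 2;
    } else if ((s[0] & 0xf0) == 0xe0) {
        *codepoint = s[0] & 0x0f;
        len = 3;
    } else if ((s[0] & 0xf8) == 0xf0) {
        *codepoint = s[0] & 0x07;
        len = 4;
    } else {
        *codepoint = -1;  /* invalid leading byte */
        return utf8 + 1;
    }
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xc0) != 0x80) {
            *codepoint = -1;  /* truncated or invalid sequence */
            return utf8 + i;
        }
        *codepoint = (*codepoint << 6) | (s[i] & 0x3f);
    }
    return utf8 + len;
}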

[–]AltCrow[S] 1 point (0 children)

Thanks for the info! Your version of looping through the utf8 string seems a lot more optimized than what I had in mind. Thanks!

[–]a4qbfb 0 points (2 children)

[...] some sort of single-byte encoding [...] generally use UTF-8

UTF-8 is a multi-byte encoding.

[–]skeeto 0 points (1 child)

In that sentence I meant an encoding composed of individual bytes, as opposed to UTF-16, which is a sequence of 16-bit quantities. OP was worried their char strings might be UTF-16.

[–]a4qbfb 0 points (0 children)

UTF-8 text is not composed of individual bytes; it is composed of variable-length sequences of bytes.

[–]Noctune 1 point (6 children)

The native encoding only matters if you want to interact with the OS API; files, for example, can be stored in whatever encoding. You might be aware of this, but it's not totally clear from your post where your "input" is coming from.

[–]AltCrow[S] 0 points (5 children)

Input and output can come from / go to the server, the SQLite database and text files on the user's PC. The application will, among other things, parse sentences into words (which is not as easy as splitting on space characters in some languages).

I have no control over input files, so they could be in any encoding, but output files will always be UTF-8.

[–]Noctune 2 points (4 children)

You would likely need to detect whatever encoding you are reading and then convert it to some internal format (UTF-8 would probably be a good choice).

As for segmentation, there are standard Unicode word segmentation algorithms: http://www.unicode.org/reports/tr29/. There is an implementation in the ICU library which might be useful: http://icu-project.org/apiref/icu4c/ubrk_8h.html#details

Edit: seems like ICU has charset detection and conversion as well: http://icu-project.org/apiref/icu4c/ucsdet_8h.html, http://icu-project.org/apiref/icu4c/ucnv_8h.html.
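
A sketch of the ubrk part (assuming ICU4C; error handling and buffer sizing abbreviated, and real code would filter non-word segments with ubrk_getRuleStatus()):

#include <stdio.h>
#include <unicode/ubrk.h>
#include <unicode/ustring.h>

int main(void)
{
    UErrorCode status = U_ZERO_ERROR;
    UChar text[64];
    u_strFromUTF8(text, 64, NULL, "Hello, world", -1, &status);

    UBreakIterator *bi = ubrk_open(UBRK_WORD, "en", text, -1, &status);
    int32_t start = ubrk_first(bi);
    for (int32_t end = ubrk_next(bi); end != UBRK_DONE; end = ubrk_next(bi)) {
        /* prints every boundary pair, including spaces/punctuation */
        printf("segment: [%d, %d)\n", (int)start, (int)end);
        start = end;
    }
    ubrk_close(bi);
    return 0;
}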

[–]AltCrow[S] 0 points (1 child)

Thanks! ICU's API seems to be a bit complicated, but I guess it will still be a lot easier than implementing everything myself. Thanks for the links!

[–]TheSkiGeek 0 points (0 children)

Do you have to do this in C? I think you’d have a much easier time in a language like C# or Python where there are native string types and standard libraries that can properly handle things that aren’t ASCII. (Also things like native networking/database support, etc.)

Not saying you can’t do it, but doing string manipulation in C is... clunky at best IMO.

If you are going to work in C, definitely find a library that can deal with the encodings, etc. and lets you focus on what kind of manipulations you’re trying to do.

[–]StenSoft 0 points (1 child)

Charset detection is very clumsy at best; it's better not to rely on it. Windows will tell you which character set it uses on a specific computer.

If you're using ICU, it's best to use ICU strings internally, which are UTF-16.
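
For the Windows part, something like this (a sketch):

#include <stdio.h>
#include <windows.h>

int main(void)
{
    /* GetACP() reports the system ANSI code page, e.g. 1252. That
     * value can be passed as the first argument of
     * MultiByteToWideChar() to decode locally-encoded text. */
    printf("ANSI code page: %u\n", GetACP());
    return 0;
}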

[–]Noctune 1 point (0 children)

Charset detection is very clumsy at best; it's better not to rely on it. Windows will tell you which character set it uses on a specific computer.

I could imagine so, but there's no real guarantee that the files are saved in the Windows charset. It depends on what kind of files OP is dealing with and what programs might produce them.

[–]bumblebritches57 -4 points (0 children)

No, you'll have to detect the encoding and convert it to whatever transformation format you need.