I need to convert universal character name (UCN) data from a database to UTF-8. It seems trivial, but I have spent hours reading about Unicode, UTF-8, wide strings, and so on, without any result.
As an example, the following string needs to be converted from "D\u00c3\u00bcsseldorf" to "Düsseldorf".
What I tried:
char str[] = "\u00c3\u00bc"; // corresponds to ü
size_t str_len = strlen(str);
for (i = 0; i < str_len; i++)
printf("%02hhx ", str[i]);
printf("- %zu - %s\n", str_len, str); // prints "c3 83 c2 bc - 4 - ü"
c3 is correct, but the next 3 bytes are unexpected. It looks as if the compiler only considered the first part of the UCN (\u00c3).
wchar_t wcs[] = L"\u00c3\u00bc";
size_t wcs_len = wcslen(wcs);
for (i = 0; i < wcs_len; i++)
printf("%02hhx ", wcs[i]);
printf("- %zu - %ls\n", wcs_len, wcs); // prints "c3 bc - 2 - ü"
Looks better. The entire UCN is considered (c3 bc), but still no ü.
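As a side note, a wide string holding the correct code point can be converted to UTF-8 at runtime through the locale. A minimal sketch, assuming the environment provides a UTF-8 locale (locale names vary by platform):

#include <locale.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    setlocale(LC_ALL, "");         // assumes a UTF-8 locale in the environment
    wchar_t wcs[] = L"\u00fc";     // the actual code point of ü
    char buf[8];
    size_t n = wcstombs(buf, wcs, sizeof buf);
    if (n == (size_t)-1)
        return 1;                  // character not representable in this locale
    for (size_t i = 0; i < n; i++)
        printf("%02hhx ", buf[i]); // prints "c3 bc" in a UTF-8 locale
    printf("- %s\n", buf);         // prints "- ü"
    return 0;
}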
char str[] = "\xc3\xbc";
size_t str_len = strlen(str);
for (i = 0; i < str_len; i++)
printf("%02hhx ", str[i]);
printf("- %zu %s\n", str_len, str); // prints "c3 bc - 2 ü"
This prints the ü, but I had to change str from a UCN to hex escapes. What am I missing to get from \u00c3\u00bc to ü?
--- UPDATE ---
As Rob Napier describes, I have to change the initial string literal, since it was badly/double encoded. I believe the only solution is to manually change "D\u00c3\u00bcsseldorf" to "Düsseldorf" or to "D\u00fcsseldorf". Both ways require a manual change.
Changing it to "D\xc3\xbcsseldorf"
produces the correct result "Düsseldorf"
, but only by coincidence because the byte following the second byte injection (\xbc
) is non-hex (the letter s
). "AAA\xc3\xbcBBB"
gives "AAAû"
(0x41 0x41 0x41 0xc3 0xbb
). Too bad that \x
in a string literal doesn't stop after 1 byte. See this.
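For completeness: when the bad strings come from a database and can't be edited by hand, the double encoding can also be undone at runtime, by decoding the bytes as UTF-8 once and writing each resulting code point back as a single byte. A minimal sketch with a hypothetical fix_double_utf8 helper, assuming the original data was Latin-1-compatible (all code points at most U+00FF):

#include <stdio.h>

// Hypothetical helper: undo one level of double UTF-8 encoding in place.
// Two-byte sequences (110xxxxx 10xxxxxx) are decoded and the code point
// is written back as a single byte; everything else passes through.
// Assumes every decoded code point fits in one byte (Latin-1 range).
static void fix_double_utf8(char *s) {
    unsigned char *in = (unsigned char *)s;
    unsigned char *out = (unsigned char *)s;
    while (*in) {
        if ((in[0] & 0xE0) == 0xC0 && (in[1] & 0xC0) == 0x80) {
            *out++ = (unsigned char)(((in[0] & 0x1F) << 6) | (in[1] & 0x3F));
            in += 2;
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}

int main(void) {
    char str[] = "D\u00c3\u00bcsseldorf"; // double-encoded input
    fix_double_utf8(str);
    printf("%s\n", str);                  // prints "Düsseldorf"
    return 0;
}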
char str[] = "\u00c3\u00bc"; // corresponds to ü
This is where you went wrong. This is not ü. This is ü, just as is being output.
The UCN for ü is \u00fc: LATIN SMALL LETTER U WITH DIAERESIS.
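With the correct code point, the compiler produces the bytes you were expecting, assuming a UTF-8 execution character set (the default for GCC and Clang):

#include <stdio.h>
#include <string.h>

int main(void) {
    char str[] = "\u00fc";         // UCN for ü; the compiler encodes it as UTF-8
    for (size_t i = 0; i < strlen(str); i++)
        printf("%02hhx ", str[i]); // prints "c3 bc"
    printf("- %s\n", str);         // prints "- ü"
    return 0;
}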
$ uni print c3 bc
     CPoint  Dec  UTF8   HTML      Name (Cat)
'¼'  U+00BC  188  c2 bc  &frac14;  VULGAR FRACTION ONE QUARTER (Other_Number)
'Ã'  U+00C3  195  c3 83  &Atilde;  LATIN CAPITAL LETTER A WITH TILDE (Uppercase_Letter)

$ uni id ü
     CPoint  Dec  UTF8   HTML      Name (Cat)
'ü'  U+00FC  252  c3 bc  &uuml;    LATIN SMALL LETTER U WITH DIAERESIS (Lowercase_Letter)
Unicode code points (which are what UCNs encode) assign a single number to each Unicode character. They are the identifier for the character, not the encoding. What you've written here is the UTF-8 encoding of ü. UTF-8 is a way of writing down Unicode code points. Except for ASCII values (0-127), the UTF-8 bytes are always very different from the code point's value. (UTF-8 is possibly the most clever and useful text encoding ever devised. But it is not trivial to understand.)
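To make the relationship concrete, here is a sketch of the two-byte case of the encoding (code points U+0080 through U+07FF), which covers ü:

#include <stdio.h>

// Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes:
// 110xxxxx 10xxxxxx, where the x's are the 11 low bits of the code point.
static void utf8_encode_2byte(unsigned int cp, unsigned char out[2]) {
    out[0] = 0xC0 | (cp >> 6);   // leading byte: top 5 bits
    out[1] = 0x80 | (cp & 0x3F); // continuation byte: low 6 bits
}

int main(void) {
    unsigned char buf[2];
    utf8_encode_2byte(0x00FC, buf);        // U+00FC, ü
    printf("%02x %02x\n", buf[0], buf[1]); // prints "c3 bc"
    return 0;
}

This is why c3 bc looks nothing like 00fc: the code point's bits are split across two bytes and mixed with the marker bits.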
If you want to hand-encode UTF-8, then the \x syntax is correct; you can inject arbitrary bytes into a C string that way. Generally, though, you should prefer the \u00fc syntax when expressing a character.
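If you do hand-encode bytes with \x, adjacent string literal concatenation is one way to stop the escape from swallowing hex digits that follow, since an escape cannot cross a literal boundary:

#include <stdio.h>

int main(void) {
    // "AAA\xc3\xbcBBB" would parse \xbcBBB as a single escape;
    // splitting the literal ends the hex escape at the boundary.
    char str[] = "AAA\xc3\xbc" "BBB";
    printf("%s\n", str); // prints "AAAüBBB"
    return 0;
}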
The reason your first byte seemed correct is that the UTF-8 encoding of Ã is c3 83. "c3" is the first byte of the UTF-8 encoding of many modified Latin characters. Seeing a lot of c3 bytes is an easy way to detect Western European UTF-8 text.