Tags: c, unicode, utf-8, wchar-t

Convert universal character name to UTF-8 in C


I need to convert universal character name (UCN) data from a database to UTF-8. It seems trivial, but I spent hours reading about Unicode, UTF-8, wide strings, ... without any result.

As an example, the string D\u00c3\u00bcsseldorf needs to be converted to Düsseldorf.

What I tried:

char str[] = "\u00c3\u00bc"; // corresponds to ü
size_t str_len = strlen(str);
size_t i;
for (i = 0; i < str_len; i++)
    printf("%02hhx ", str[i]);
printf("- %zu - %s\n", str_len, str); // prints "c3 83 c2 bc - 4 - Ã¼"

c3 is correct, but the next 3 bytes are unexpected.
The compiler only considers the first part of the UCN (\u00c3).

wchar_t wcs[] = L"\u00c3\u00bc";
size_t wcs_len = wcslen(wcs);
size_t i;
for (i = 0; i < wcs_len; i++)
    printf("%02hhx ", wcs[i]); // %hhx shows only the low byte of each wchar_t
printf("- %zu - %ls\n", wcs_len, wcs); // prints "c3 bc - 2 - Ã¼"

Looks better.
The entire UCN is considered (c3 bc), but still no ü.

char str[] = "\xc3\xbc";
size_t str_len = strlen(str);
size_t i;
for (i = 0; i < str_len; i++)
    printf("%02hhx ", str[i]);
printf("- %zu - %s\n", str_len, str); // prints "c3 bc - 2 - ü"

This prints the ü, but I modified str from UCN to hex code.

What am I missing to get from \u00c3\u00bc to ü?

--- UPDATE ---

As Rob Napier described, I have to change the initial string literal, since it was badly/double encoded. I believe the only solution is to manually change "D\u00c3\u00bcsseldorf" to "Düsseldorf" or "D\u00fcsseldorf". Both ways require a manual change.

Changing it to "D\xc3\xbcsseldorf" produces the correct result "Düsseldorf", but only by coincidence: the character following the second hex escape (\xbc) is not a hex digit (the letter s). "AAA\xc3\xbcBBB" gives "AAAû" (0x41 0x41 0x41 0xc3 0xbb), because a \x escape in a string literal doesn't stop after one byte; it keeps consuming hex digits. See this.


Solution

  • char str[] = "\u00c3\u00bc"; // corresponds to ü
    

    This is where you went wrong. This is not ü. This is Ã¼, just as is being output.

    The UCN for ü is \u00fc: LATIN SMALL LETTER U WITH DIAERESIS

    $ uni print c3 bc
         CPoint  Dec    UTF8        HTML       Name (Cat)
    '¼'  U+00BC  188    c2 bc       &frac14;   VULGAR FRACTION ONE QUARTER (Other_Number)
    'Ã'  U+00C3  195    c3 83       &Atilde;   LATIN CAPITAL LETTER A WITH TILDE (Uppercase_Letter)
    
    $ uni id ü
         CPoint  Dec    UTF8        HTML       Name (Cat)
    'ü'  U+00FC  252    c3 bc       &uuml;     LATIN SMALL LETTER U WITH DIAERESIS (Lowercase_Letter)
    

    Unicode code points (which are what UCNs encode) assign a single number to each Unicode character. They are the identifier for the character, not the encoding.

    What you've written here is the UTF-8 encoding of ü. UTF-8 is a way of writing down Unicode code points. Except for ASCII values (0-127), the UTF-8 bytes are always very different from the code point's value. (UTF-8 is possibly the most clever and useful text encoding ever devised. But it is not trivial to understand.)

    If you want to hand-encode UTF-8, then the \x syntax is correct. You can inject arbitrary bytes into a C string that way. Generally you should prefer the \u00fc syntax when expressing a character, however.

    The reason your first byte seemed correct is that the UTF-8 encoding of Ã is c3 83. "c3" is the first byte of the UTF-8 encoding of many modified Latin characters. Seeing a lot of c3 bytes is an easy way to detect Western European UTF-8 text.