c++ unicode utf-8 astral-plane

How do I input 4-byte UTF-8 characters?


I am writing a small app which I need to test with UTF-8 characters of different byte lengths.

I can input Unicode characters that are encoded in UTF-8 with 1, 2, or 3 bytes just fine, for example:

string in = "pi = \u3a0";

But how do I get a Unicode character that is encoded with 4 bytes? I have tried:

string in = "aegan check mark = \u10102";

Which, as far as I understand, should output 𐄂 (U+10102, AEGEAN CHECK MARK). But when I print it out I get ᴶ0

What am I missing?

EDIT:

I got it to work by switching to the capital \U form and padding with leading zeros:

string in = "\U00010102";

Wish I had thought of that sooner :)
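
A quick sanity check (assuming a compiler whose execution character set is UTF-8, e.g. GCC or Clang with their defaults) is to look at the byte length of the escaped character, which should come out as 4:

#include <iostream>
#include <string>

int main()
{
    // \U takes exactly eight hex digits, hence the leading zeros for U+10102.
    std::string check = "\U00010102";

    // With a UTF-8 execution character set this is the byte sequence
    // F0 90 84 82, so length() reports 4.
    std::cout << check << " is " << check.length() << " bytes" << std::endl;
}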


Solution

  • There's a longer form of escape: \U followed by eight hex digits, rather than \u followed by four. This is also used in Python, amongst others:

    >>> '\xf0\x90\x84\x82'.decode("UTF-8")
    u'\U00010102'
    

    However, if you are using byte strings, why not just escape each byte like above, rather than relying on the compiler to convert the escape to a UTF-8 string? This would seem to be more portable as well - if I compile the following program:

    #include <iostream>
    #include <string>
    
    int main()
    {
        std::cout << "narrow: " << std::string("\uFF0E").length() <<
            " utf8: " << std::string("\xEF\xBC\x8E").length() <<
            " wide: " << std::wstring(L"\uFF0E").length() << std::endl;
    
        std::cout << "narrow: " << std::string("\U00010102").length() <<
            " utf8: " << std::string("\xF0\x90\x84\x82").length() <<
            " wide: " << std::wstring(L"\U00010102").length() << std::endl;
    }
    

    On Win32, with my current options, cl gives:

    warning C4566: character represented by universal-character-name '\UD800DD02' cannot be represented in the current code page (932)

    The compiler tries to convert all Unicode escapes in byte strings to the system code page, which, unlike UTF-8, cannot represent all Unicode characters. Oddly, it has understood that \U00010102 is \uD800\uDD02 in UTF-16 (its internal Unicode representation) and has mangled the escape in the error message...

    When run, the program prints:

    narrow: 2 utf8: 3 wide: 1
    narrow: 2 utf8: 4 wide: 2
    

    Note that the UTF-8 byte strings and the wide strings are correct, but the compiler failed to convert "\U00010102", producing the byte string "??", which is an incorrect result.
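
    For reference, both the byte sequence \xF0\x90\x84\x82 and the surrogate pair \uD800\uDD02 that cl mentions fall directly out of the standard encoding rules for U+10102. Here is a minimal sketch (assuming a C++11 compiler, since it uses char32_t) that derives both at runtime:

    #include <cstdio>
    #include <string>

    // Encode one Unicode code point as UTF-8, following the standard
    // 1- to 4-byte bit layout (surrogates and values above U+10FFFF
    // are not handled here).
    std::string to_utf8(char32_t cp)
    {
        std::string out;
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;
    }

    int main()
    {
        char32_t cp = 0x10102;              // AEGEAN CHECK MARK

        // UTF-8: prints "f0 90 84 82"
        for (unsigned char b : to_utf8(cp))
            std::printf("%02x ", static_cast<unsigned>(b));
        std::printf("\n");

        // UTF-16 surrogate pair: prints "d800 dd02"
        char32_t v = cp - 0x10000;
        std::printf("%04x %04x\n",
                    static_cast<unsigned>(0xD800 + (v >> 10)),
                    static_cast<unsigned>(0xDC00 + (v & 0x3FF)));
    }

    Both results are pure arithmetic on the code point, independent of the compiler's execution character set, which is the part that varies between platforms above.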