I am writing a small app that I need to test with UTF-8 characters of various encoded byte lengths.
I can input Unicode test characters whose UTF-8 encodings are 1, 2, or 3 bytes long just fine by doing, for example:
string in = "pi = \u3a0";
But how do I get a unicode character that is encoded with 4-bytes? I have tried:
string in = "aegan check mark = \u10102";
Which as far as I understand should be outputting . But when I print that out I get ᴶ0
What am I missing?
EDIT:
I got it to work by using the capital \U form and adding leading zeros:
string in = "\U00010102";
Wish I had thought of that sooner :)
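For the record, the byte count can be checked with a u8 literal (a minimal sketch, assuming a C++11 compiler where u8 literals yield plain char, i.e. pre-C++20):

#include <iostream>
#include <string>

int main()
{
    // u8 forces UTF-8 encoding regardless of the execution character set,
    // so this prints 4: U+10102 takes four bytes in UTF-8.
    std::string s = u8"\U00010102";
    std::cout << s.length() << std::endl;
}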
There's a longer form of escape in the pattern \U followed by eight hex digits, rather than \u followed by four. This is also used in C and Python, amongst others; for example, in Python 2:
>>> '\xf0\x90\x84\x82'.decode("UTF-8")
u'\U00010102'
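The same distinction exists in C++11's char32_t string literals, where each element is a whole code point; a quick sketch (assuming a C++11 compiler):

#include <iostream>
#include <string>

int main()
{
    // \u covers code points up to U+FFFF; anything beyond needs \U.
    std::u32string pi    = U"\u03A0";     // GREEK CAPITAL LETTER PI
    std::u32string check = U"\U00010102"; // AEGEAN CHECK MARK
    // Each prints 1: one char32_t element per code point.
    std::cout << pi.length() << " " << check.length() << std::endl;
}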
However, if you are using byte strings, why not just escape each byte as above, rather than relying on the compiler to convert the escape to a UTF-8 string? That also seems more portable; if I compile the following program:
#include <iostream>
#include <string>

int main()
{
    // U+FF0E FULLWIDTH FULL STOP: three bytes in UTF-8, one UTF-16 code unit.
    std::cout << "narrow: " << std::string("\uFF0E").length()
              << " utf8: "  << std::string("\xEF\xBC\x8E").length()
              << " wide: "  << std::wstring(L"\uFF0E").length() << std::endl;

    // U+10102 AEGEAN CHECK MARK: four bytes in UTF-8, a UTF-16 surrogate pair.
    std::cout << "narrow: " << std::string("\U00010102").length()
              << " utf8: "  << std::string("\xF0\x90\x84\x82").length()
              << " wide: "  << std::wstring(L"\U00010102").length() << std::endl;
}
On Win32 with my current options, cl gives:
warning C4566: character represented by universal-character-name '\UD800DD02' cannot be represented in the current code page (932)
The compiler tries to convert all Unicode escapes in byte strings to the system code page, which, unlike UTF-8, cannot represent all Unicode characters. Oddly, it has understood that \U00010102 is \uD800\uDD02 in UTF-16 (its internal Unicode representation) and mangled the escape in the error message...
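The surrogate arithmetic it performed is simple enough to check by hand; here is a standalone sketch of the standard UTF-16 mapping:

#include <cstdio>

int main()
{
    // Map a supplementary-plane code point to its UTF-16 surrogate pair.
    unsigned cp = 0x10102;
    unsigned v  = cp - 0x10000;           // 20-bit offset
    unsigned hi = 0xD800 + (v >> 10);     // lead surrogate: top 10 bits
    unsigned lo = 0xDC00 + (v & 0x3FF);   // trail surrogate: low 10 bits
    std::printf("U+%05X -> \\u%04X\\u%04X\n", cp, hi, lo);
    // prints: U+10102 -> \uD800\uDD02
}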
When run, the test program above prints:
narrow: 2 utf8: 3 wide: 1
narrow: 2 utf8: 4 wide: 2
Note that the UTF-8 byte strings and the wide strings are correct, but the compiler failed to convert "\U00010102", giving the byte string "??", an incorrect result.
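If you can rely on C++11, another option is converting the code point to UTF-8 at runtime, which is independent of the compiler's execution character set. A sketch using std::wstring_convert (deprecated as of C++17 but still widely available):

#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

int main()
{
    // Convert a char32_t code point to UTF-8 at runtime, so the result
    // does not depend on the system code page the way "\U00010102" does.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    std::string utf8 = conv.to_bytes(U'\U00010102');
    std::cout << "utf8 bytes: " << utf8.length() << std::endl;  // 4
}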