c++ qt unicode qstring qchar

In Qt, how do I convert the Unicode codepoint U+1F64B to a QString holding its equivalent character "🙋"?


Background:

I am making a hash that lets you look up the description you see below by feeding it a QString containing the character.

Character map example

I got a full list of the relevant data, looking something like this:

QHash<QString, QString> lookupCharacterDescription;
...
lookupCharacterDescription.insert("003F","QUESTION MARK");
lookupCharacterDescription.insert("0040","COMMERCIAL AT");
lookupCharacterDescription.insert("0041","LATIN CAPITAL LETTER A");
lookupCharacterDescription.insert("0042","LATIN CAPITAL LETTER B");
...
lookupCharacterDescription.insert("1F648","SEE-NO-EVIL MONKEY");
lookupCharacterDescription.insert("1F649","HEAR-NO-EVIL MONKEY");
lookupCharacterDescription.insert("1F64A","SPEAK-NO-EVIL MONKEY");
lookupCharacterDescription.insert("1F64B","HAPPY PERSON RAISING ONE HAND");
...
lookupCharacterDescription.insert("FFFD","REPLACEMENT CHARACTER");
lookupCharacterDescription.insert("FFFE","<not a character>");
lookupCharacterDescription.insert("FFFF","<not a character>");
lookupCharacterDescription.insert("FFFFE","<not a character>");
lookupCharacterDescription.insert("FFFFF","<not a character>");

Now obviously "1F64B" needs to be wrapped in something here. I have tried playing around with things like 0x1F64B as a QChar, but I am honestly groping in the dark here. I could make it work with the lower values like the Latin Letters, but it fails with the 5 character addresses.

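Questions:

  • Is this considered UTF-32?
  • What can I wrap this value "1F64B" in to produce the QString("🙋")?
  • Will the wrappings also work for the lower values?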


Solution

  • When you use QString(0x1F64B), it calls QString::QString(QChar ch). Since QChar is a 16-bit type, the value is truncated to 0xF64B and you get the wrong character: U+F64B is a private-use code point with no standard meaning. I'm pretty sure you'll get an out-of-range warning at that line. You can see the value F64B in the placeholder glyph shown for that character if you zoom in, or in a hex editor. Since 0x1F64B cannot fit into a single 16-bit QChar and must be represented by a surrogate pair, you can't initialize the string that way.

    OTOH QString("🙋") works since it's constructing the string from another string. You must construct the string with a string like that, or manually by assigning the UTF-8/16 code units.
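
    To make the truncation concrete, here is a minimal sketch in plain C++ (no Qt needed). The helper names truncateTo16Bits and needsSurrogatePair are made up for this illustration; they only mimic what happens when a wide value is squeezed into one 16-bit unit.

    ```cpp
    #include <cassert>
    #include <cstdint>

    // Hypothetical helper: keeps only the low 16 bits, which is the effect of
    // forcing a wider value into a single 16-bit code unit.
    std::uint16_t truncateTo16Bits(std::uint32_t codePoint) {
        assert(codePoint <= 0x10FFFF);  // largest valid Unicode code point
        return static_cast<std::uint16_t>(codePoint & 0xFFFFu);
    }

    // Hypothetical helper: true when the code point lies above U+FFFF and so
    // cannot be stored in one 16-bit unit.
    bool needsSurrogatePair(std::uint32_t codePoint) {
        assert(codePoint <= 0x10FFFF);
        return codePoint > 0xFFFFu;
    }
    ```

    With these, truncateTo16Bits(0x1F64B) gives 0xF64B, while a value such as 0x0041 passes through unchanged, which matches the observation that only the 5-digit code points fail.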

    Is this considered UTF-32?

    No. UTF-32 is a Unicode encoding that uses 32 bits for each code unit. You only have a QString and not a bare byte array, so you don't need to care about its underlying encoding (which is actually UTF-16).
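
    As a rough illustration of what "code unit" means here, the sketch below counts how many units one code point takes in each encoding form. The struct and function names are invented for this example, and the thresholds are the standard UTF-8/UTF-16 range boundaries.

    ```cpp
    #include <cassert>
    #include <cstdint>

    // Hypothetical result type: number of code units per encoding form.
    struct CodeUnitCounts {
        int utf8;   // 8-bit units
        int utf16;  // 16-bit units
        int utf32;  // 32-bit units
    };

    CodeUnitCounts countCodeUnits(std::uint32_t cp) {
        assert(cp <= 0x10FFFF);              // inside the Unicode range
        assert(cp < 0xD800 || cp > 0xDFFF);  // surrogates are not characters
        const int utf8 = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        const int utf16 = cp < 0x10000 ? 1 : 2;
        return CodeUnitCounts{utf8, utf16, 1};  // UTF-32 always needs one unit
    }
    ```

    For U+1F64B this gives 4 UTF-8 units, 2 UTF-16 units and 1 UTF-32 unit; for U+0041 it gives 1 in every form.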

    What can I wrap this value "1F64B" in to produce the QString("🙋")?

    You shouldn't handle the numeric values as strings. Store them as a numeric type instead:

    QHash<qint32, QString> lookupCharacterDescription;
    lookupCharacterDescription.insert(0x1F64B, "HAPPY PERSON RAISING ONE HAND");
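
    If the data only exists as hex strings like "1F64B", they can be converted to numbers first. The following is a plain C++ sketch of that step; parseCodePoint and its -1 error value are assumptions made for this example. In Qt itself, QString::toUInt(&ok, 16) should do a similar job.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <string>

    // Hypothetical helper: parses a hexadecimal code point such as "1F64B".
    // Returns -1 for empty input, non-hex characters, or values above U+10FFFF.
    std::int64_t parseCodePoint(const std::string& hex) {
        if (hex.empty() || hex.size() > 6) return -1;
        std::int64_t value = 0;
        for (char c : hex) {
            int digit = 0;
            if (c >= '0' && c <= '9') digit = c - '0';
            else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
            else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
            else return -1;  // not a hexadecimal digit
            value = value * 16 + digit;
        }
        assert(value >= 0);
        return value <= 0x10FFFF ? value : -1;
    }
    ```

    Here parseCodePoint("1F64B") yields 0x1F64B and parseCodePoint("003F") yields 0x3F, so the same conversion covers both the short and the long entries in the list.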
    

    and then to make a string that contains the character at code point 0x1F64B use

    uint cp = 0x1F64B;
    QString mystr = QString::fromUcs4(&cp, 1);
    

    Will the wrappings also work for the lower values?

    Yes, since UCS-4, a.k.a. UTF-32, can store every possible Unicode code point.
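
    To see why one 32-bit input covers both cases, here is a sketch of the UTF-32 to UTF-16 step that a function like fromUcs4 has to perform conceptually. This is not Qt's implementation, just the standard surrogate-pair arithmetic; encodeUtf16 is a name invented for the example.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Hypothetical helper: encodes one code point as UTF-16 code units.
    std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp) {
        assert(cp <= 0x10FFFF);              // inside the Unicode range
        assert(cp < 0xD800 || cp > 0xDFFF);  // surrogates cannot be encoded
        std::vector<std::uint16_t> units;
        if (cp < 0x10000) {
            // Lower values fit in one unit as-is.
            units.push_back(static_cast<std::uint16_t>(cp));
        } else {
            const std::uint32_t v = cp - 0x10000;  // 20-bit offset
            units.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
            units.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
        }
        return units;
    }
    ```

    encodeUtf16(0x41) returns the single unit 0x0041, and encodeUtf16(0x1F64B) returns the pair 0xD83D, 0xDE4B.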

    Alternatively, you can construct the character from UTF-16 or UTF-8. U+1F64B is encoded in UTF-16 as D83D DE4B, or as F0 9F 99 8B in UTF-8, so you can use either of the following:

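    QChar utf16[2] = { QChar(0xD83D), QChar(0xDE4B) };  // high surrogate, then low surrogate
    QString str1 = QString(utf16, 2);
    const char utf8[4] = { '\xF0', '\x9F', '\x99', '\x8B' };  // the four UTF-8 bytes
    QString str2 = QString::fromUtf8(utf8, 4);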
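
    The UTF-8 bytes quoted above can be derived by hand. The sketch below shows the standard bit layout in plain C++; encodeUtf8 is a name made up for this illustration and is not a Qt function.

    ```cpp
    #include <cassert>
    #include <cstdint>
    #include <vector>

    // Hypothetical helper: encodes one code point as UTF-8 bytes.
    std::vector<std::uint8_t> encodeUtf8(std::uint32_t cp) {
        assert(cp <= 0x10FFFF);              // inside the Unicode range
        assert(cp < 0xD800 || cp > 0xDFFF);  // surrogates cannot be encoded
        std::vector<std::uint8_t> bytes;
        // Appends the low 8 bits of v as one byte.
        auto push = [&bytes](std::uint32_t v) {
            bytes.push_back(static_cast<std::uint8_t>(v & 0xFF));
        };
        if (cp < 0x80) {
            push(cp);                          // 0xxxxxxx
        } else if (cp < 0x800) {
            push(0xC0 | (cp >> 6));            // 110xxxxx
            push(0x80 | (cp & 0x3F));          // 10xxxxxx
        } else if (cp < 0x10000) {
            push(0xE0 | (cp >> 12));           // 1110xxxx
            push(0x80 | ((cp >> 6) & 0x3F));
            push(0x80 | (cp & 0x3F));
        } else {
            push(0xF0 | (cp >> 18));           // 11110xxx
            push(0x80 | ((cp >> 12) & 0x3F));
            push(0x80 | ((cp >> 6) & 0x3F));
            push(0x80 | (cp & 0x3F));
        }
        return bytes;
    }
    ```

    For 0x1F64B this produces F0 9F 99 8B, the same four bytes used in the snippet above.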
    

    If you want to include the string in its literal form in source code then either of the following will work

    str1 = QString::fromWCharArray(L"\xD83D\xDE4B");
    str2 = QString::fromUtf8("\xF0\x9F\x99\x8B");
    

    If you have C++11 support then simply use the prefix u8, u and U for UTF-8, UTF-16 and UTF-32 respectively like this

    QString::fromUtf8(u8"🙋");
    
    QString::fromUtf16(u"🙋");
    QString::fromUtf16(u"\xD83D\xDE4B");  // use \x here: a \u escape may not name a surrogate
    QString::fromUtf16(u"\U0001F64B");
    
    QString::fromUcs4(U"🙋");
    QString::fromUcs4(U"\U0001F64B");
    QString::fromUcs4(U"🙋", 1);
    QString::fromUcs4(U"\U0001F64B", 1);
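
    To check what the three prefixes actually store, here is a small plain C++ sketch (C++11 or later, no Qt) that measures the same character in each literal form. The function names are invented for this example.

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <string>

    // Number of code units each literal form uses for U+1F64B.
    // sizeof counts the terminating null, hence the "- 1".
    std::size_t utf8Units()  { return sizeof(u8"\U0001F64B") - 1; }
    std::size_t utf16Units() { return std::u16string(u"\U0001F64B").size(); }
    std::size_t utf32Units() { return std::u32string(U"\U0001F64B").size(); }

    // First UTF-16 unit of the literal, i.e. the high surrogate.
    char16_t firstUtf16Unit() { return u"\U0001F64B"[0]; }

    // Sanity check tying the numbers to the encodings discussed above.
    void checkLiteralSizes() {
        assert(utf8Units() == 4);
        assert(utf16Units() == 2);
        assert(utf32Units() == 1);
        assert(firstUtf16Unit() == 0xD83D);
    }
    ```

    The compiler expands \U0001F64B into the right code units for each prefix, so you don't have to work out the surrogate pair yourself.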
    

    Mandatory article to understand text and encodings: There Ain't No Such Thing as Plain Text