Tags: c++, unicode, icu

How to convert a Unicode code point to characters in C++ using ICU?


Somehow I couldn't find the answer in Google. Probably I'm using the wrong terminology when I'm searching. I'm trying to perform a simple task, convert a number that represents a character to the characters itself like in this table: http://unicode-table.com/en/#0460

For example, if my number is 47 (which is '/'), I can just put 47 in a char and print it with cout, and I will see a slash in the console (there is no problem for numbers below 128, the ASCII range).

But if my number is 1120, the character should be 'Ѡ' (Cyrillic capital omega). I assume it is represented by several bytes, which cout would simply write out and a UTF-8 terminal would then render as 'Ѡ'.

How do I get these "several bytes" that represent 'Ѡ'?

I have a library called ICU, and I'm using UTF-8.


Solution

  • What you call a Unicode number is typically called a code point. If you want to work with Unicode strings in C++, ICU offers an icu::UnicodeString class; its constructors and methods are described in the ICU API documentation.

    To create a UnicodeString holding a single character, you can use the constructor that takes a code point in a UChar32:

    icu::UnicodeString::UnicodeString(UChar32 ch)
    

    Then you can call the toUTF8String method to convert the string to UTF-8.

    Example program:

    #include <iostream>
    #include <string>
    
    #include <unicode/unistr.h>
    
    int main() {
        icu::UnicodeString uni_str((UChar32)1120);  // code point U+0460, 'Ѡ'
        std::string str;
        uni_str.toUTF8String(str);
        std::cout << str << std::endl;
    
        return 0;
    }
    

    On a Linux system like Debian, you can compile this program with:

    g++ so.cc -o so -licuuc
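    If pkg-config metadata for ICU is installed (the module is named icu-uc on Debian and Ubuntu; check the name on your distribution), you can query the flags instead of hardcoding them:

```shell
# Compile using pkg-config to supply ICU's include and linker flags
g++ so.cc -o so $(pkg-config --cflags --libs icu-uc)
```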
    

    If your terminal supports UTF-8, this will print an omega character.