c++ boost utf-8 c++17 boost-locale

How to convert a codepoint to utf-8?


I have some code that reads in a Unicode code point (escaped in a string, e.g. 0xF00).

Since I'm using Boost, I'm wondering whether the following is the best (and a correct) approach:

unsigned int codepoint{0xF00};
boost::locale::conv::utf_to_utf<char>(&codepoint, &codepoint+1);

?


Solution

  • As mentioned, a code point in this form is (conveniently) UTF-32, so what you're looking for is a transcoding from UTF-32 to UTF-8.

    For a solution that doesn't rely on functions deprecated since C++17 (std::wstring_convert and std::codecvt_utf8), isn't really ugly, and doesn't require hefty third-party libraries, you can use the very lightweight UTF8-CPP (four small headers!) and its function utf8::utf32to8.

    It's going to look something like this:

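    #include <cstdint>
    #include <iterator>
    #include <vector>
    #include "utf8.h"  // UTF8-CPP

    const std::uint32_t codepoint{0xF00};
    std::vector<unsigned char> result;
    
    try
    {
       // Appends the UTF-8 bytes for the code point to result
       utf8::utf32to8(&codepoint, &codepoint + 1, std::back_inserter(result));
    }
    catch (const utf8::invalid_code_point&)
    {
       // The value is not a valid Unicode code point; handle it here
    }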
    

    (There's also a utf8::unchecked::utf32to8, if you're allergic to exceptions.)

    (And, as of C++20, consider writing the result into a std::vector<char8_t> or std::u8string instead.)

    (Finally, note that I've specifically used uint32_t to ensure the input has the proper width.)

    I tend to use this library in projects until I need something a little heavier for other purposes (at which point I'll typically switch to ICU).
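
    If you're curious what the transcoding actually does for a single code point, here's a minimal dependency-free sketch. Note that encode_utf8 is a hypothetical helper I made up for illustration; it is not part of UTF8-CPP or Boost, and I'd still reach for a library in real code. It assumes the input is one UTF-32 code unit and rejects surrogates and values above U+10FFFF.

    ```cpp
    #include <cstdint>
    #include <stdexcept>
    #include <string>

    // Hypothetical helper (not from UTF8-CPP or Boost): encodes a single
    // Unicode code point as a sequence of UTF-8 bytes.
    std::string encode_utf8(std::uint32_t cp)
    {
        // Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not
        // valid Unicode scalar values, so they have no UTF-8 encoding.
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");

        std::string out;
        if (cp < 0x80) {
            // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
        return out;
    }
    ```

    With this sketch, encode_utf8(0xF00) yields the three bytes E0 BC 80, which is what the library call above should also append to result for the same input.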