
C++ Unicode strings - the basic_strings know nothing about Unicode?


I see here that the C++ standard library now has typedefs for specializations of std::basic_string, such as std::u8string and std::u16string, but I don't see any member functions or algorithms that know much of anything about Unicode.

Let's say I want to iterate over the "grapheme clusters" in a string stored as UTF-8. These are the things that humans view as "characters", even though they may be multiple bytes or even multiple 32-bit code units (like the emoji flag 🇨🇴). Am I right that std::u8string has nothing for that and I need to use a library like ICU?

It appears the substr member function would split a UTF-8 or UTF-16 character in half, so again I need to use ICU or something to make sure I don't do that.
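For instance, a quick sketch (assuming C++20, where std::u8string holds char8_t code units) shows that substr counts code units and happily cuts a multi-byte sequence in half:

```cpp
#include <cstdio>
#include <string>

int main() {
    // "é" is two UTF-8 code units (0xC3 0xA9), "日" is three (0xE6 0x97 0xA5).
    std::u8string s = u8"aé日";              // 6 char8_t code units in total

    // substr() works on code-unit indexes, not characters:
    std::u8string cut = s.substr(0, 2);      // keeps 'a' and 0xC3 -- half of "é"

    for (unsigned char c : cut)
        std::printf("%02X ", static_cast<unsigned>(c));  // prints: 61 C3 (truncated sequence)
    std::printf("\n");
}
```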

Correct?


Solution

  • Am I right that std::u8string has nothing for that and I need to use a library like ICU?

    It appears the substr member function would split a UTF-8 or UTF-16 character in half, so again I need to use ICU or something to make sure I don't do that.

    Yes, you are correct on both counts. The standard C++ library has no concept of Unicode code points or grapheme clusters, only of encoded code units stored in the individual char/wchar_t/char8_t/char16_t/char32_t elements of a string.
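
    To give a concrete idea of what reaching for ICU looks like, here is a minimal sketch (not production code; exact headers and link flags depend on your ICU install) that uses icu::BreakIterator's character-break rules to walk the grapheme clusters of a UTF-8 string containing a combining mark and a flag emoji:

    ```cpp
    // Build with something like: g++ -std=c++17 graphemes.cpp $(pkg-config --libs icu-uc)
    #include <unicode/brkiter.h>
    #include <unicode/locid.h>
    #include <unicode/unistr.h>

    #include <iostream>
    #include <memory>
    #include <string>

    int main() {
        // "g" + COMBINING DIAERESIS, then the Colombian flag (two regional-indicator code points).
        std::string utf8 = u8"ag\u0308\U0001F1E8\U0001F1F4z";

        // ICU works on UTF-16 internally, so convert once up front.
        icu::UnicodeString text = icu::UnicodeString::fromUTF8(utf8);

        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::BreakIterator> it(
            icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status));
        if (U_FAILURE(status)) return 1;

        it->setText(text);

        // Walk the grapheme-cluster boundaries ICU reports.
        int32_t start = it->first();
        for (int32_t end = it->next(); end != icu::BreakIterator::DONE;
             start = end, end = it->next()) {
            std::string cluster;
            text.tempSubStringBetween(start, end).toUTF8String(cluster);
            std::cout << "cluster: " << cluster
                      << " (" << end - start << " UTF-16 code units)\n";
        }
    }
    ```

    (The boundaries ICU reports here follow the grapheme-cluster rules of Unicode's UAX #29; the standard library exposes nothing comparable, so ICU or a similar library has to fill that gap.)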