I see here that the C++ standard library now has typedefs of std::basic_string like u8string and u16string, but I don't see any member functions or algorithms that know much of anything about Unicode.
Let's say I want to iterate over the "grapheme clusters" in a string stored as UTF-8. These are the things that humans view as "characters", even though they may be multiple bytes or even multiple 32-bit code points (like the emoji flag 🇨🇴). Am I right that std::u8string has nothing for that and I need to use a library like ICU?
It appears the substr member function would split a UTF-8 or UTF-16 character in half, so again I need to use ICU or something to make sure I don't do that.
Correct?
> Am I right that std::u8string has nothing for that and I need to use a library like ICU?

> It appears the substr member function would split a UTF-8 or UTF-16 character in half, so again I need to use ICU or something to make sure I don't do that.
Yes, you are correct on both counts. The standard C++ library has no concept of Unicode code points or grapheme clusters, only of encoded code units stored in the individual char/wchar_t/char16_t/char32_t elements of strings.