c++ unicode icu icu4c

How to copy an ICU4C Unicode string to another Unicode string character by character?


I'm trying to use ICU's StringCharacterIterator to copy (and possibly alter) characters from a source string to a destination string. However, I am getting unexpected results and am unsure why.

I would expect the final line of output of this program to be dog, but instead I get og￿

#include <iostream>
#include <icu4c/unicode/schriter.h>

int main()
{
    UnicodeString dog = UnicodeString::fromUTF8("dog");
    StringCharacterIterator chars(dog);

    UnicodeString copy;
    while(chars.hasNext())
        copy.append(chars.next32());

    for(int i=0; i<copy.countChar32(); i++)
    {
        int32_t charNumber = copy.char32At(i);
        std::cout << charNumber << "\n";
    }

    std::string stdString;
    copy.toUTF8String(stdString);
    std::cout << stdString;
}

Program Output

111
103
65535
og￿

Unicode table

111 - latin small letter o

103 - latin small letter g


Solution

  • You have two problems:

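    1. StringCharacterIterator::hasNext returns true as long as the current position is before the end of the string. It returns false only once the iterator has moved past the last character.
    2. StringCharacterIterator::next32 advances the current position of the iterator first and then returns the code point at the new position. It is analogous to *(++it) for a raw pointer or standard library style iterator. When the advance moves past the end of the string, it returns CharacterIterator::DONE (0xFFFF, which is the 65535 in your output).

    Taken together, this means you're skipping the first character of your string and, on the final iteration, appending the DONE sentinel instead of a real character.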
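To make the interaction of the two points above concrete, here is a minimal sketch that models the same semantics without ICU. ToyIterator and collectWithNext are hypothetical names invented for this illustration. They are not part of ICU and only imitate the "advance, then read" behaviour described above for single-byte characters.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an iterator whose next() advances before reading.
// This is NOT ICU code; all names here are hypothetical.
struct ToyIterator {
    enum { DONE = 0xFFFF };  // sentinel, the same value as the 65535 in the question's output

    explicit ToyIterator(std::string s) : text(std::move(s)), pos(0) {}

    // True while the current position is before the end of the text.
    bool hasNext() const { return pos < text.size(); }

    // Advance first, then read: the equivalent of *(++it).
    int next() {
        ++pos;
        return pos < text.size() ? static_cast<unsigned char>(text[pos]) : DONE;
    }

    std::string text;
    std::size_t pos;
};

// Runs the loop from the question against the toy iterator and
// collects every value it would have appended.
std::vector<int> collectWithNext(const std::string& s) {
    ToyIterator it(s);
    std::vector<int> out;
    while (it.hasNext())
        out.push_back(it.next());
    return out;
}
```

For the input "dog", collectWithNext yields 111, 103, 65535: the character at position 0 is never read, and the last call steps past the end and returns the sentinel.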

    Use next32PostInc instead of next32. It returns the code point at the current position and then advances, like *(it++) for a raw pointer or standard library iterator:

    while(chars.hasNext())
        copy.append(chars.next32PostInc());