c++ node.js unicode utf-16 icu

How to use ICU with UTF-16?


I'm looking into using ICU for Unicode string processing in a native Node.js module because it seems to me that v8::String (according to these docs) doesn't have a C++ API for this purpose.

To my knowledge V8 expects UTF-16 in ExternalStringResource and other APIs, so I'd like to use ICU for UTF-16 processing. I specifically need to:

So I looked at the ICU documentation and found the UnicodeString and CharacterIterator classes. However, UnicodeString doesn't have a fromUTF16 method, only fromUTF8 and fromUTF32.

The other thing I'm unsure about is whether the UnicodeString constructor copies the data I give it. I'd much prefer a zero-copy approach: I only need an immutable object, so it shouldn't perform any copy operations, just use the buffer I point it at.

I'm also unsure whether I can just use UCharIterator (assuming I can somehow get a UChar* from my UTF-16 strings).

So my question is: How do I use ICU for the above purposes?


Solution

  • UnicodeString stores its contents as UTF-16. That's why it only has fromUTF8 and fromUTF32: UTF-16 input needs no conversion, so you pass it straight to a constructor.

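    The regular constructors do copy the data: UnicodeString is an owning string, much like std::string. The exception is the read-only aliasing constructor, which takes an isTerminated flag, the buffer, and its length, and uses your buffer without copying until the string is modified.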

    You can use UCharIterator if you don't want to copy the data. UChar is a 16-bit value. You can force it to be whatever 16-bit type you prefer working with by defining the UCHAR_TYPE macro:

    Define UChar to be UCHAR_TYPE, if that is #defined (for example, to char16_t), or wchar_t if that is 16 bits wide; always assumed to be unsigned.

    If neither is available, then define UChar to be uint16_t.

    This makes the definition of UChar platform-dependent but allows direct string type compatibility with platforms with 16-bit wchar_t types.
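
To make the zero-copy idea concrete, here is a minimal sketch in plain C++ of walking a UTF-16 buffer code point by code point without copying it. This is the kind of work that ICU's UCharIterator and the U16_NEXT macro do for you. It deliberately calls no ICU functions so that it compiles on its own; Utf16View, nextCodePoint and codePoints are hypothetical names invented for this illustration, and the buffer is assumed to be char16_t (one of the types UChar can be defined as).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type (not an ICU API): a non-owning, read-only
// view over UTF-16 code units. Nothing is copied; the caller keeps
// ownership of the buffer and must keep it alive.
struct Utf16View {
    const char16_t* data;
    std::size_t length;  // number of 16-bit code units, not code points
};

// Decode the code point that starts at index i and advance i past it.
// A lead surrogate followed by a trail surrogate is combined into one
// supplementary code point; an unpaired surrogate is returned unchanged.
char32_t nextCodePoint(const Utf16View& s, std::size_t& i) {
    assert(i < s.length);  // precondition: i points at a valid code unit
    char32_t c = s.data[i++];
    if (c >= 0xD800 && c <= 0xDBFF && i < s.length) {
        char32_t trail = s.data[i];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++i;
            c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return c;
}

// Collect all code points of the view. Only the output vector allocates;
// the UTF-16 input itself is read in place.
std::vector<char32_t> codePoints(const Utf16View& s) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < s.length;) {
        out.push_back(nextCodePoint(s, i));
    }
    return out;
}
```

For example, a view over the four code units of u"a\U0001F600b" yields three code points, because the middle two units form a surrogate pair. In a real module you would more likely hand the same pointer and length to ICU than maintain this decoding yourself.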