I'm looking into using ICU for Unicode string processing in a native Node.js module because it seems to me that v8::String
(according to these docs) doesn't have a C++ API for this purpose.
To my knowledge V8 expects UTF-16 in ExternalStringResource and other APIs, so I'd like to use ICU for UTF-16 processing. I specifically need to:
So I looked at the ICU documentation and found the UnicodeString and CharacterIterator classes. However, UnicodeString doesn't have a fromUTF16 method, only fromUTF8 and fromUTF32.
The other thing I'm unsure about is whether the UnicodeString constructor copies the data I give it. I'd strongly prefer a zero-copy approach: I only need an immutable object, so it shouldn't perform any copy operations, just use the buffer I point it at.
I'm also unsure if I can just use UCharIterator (assuming I can somehow obtain a UChar* from my UTF-16 strings).
So my question is: How do I use ICU for the above purposes?
UnicodeString uses UTF-16 for storage by default. That's why it only has fromUTF8 and fromUTF32: from UTF-16 there is no conversion to be made.
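Its ordinary constructors do copy the data. It is an owning string, much like std::string. (ICU also documents a read-only aliasing constructor, the one that takes an isTerminated flag before the buffer and length, which points at an existing buffer and only copies if the string is later modified.)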
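To make the owning versus non-owning distinction concrete, here is a minimal sketch in standard C++ only, with no ICU involved. Utf16View, owning_string_copies and make_view are hypothetical names made up for this illustration; std::u16string stands in for any owning string type.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical non-owning view: just a pointer and a length, no allocation.
// The caller must keep the underlying buffer alive while the view is in use.
struct Utf16View {
    const char16_t* data;
    std::size_t length;
};

// An owning string copies the code units into its own storage, so its
// data pointer differs from the source pointer.
bool owning_string_copies(const char16_t* src, std::size_t len) {
    std::u16string owned(src, len);
    return owned.data() != src;
}

// A view simply records where the existing buffer is.
Utf16View make_view(const char16_t* src, std::size_t len) {
    return Utf16View{src, len};
}
```

The trade-off is the usual one: the owning string is safe to keep around after the source buffer goes away, while the view costs nothing to create but is only valid as long as the buffer is.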
You can use UCharIterator if you don't want to copy the data. UChar is a 16-bit value. You can force it to be whatever 16-bit type you prefer working with by defining the UCHAR_TYPE macro:
Define UChar to be UCHAR_TYPE, if that is #defined (for example, to char16_t), or wchar_t if that is 16 bits wide; always assumed to be unsigned.
If neither is available, then define UChar to be uint16_t.
This makes the definition of UChar platform-dependent but allows direct string type compatibility with platforms with 16-bit wchar_t types.
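As a rough illustration of what iterating over a UTF-16 buffer in place involves, the following sketch walks a const char16_t* by code point without copying it. It is hand-rolled standard C++, not ICU code: Utf16Cursor, next_code_point and count_code_points are hypothetical names. With ICU you would presumably reach for uiter_setString and uiter_next32, or the U16_NEXT macro, instead; check the headers of your ICU version for the exact declarations.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>
#include <cstdint>

// Assumption of this sketch: char16_t is exactly 16 bits wide.
static_assert(sizeof(char16_t) * CHAR_BIT == 16,
              "this sketch assumes a 16-bit char16_t");

// Hypothetical non-owning cursor over a UTF-16 buffer owned by the caller.
struct Utf16Cursor {
    const char16_t* text;
    std::size_t length;  // number of 16-bit code units
    std::size_t index;   // current position, in code units
};

// Returns the next code point and advances the cursor, or -1 at the end.
// An unpaired surrogate is returned as-is rather than treated as an error.
std::int32_t next_code_point(Utf16Cursor& cur) {
    if (cur.index >= cur.length) return -1;
    char16_t lead = cur.text[cur.index++];
    if (lead >= 0xD800 && lead <= 0xDBFF && cur.index < cur.length) {
        char16_t trail = cur.text[cur.index];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++cur.index;
            // Combine the surrogate pair into one supplementary code point.
            return 0x10000 + ((std::int32_t(lead) - 0xD800) << 10)
                           + (std::int32_t(trail) - 0xDC00);
        }
    }
    return lead;
}

// Counts code points (not code units) without copying the buffer.
std::size_t count_code_points(const char16_t* text, std::size_t length) {
    Utf16Cursor cur{text, length, 0};
    std::size_t n = 0;
    while (next_code_point(cur) >= 0) ++n;
    return n;
}
```

The cursor never allocates and never writes to the buffer, so it can sit directly on top of memory owned by something else, as long as that memory outlives the cursor.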