Tags: utf-8, utf-16

Is there a way to map a UTF-16 byte sequence to the length the equivalent UTF-8 byte sequence would have for the same codepoints?


I have a valid array of UTF-16LE encoded bytes. Some of the code units are surrogates.

Is there a way to tell from their bits how many UTF-8 bytes would be needed?

I know I could convert to UTF-8 and count, but since I only need the length, I wonder whether it can be done with some simple range tests on the byte values.


Solution

  • It is indeed possible to compute this without converting, though it is still an O(n) procedure, just as the conversion would be.

    In UTF-8, codepoints up to U+007F take one byte, two bytes up to U+07FF, three bytes up to U+FFFF, and four bytes above U+FFFF. Coincidentally (I don't think this was a design consideration for UTF-8/UTF-16), codepoints from U+0080 to U+07FF (two bytes in each encoding) and above U+FFFF (four bytes in each, since UTF-16 uses a surrogate pair there) have the same encoded length in UTF-8 and UTF-16. You can therefore compute the UTF-8 length from a UTF-16 string along the lines of:

    int utf8Length = utf16String.Length * 2; // start from the UTF-16 byte count (2 bytes per code unit)
    foreach (char c in utf16String) // assuming a language with a 2-byte char datatype, e.g. C#
    {
        if (c <= 0x7F)
            utf8Length--; // 1 UTF-8 byte instead of 2 UTF-16 bytes
        else if ((c > 0x07FF && c < 0xD800) || c > 0xDFFF) // excludes the surrogate range
            utf8Length++; // 3 UTF-8 bytes instead of 2 UTF-16 bytes
        // U+0080..U+07FF and surrogate pairs need no adjustment
    }
    return utf8Length;
    

    For simplicity, I used greater-than/less-than comparisons, but given the code unit values involved, bitmask tests against zero might perform better; see the sketch below.
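
    As a rough sketch (not part of the original answer), the same counting can be applied directly to the raw UTF-16LE byte array the question starts from, using the bitmask tests hinted at above. The helper name Utf8LengthFromUtf16le is made up for illustration, and the code assumes C# and valid UTF-16LE input:

    static int Utf8LengthFromUtf16le(byte[] utf16le)
    {
        int utf8Length = utf16le.Length; // start from the UTF-16 byte count
        for (int i = 0; i + 1 < utf16le.Length; i += 2)
        {
            // Reassemble the little-endian code unit: low byte first, then high byte.
            int c = utf16le[i] | (utf16le[i + 1] << 8);
            if ((c & 0xFF80) == 0)
                utf8Length--; // U+0000..U+007F: 1 UTF-8 byte instead of 2 UTF-16 bytes
            else if ((c & 0xF800) != 0 && (c & 0xF800) != 0xD800)
                utf8Length++; // U+0800..U+FFFF excluding surrogates: 3 bytes instead of 2
            // U+0080..U+07FF and surrogate pairs need no adjustment.
        }
        return utf8Length;
    }

    The masks mirror the range tests: (c & 0xFF80) == 0 is equivalent to c <= 0x7F, and (c & 0xF800) keeps the top five bits of the code unit, which are all zero below U+0800 and equal 0xD800 exactly for the surrogate range U+D800..U+DFFF.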