Tags: utf-8, utf-16

Is there a way to map a UTF-16 byte sequence to the length the equivalent UTF-8 byte sequence would have for the same codepoints?


I have a valid array of UTF-16LE encoded bytes. Some of the code units are surrogates.

Is there a way to tell from their bits how many UTF-8 bytes would be needed?

I know I could convert to UTF-8 and count, but since I only need the length, I wonder whether it can be done with some simple range tests on the byte values.


Solution

  • It is indeed possible to compute this without converting, though it is still an O(n) procedure, just as the conversion would be.

    In UTF-8, codepoints up to U+007F take one byte, two bytes up to U+07FF, three bytes up to U+FFFF, and four bytes above U+FFFF. Coincidentally (I don't think this was a design consideration for UTF-8/UTF-16), codepoints from U+0080 to U+07FF (two bytes in each encoding) and above U+FFFF (four bytes in each, since UTF-16 uses a surrogate pair there) have the same encoded length in UTF-8 and UTF-16. You can therefore compute the UTF-8 length from a UTF-16 string along the lines of:

    int utf8Length = utf16String.Length * 2; // start from the UTF-16 byte count (2 bytes per code unit)
    foreach (char c in utf16String) // assuming a language with a 2-byte char datatype, e.g. C#
    {
        if (c <= 0x7F)
            utf8Length--; // 1 UTF-8 byte instead of 2 UTF-16 bytes
        else if ((c > 0x07FF && c < 0xD800) || c > 0xDFFF) // excludes the surrogate range
            utf8Length++; // 3 UTF-8 bytes instead of 2 UTF-16 bytes
        // U+0080..U+07FF and surrogate pairs need no adjustment
    }
    return utf8Length;
    

    For simplicity, I used greater-than/less-than comparisons, but given the code unit values involved, bitmask tests against zero might perform better; see the sketch below.
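
    As a rough sketch (not part of the original answer), the same counting can be applied directly to the raw UTF-16LE byte array the question starts from, using the bitmask tests hinted at above. The helper name Utf8LengthFromUtf16le is made up for illustration, and the code assumes C# and valid UTF-16LE input:

    static int Utf8LengthFromUtf16le(byte[] utf16le)
    {
        int utf8Length = utf16le.Length; // start from the UTF-16 byte count
        for (int i = 0; i + 1 < utf16le.Length; i += 2)
        {
            // Reassemble the little-endian code unit: low byte first, then high byte.
            int c = utf16le[i] | (utf16le[i + 1] << 8);
            if ((c & 0xFF80) == 0)
                utf8Length--; // U+0000..U+007F: 1 UTF-8 byte instead of 2 UTF-16 bytes
            else if ((c & 0xF800) != 0 && (c & 0xF800) != 0xD800)
                utf8Length++; // U+0800..U+FFFF excluding surrogates: 3 bytes instead of 2
            // U+0080..U+07FF and surrogate pairs need no adjustment.
        }
        return utf8Length;
    }

    The masks mirror the range tests: (c & 0xFF80) == 0 is equivalent to c <= 0x7F, and (c & 0xF800) keeps the top five bits of the code unit, which are all zero below U+0800 and equal 0xD800 exactly for the surrogate range U+D800..U+DFFF.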