I have 3 bytes representing a Unicode character encoded in UTF-8. For example, I have E2 82 AC
(UTF-8), which represents the Unicode character € (U+20AC)
. Is there any algorithm to make this conversion? I know there is the Windows API MultiByteToWideChar, but I would like to know if there is a simple mathematical relation between E2 82 AC and U+20AC. So, is the mapping from UTF-8 to UTF-16 a simple mathematical function, or is it a hardcoded map?
Converting a valid UTF-8 byte sequence directly to UTF-16 is doable with a little mathematical know-how.
Validating a UTF-8 byte sequence is fairly trivial. See The Algorithm to Validate an UTF-8 String. Be sure to check for things like:
(byte and $C0) = $80
is true for each subsequent byte in the sequence.

The first byte in a UTF-8 sequence tells you how many bytes are in the sequence:
(byte1 and $80) = $00: 1 byte // 0xxxxxxx
(byte1 and $E0) = $C0: 2 bytes // 110xxxxx
(byte1 and $F0) = $E0: 3 bytes // 1110xxxx
(byte1 and $F8) = $F0: 4 bytes // 11110xxx
anything else: error
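For illustration, a minimal Delphi-style sketch of that lead-byte classification might look like this (the function name SequenceLength is made up for this example):

function SequenceLength(byte1: Byte): Integer;
begin
  // Returns the expected length of the UTF-8 sequence based on the
  // lead byte, or 0 if the lead byte is not a valid sequence start.
  if (byte1 and $80) = $00 then
    Result := 1   // 0xxxxxxx
  else if (byte1 and $E0) = $C0 then
    Result := 2   // 110xxxxx
  else if (byte1 and $F0) = $E0 then
    Result := 3   // 1110xxxx
  else if (byte1 and $F8) = $F0 then
    Result := 4   // 11110xxx
  else
    Result := 0;  // error: not a valid lead byte
end;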
There are very simple formulas for converting UTF-8 1-byte, 2-byte, and 3-byte sequences to UTF-16, since they all represent Unicode codepoints below U+10000
and thus can be represented as-is in UTF-16 using just one 16-bit codeunit. No surrogates are needed, just some bit twiddling, eg:
1 byte:
UTF16 = UInt16(byte1 and $7F)
2 bytes:
UTF16 = (UInt16(byte1 and $1F) shl 6)
or UInt16(byte2 and $3F)
3 bytes:
UTF16 = (UInt16(byte1 and $0F) shl 12)
or (UInt16(byte2 and $3F) shl 6)
or UInt16(byte3 and $3F)
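Putting those three formulas together, a minimal Delphi-style sketch could look like this (DecodeBmpSequence is a hypothetical name, and the sequence is assumed to have already been validated):

function DecodeBmpSequence(const Bytes: array of Byte; Len: Integer): UInt16;
begin
  // Decodes a validated 1-, 2-, or 3-byte UTF-8 sequence into a single
  // UTF-16 code unit (all such sequences are below U+10000).
  case Len of
    1: Result := UInt16(Bytes[0] and $7F);
    2: Result := (UInt16(Bytes[0] and $1F) shl 6) or
                  UInt16(Bytes[1] and $3F);
    3: Result := (UInt16(Bytes[0] and $0F) shl 12) or
                 (UInt16(Bytes[1] and $3F) shl 6) or
                  UInt16(Bytes[2] and $3F);
  else
    Result := $FFFD; // U+FFFD REPLACEMENT CHARACTER for anything unexpected
  end;
end;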
Converting a UTF-8 4-byte sequence to UTF-16, on the other hand, is slightly more involved, since it represents a Unicode codepoint that is U+10000
or higher, and thus has to be encoded as a UTF-16 surrogate pair, which requires some additional math, eg:
4 bytes:
CP = (UInt32(byte1 and $07) shl 18)
or (UInt32(byte2 and $3F) shl 12)
or (UInt32(byte3 and $3F) shl 6)
or UInt32(byte4 and $3F)
CP = CP - $10000
highSurrogate = $D800 + UInt16((CP shr 10) and $3FF)
lowSurrogate = $DC00 + UInt16(CP and $3FF)
UTF16 = highSurrogate, lowSurrogate
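A matching Delphi-style sketch for the 4-byte case, again with made-up names and assuming the sequence has already been validated, could look like this:

procedure DecodeSupplementarySequence(b1, b2, b3, b4: Byte;
  out HighSurrogate, LowSurrogate: UInt16);
var
  CP: UInt32;
begin
  // Reassemble the codepoint from the payload bits of the 4 bytes...
  CP := (UInt32(b1 and $07) shl 18) or
        (UInt32(b2 and $3F) shl 12) or
        (UInt32(b3 and $3F) shl 6) or
         UInt32(b4 and $3F);
  // ...then split it into a UTF-16 surrogate pair.
  CP := CP - $10000;
  HighSurrogate := $D800 + UInt16((CP shr 10) and $3FF);
  LowSurrogate  := $DC00 + UInt16(CP and $3FF);
end;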
Now, with that said, let's look at your example: E2 82 AC
The first byte $E2 has a bit pattern of 11100010, which matches the start pattern of a 3-byte sequence (ie ($E2 and $F0) = $E0 is true).

The second byte $82 has a bit pattern of 10000010, which matches the pattern of a continuation byte (ie ($82 and $C0) = $80 is true).

The third byte $AC has a bit pattern of 10101100, which matches the pattern of a continuation byte (ie ($AC and $C0) = $80 is true).
So, this is a well-formed UTF-8 3-byte sequence.
Plugging those byte values into the 3-byte formula, you get:
UTF16 = (UInt16($E2 and $0F) shl 12)
or (UInt16($82 and $3F) shl 6)
or UInt16($AC and $3F)
= (UInt16($02) shl 12)
or (UInt16($02) shl 6)
or UInt16($2C)
= $2000
or $80
or $2C
= $20AC
Validating further: the most compact UTF-8 form of U+20AC does require 3 bytes, which proves that the byte sequence is not an "overlong" encoding. And U+20AC is not a restricted codepoint.
So, this is a valid UTF-8 3-byte sequence.
And indeed, Unicode codepoint U+20AC is encoded in UTF-16 as the single codeunit $20AC.
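For completeness, here is how the hypothetical helpers sketched above could be applied to the bytes from your example (this assumes SequenceLength and DecodeBmpSequence are pasted into the same program; it is just an illustration, not a full UTF-8 decoder):

program EuroSignDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// (SequenceLength and DecodeBmpSequence as sketched above go here)

const
  Bytes: array[0..2] of Byte = ($E2, $82, $AC);
var
  CodeUnit: UInt16;
begin
  // SequenceLength(Bytes[0]) reports a 3-byte sequence, and
  // DecodeBmpSequence then produces the single UTF-16 code unit $20AC.
  CodeUnit := DecodeBmpSequence(Bytes, SequenceLength(Bytes[0]));
  WriteLn(IntToHex(CodeUnit, 4)); // prints 20AC
end.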