unicode utf-8 character-encoding byte-order-mark

UTF-8 multibyte & BOM


I read this great tutorial:
http://www.joelonsoftware.com/articles/Unicode.html

But I didn't understand how UTF-8 solves the big-endian vs. little-endian machine problem. For a single byte it's fine, but how does it work for multibyte characters?

Can someone explain better?


Solution

  • Here is a link that explains UTF-8 in depth: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8

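    At the heart of it, UTF-16 is oriented around short (16-bit) integers, while UTF-8 is oriented around bytes. Since architectures can differ in how the bytes of a multi-byte datatype are ordered (big-endian or little-endian), the UTF-16 encoding can go either way. On all architectures I am aware of, there is no endianness within a byte, at the nibble (semi-octet) level: every byte is a sequential series of 8 bits. A multibyte UTF-8 character is simply a sequence of such bytes, and their order (lead byte first, then the continuation bytes) is fixed by the encoding itself rather than by the machine. Therefore UTF-8 has no endianness.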

    The Japanese character あ is a good example. It is U+3042 (binary=0011 0000 : 0100 0010).

    Here is some information on Unicode あ
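
To make the byte-order point concrete, here is a minimal Python sketch (not part of the original answer) that encodes あ (U+3042) in several ways. The helper name `utf8_encode_3byte` is hypothetical and only handles the 3-byte range; it is there to show that the UTF-8 bytes come from bit arithmetic on the code point, not from how the machine lays out integers in memory.

```python
import codecs


def utf8_encode_3byte(code_point: int) -> bytes:
    """Hypothetical helper: encode U+0800..U+FFFF as a 3-byte UTF-8 sequence."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles 3-byte sequences")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points are not valid in UTF-8")
    lead = 0xE0 | (code_point >> 12)          # 1110xxxx: top 4 bits
    mid = 0x80 | ((code_point >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    tail = 0x80 | (code_point & 0x3F)         # 10xxxxxx: low 6 bits
    # The order lead, mid, tail is dictated by UTF-8, not by the CPU.
    return bytes([lead, mid, tail])


ch = "\u3042"  # あ

utf8 = ch.encode("utf-8")          # same bytes on every machine
utf16_be = ch.encode("utf-16-be")  # one 16-bit unit, high byte first
utf16_le = ch.encode("utf-16-le")  # the same unit, low byte first
with_bom = ch.encode("utf-8-sig")  # UTF-8 preceded by the BOM EF BB BF

print(utf8_encode_3byte(0x3042).hex())  # e38182
print(utf8.hex())                       # e38182
print(utf16_be.hex())                   # 3042
print(utf16_le.hex())                   # 4230
print(with_bom.hex())                   # efbbbfe38182
```

UTF-16 has two possible byte serializations of the same 16-bit unit, which is why it needs a BOM (or an explicit `-be`/`-le` label) to tell them apart. UTF-8 has only one, so a BOM in UTF-8 can act only as a signature marking the data as UTF-8; it carries no byte-order information.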