I have read this great tutorial:
http://www.joelonsoftware.com/articles/Unicode.html
But I didn't understand how UTF-8 solves the big-endian/little-endian machine issue. For a single byte it's fine, but how does it work for multi-byte characters?
Can someone explain better?
Here is a link that explains UTF-8 in depth: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
At the heart of it, UTF-16 is 16-bit (short integer) oriented and UTF-8 is byte oriented. Since architectures can differ in how the bytes of a data type are ordered (big-endian vs. little-endian), a UTF-16 code unit can be stored either way. On every architecture I am aware of there is no endianness at the nibble or semi-octet level; a byte is always a sequential series of 8 bits. Therefore UTF-8 has no endianness.
The Japanese character あ is a good example. It is U+3042 (binary = 0011 0000 : 0100 0010). In UTF-8 those bits are spread across three bytes, E3 81 82, whose order is fixed by the encoding itself, so every machine writes them the same way. In UTF-16 the same character is a single 16-bit unit, which can end up on disk as 30 42 or 42 30 depending on the byte order used.
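
To see this concretely, here is a small Python sketch (not part of the original answer) that encodes あ with the standard library and prints the resulting bytes for UTF-8 and for both UTF-16 byte orders:

    # U+3042, HIRAGANA LETTER A
    ch = "\u3042"

    # UTF-8: the byte sequence is fixed by the encoding itself,
    # so it is identical on big-endian and little-endian machines.
    print(ch.encode("utf-8").hex(" "))      # e3 81 82

    # UTF-16: the character is one 16-bit unit, so its two bytes
    # can be stored in either order depending on the chosen endianness.
    print(ch.encode("utf-16-be").hex(" "))  # 30 42
    print(ch.encode("utf-16-le").hex(" "))  # 42 30

This is why UTF-16 streams often carry a byte-order mark (BOM) while UTF-8 does not need one for byte-order purposes.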