Tags: text, unicode, binary-data

The byte order in UTF-8


I've read a question elsewhere about why there is no need for a Byte Order Mark in UTF-8, especially on systems with different endianness.

To me, the TL;DR is that UTF-8 forces you to write the same bytes in the same order to memory, those bytes being the encoded form of the character, and they are always read back the same way (byte by byte).


A more detailed explanation

A character could have the encoding 11100010 10000010 10101100 (this is the Euro sign, €), and it will be represented the same way on any computer, because UTF-8 itself specifies how a character is encoded as a sequence of bytes.
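
For example, a quick check in Python (a minimal sketch, assuming a Python 3 interpreter) shows that encoding the Euro sign always yields exactly those bytes, in exactly that order:

    import sys

    # UTF-8 is defined by the Unicode standard, not by the hardware,
    # so this byte sequence is identical on every platform.
    data = "€".encode("utf-8")
    print(data)           # b'\xe2\x82\xac'  (11100010 10000010 10101100)
    print(sys.byteorder)  # 'little' or 'big' -- no effect on the bytes above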

Each of those bytes will be read one by one, as the decoding process dictates, so again there is no ambiguity.

When the first byte starts with the bits 1110, the decoder knows the sequence is 3 bytes long; it reads the remaining bytes and combines their payload bits to recover the code point, which it then uses to look up the character. This is how multi-byte characters are read, as sketched below.
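
As a rough sketch of that rule (simplified Python for illustration only; it skips validating that continuation bytes actually start with 10, which a real decoder must do):

    def decode_first_codepoint(data: bytes) -> str:
        b0 = data[0]
        if b0 < 0x80:              # 0xxxxxxx: 1-byte sequence (ASCII)
            length, cp = 1, b0
        elif b0 >> 5 == 0b110:     # 110xxxxx: 2-byte sequence
            length, cp = 2, b0 & 0x1F
        elif b0 >> 4 == 0b1110:    # 1110xxxx: 3-byte sequence
            length, cp = 3, b0 & 0x0F
        else:                      # 11110xxx: 4-byte sequence
            length, cp = 4, b0 & 0x07
        for b in data[1:length]:   # each continuation byte carries 6 payload bits
            cp = (cp << 6) | (b & 0x3F)
        return chr(cp)

    print(decode_first_codepoint(b"\xe2\x82\xac"))  # €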

Is this a correct explanation? If you disagree, why, and what is the correct reason?


Solution

  • I'm not quite sure what you're asking, so this might not be the answer you're looking for.

    Byte order only matters when you're dealing with integer primitives that are larger than a single byte.

    For example, if you're storing the number 5 as a 16-bit value, it would be naturally stored as the following on big-endian (e.g. PowerPC or SPARC) hardware:

    00000000 00000101
    

    Whereas on little-endian (e.g. x86) hardware it would be stored as the following:

    00000101 00000000
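
    You can make both layouts visible with Python's struct module, which lets you request either byte order explicitly (a small illustrative snippet):

    import struct

    print(struct.pack(">H", 5))  # b'\x00\x05' -- big-endian 16-bit
    print(struct.pack("<H", 5))  # b'\x05\x00' -- little-endian 16-bit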
    

    Since UTF-8 consists entirely of a stream of bytes, byte order is never a consideration. Yes, there are code points which require multiple bytes to represent, but reading and writing of those code points still needs to be done one byte at a time. The order of those bytes is well-defined, and the endianness of the hardware does not matter.
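
    To see the contrast directly (a sketch, assuming Python 3): pack an integer using the machine's native byte order, then encode some text as UTF-8. The integer's in-memory layout depends on the host CPU; the UTF-8 bytes do not:

    import struct, sys

    # Native byte order: this result differs between platforms.
    print(sys.byteorder, struct.pack("=H", 5))

    # UTF-8 is a defined byte sequence: identical everywhere,
    # which is why no Byte Order Mark is needed.
    print("€".encode("utf-8"))  # b'\xe2\x82\xac' on any platform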