assembly x86 fasm

Endianness doesn't affect writing, only reading from memory


I've come to a conclusion that applies to both little endian and big endian.

We write to memory from left to right, which means the number 0x00FF is written as follows on both systems:

1000:00

1001:FF

However, reading differs depending on the endianness.

In little endian we will read those two bytes

1000:00

1001:FF

as 0xFF00, and in big endian we will read them as 0x00FF.

Now you may ask why, then, if I do something like:

mov word [esp],0x00FF

on a little-endian processor the result is 0x00FF, even though I said that in little endian the result would be 0xFF00, which seems to completely debunk what I've said.

Well, it seems the assembler just reversed the number to 0xFF00. Take a look:

[image not preserved]

If the assembler didn't reverse the number, we would read it as 0xFF00.

So basically, because the assembler reversed it, we write the number as

1000:FF

1001:00

in memory, and we start reading it from the least significant byte, so we get 0x00FF.

Am I right, or does it work differently?


Solution

  • Endianness is a relationship between a numeric value that spans multiple storage units (usually bytes) and the sequence of those units. It can be expressed as a pair of formulas: one decomposes a single multi-byte value into a sequence of bytes, and the other recomposes a sequence of bytes into a single value.

    (Endianness doesn't tell us how the processor performs these operations, just that they work according to the formulas below.  Specifically, we don't know what ordering in time is used to fulfill the formulas; the formulas are independent of time and are sensitive only to the ordering of bytes in the sequence.)

    For example, take the 16-bit value 0x1234. We are going to store it in memory as a sequence of two bytes: a lower byte, stored at the lower address, and a higher byte, stored at the higher address, where higher address = lower address + 1.


    The following formulas decompose the value using little endian:

    lower byte  = 0x1234 & 0x00FF            = 0x34
    higher byte = 0x1234 / 256 = 0x1234 >> 8 = 0x12
    

    The little endian recomposition formula is

    value = lower byte + higher byte * 256 = 0x34 + 0x12 * 256 = 0x1234
    

    For big endian, the formulas (as compared with little endian) swap which byte is multiplied/divided:

    lower byte  = 0x1234 / 256    = 0x1234 >> 8 = 0x12
    higher byte = 0x1234 & 0x00FF =               0x34
    

    And recomposition:

    value = lower byte * 256 + higher byte = 0x12 * 256 + 0x34 = 0x1234
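The four formulas above can be written out as a minimal Python sketch. The function names are made up for illustration, and the value is assumed to fit in 16 bits (0 to 0xFFFF). In each pair, "lower" means the byte at the lower address.

```python
def decompose_little(value):
    """Return (byte at lower address, byte at higher address), little endian."""
    return value & 0x00FF, value >> 8


def recompose_little(lower, higher):
    return lower + higher * 256


def decompose_big(value):
    """Return (byte at lower address, byte at higher address), big endian."""
    return value >> 8, value & 0x00FF


def recompose_big(lower, higher):
    return lower * 256 + higher


value = 0x1234
le = decompose_little(value)
be = decompose_big(value)
print([hex(b) for b in le])  # ['0x34', '0x12']
print([hex(b) for b in be])  # ['0x12', '0x34']

# Each decomposition is undone by its own recomposition.
print(hex(recompose_little(*le)), hex(recompose_big(*be)))  # 0x1234 0x1234
```

Mixing the pairs (decomposing with one and recomposing with the other) is what produces the byte-swapped value, 0x3412 in this case.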
    

    These formulas are built into the processors and are known in advance, so when the assembler assembles data such as:

    .data
    dw 0x1234
    

    It knows that (1) this is 16-bit data and (2) the target hardware is little endian.  So it will put the bytes 0x34, 0x12 in memory, in that order, following the formulas for little endian decomposition.  (Again, this is not time ordering but relative sequencing.)
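As a rough illustration of what the assembler does for such a data directive, here is a small Python sketch. `emit_dw` is a hypothetical stand-in for `dw`, not anything an assembler actually exposes; it relies on the standard `struct` module, where the `"<H"` format means "unsigned 16-bit, little endian".

```python
import struct


def emit_dw(*words):
    """Hypothetical stand-in for `dw` on a little-endian target."""
    return b"".join(struct.pack("<H", w) for w in words)


image = emit_dw(0x1234, 0x00FF)
print(" ".join(f"{b:02X}" for b in image))  # 34 12 FF 00
```

The low byte of each word lands at the lower address, which is why 0x00FF shows up as FF 00 in the output bytes.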

    For instructions, the assembler encodes each instruction and any immediates it needs according to the machine code format of the instruction set architecture.  The processor recovers the immediate as part of instruction decoding.  On Intel processors, the immediate inside the machine code instruction is also encoded little endian, although the encoding may be shorter than the full size of the immediate written in assembly language.  Either way, the processor reconstitutes the proper constant internally and then uses it.  If an immediate is stored to memory, the processor uses the little endian decomposition formulas to create the sequence of bytes to store (two bytes for a 16-bit store), just as it would when storing a register's value to memory.
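To make this concrete for the instruction in the question, here is a sketch in Python that picks the immediate back out of the machine code bytes. The byte sequences are the encodings I would expect in 32-bit mode (an assumption; in 16-bit mode the prefixes differ), and the slicing is only a toy stand-in for real instruction decoding.

```python
# Assumed 32-bit-mode encoding of: mov word [esp], 0x00FF
#   66     operand-size prefix (selects a 16-bit operand)
#   C7     MOV r/m16, imm16 (with /0 in the ModRM reg field)
#   04 24  ModRM + SIB bytes selecting [esp]
#   FF 00  the 16-bit immediate, low byte first
mov_word = bytes([0x66, 0xC7, 0x04, 0x24, 0xFF, 0x00])
imm16 = int.from_bytes(mov_word[-2:], "little")
print(hex(imm16))  # 0xff

# Assumed encoding of: add eax, -1
#   83 C0  ADD r/m32, imm8 with eax as the operand
#   FF     an 8-bit immediate, sign-extended by the processor to 32 bits
add_eax = bytes([0x83, 0xC0, 0xFF])
imm8 = int.from_bytes(add_eax[-1:], "little", signed=True)
imm32 = imm8 & 0xFFFFFFFF  # the reconstituted 32-bit constant
print(hex(imm32))  # 0xffffffff
```

The FF 00 at the end of the first encoding is the little-endian form of 0x00FF, which is most likely the "reversed" number seen in the listing.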


    Because the formulas come as a pair (decomposition/recomposition), reading a 16-bit location that was last written with a 16-bit value returns the original value.  Only when we view that location as individual bytes do we need to be concerned with endianness.
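A short Python sketch of that round trip, using a two-byte `bytearray` as a stand-in for the addresses 1000 and 1001 from the question, and `struct` to model a 16-bit little-endian store and load:

```python
import struct

memory = bytearray(2)  # memory[0] ~ address 1000, memory[1] ~ address 1001

struct.pack_into("<H", memory, 0, 0x00FF)      # 16-bit little-endian store
(word,) = struct.unpack_from("<H", memory, 0)  # 16-bit little-endian load

print(hex(word))             # 0xff
print(memory[0], memory[1])  # 255 0
```

The 16-bit load returns 0x00FF unchanged; the byte order (FF at the lower address) is visible only when the two bytes are inspected individually.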

    Unfortunately, the debugger dumps memory as individual bytes, exposing us to endianness when the data is multi-byte data.  There is no way to tell from a memory dump alone what kind of values are stored there, whether 16-bit or 8-bit values.  That information, however, is in the program and its machine code instructions, in the way they treat those locations (whether they use 16-bit memory accesses or 8-bit ones).
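As an illustration of that ambiguity, the following Python sketch takes one made-up four-byte dump and interprets it three ways on a little-endian basis. The dump contents are arbitrary example data.

```python
import struct

dump = bytes([0x34, 0x12, 0xFF, 0x00])  # what a debugger would show

as_bytes = struct.unpack("<4B", dump)    # four 8-bit values
as_words = struct.unpack("<2H", dump)    # two 16-bit values
(as_dword,) = struct.unpack("<I", dump)  # one 32-bit value

print([hex(b) for b in as_bytes])  # ['0x34', '0x12', '0xff', '0x0']
print([hex(w) for w in as_words])  # ['0x1234', '0xff']
print(hex(as_dword))               # 0xff1234
```

All three readings are equally valid for the same bytes; only the instructions that access the location say which one the program means.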


    When the program consistently uses the same memory the same way, it will get the expected values.  But there are lots of opportunities for logic errors in assembly programs.  Such errors include using the wrong size, using the wrong sign, and failing to initialize.  Higher level languages have types that prevent the first two, and good languages also have features to detect uninitialized variables.  But in machine code, every single instruction has to repeat the relevant treatment of physical storage to achieve consistency.
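The wrong-size and wrong-sign errors can be mimicked in Python with `struct`, as a sketch. The value 0xFF80 is an arbitrary example, and the format codes stand in for the access sizes an instruction would choose.

```python
import struct

memory = bytearray(2)
struct.pack_into("<H", memory, 0, 0xFF80)  # written as an unsigned 16-bit value

(correct,) = struct.unpack_from("<H", memory, 0)     # same size, same sign
(wrong_sign,) = struct.unpack_from("<h", memory, 0)  # same size, read as signed
(wrong_size,) = struct.unpack_from("<B", memory, 0)  # 8-bit read: low byte only

print(correct, wrong_sign, wrong_size)  # 65408 -128 128
```

None of the reads fails; the inconsistent ones simply produce a different number, which is what makes this class of bug easy to miss in assembly.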

    (To be clear, it is not always an error to view a 16-bit value as a sequence of bytes; sometimes that is necessary, e.g. when storing a number in a file, or inside an assembler/compiler.)