string assembly x86 masm masm32

When using the MOV mnemonic to load/copy a string to a memory register in MASM, are the characters stored in reverse order?


I want to know if using the MOV instruction to copy a string into a register causes the string to be stored in reverse order. I learned that when MASM stores a string into a variable defined as a word or higher (dw and larger sizes) the string is stored in reverse order. Does the same thing happen when I copy a string to a register?

Based on these questions (about the SCAS instruction and about assigning strings and characters to variables in MASM 32), I assumed the following:

  1. When MASM loads a string into a variable, it loads it in reverse order, i.e. the last character in the string is stored in the lowest memory address (beginning) of the string variable. This means assigning a variable str like so: str dd "abc" causes MASM to store the string as "cba", meaning "c" is in the lowest memory address.
  2. When defining a variable as str db "abc" MASM treats str as an array of characters. Trying to match the array index with the memory address of str, MASM will store "a" at the lowest memory address of str.
  3. By default, the SCAS and MOVS instructions execute from the beginning (lowest) address of the destination string, i.e. the string stored in the EDI register. They do not "pop" or apply the "last in, first out" rule to the memory addresses they operate on before executing.
  4. MASM always treats character arrays and strings the same way when they are moved to registers. Moving the character array 'a', 'b', 'c' to EAX is the same as moving "abc" to EAX.

When I transfer a byte array arLetters with the characters 'a', 'b', and 'c' to the double-word variable strLetters using MOVSD, I believe the letters are copied to strLetters in reverse, i.e. stored as "cba". When I use mov eax, "abc", are the letters also stored in reverse order?

The code below will set the zero flag before it exits.

.data?
strLetters dd ?,0

.data
arLetters db "abcd"

.code

start:
mov ecx, 4
lea esi, arLetters
lea edi, strLetters
movsd
;This stores the string "dcba" into strLetters.

mov ecx, 4
lea edi, strLetters
mov eax, "dcba" 
repnz scasd
jz close
jmp printer
;strLetters is not popped as "abcd" and is compared as "dcba".

printer:
print "No match.",13,10,0
jmp close

close:
push 0
call ExitProcess

end start

I expect the string "dcba" to be stored in EAX "as is" - with 'd' in the lowest memory address of EAX - since MASM treats moving strings to registers differently from assigning strings to variables. MASM copied 'a', 'b', 'c', 'd' into strLetters as "dcba" to ensure that if strLetters was popped, the string is emitted/released in the correct order ("abcd"). If the REP MOVSB instruction were used in place of MOVSD, strLetters would have contained "abcd" and would be popped/emitted as "dcba". However, because MOVSD was used and the SCAS and MOVS instructions do not pop strings before executing, the code above should set the zero flag, right?


Solution

  • Don't use strings in contexts where MASM expects a 16-bit or larger integer. MASM will convert them to integers in a way that reverses the order of characters when stored in memory. Since this is confusing, it's best to avoid it and only use strings with the DB directive, which works as expected. Don't use strings with more than one character as immediate values.

    Memory has a byte order, registers don't

    Registers don't have addresses, and it's meaningless to talk about the order of bytes within a register. On a 32-bit x86 CPU, the general purpose registers like EAX hold 32-bit integer values. You can divide a 32-bit value conceptually into 4 bytes, but while it lives in a register there is no meaningful order to the bytes.

    It's only when 32-bit values exist in memory that the 4 bytes that make them up have addresses and so have an order. Since x86 CPUs use the little-endian byte order, the least-significant byte of the 4 bytes is the first byte (at the lowest address) and the most-significant byte is the last. Whenever the x86 loads or stores a 16-bit or wider value to or from memory, it uses the little-endian byte order. (An exception is the MOVBE instruction, which specifically uses the big-endian byte order when loading and storing values.)
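    As a rough illustration of that rule, the Python sketch below models a 32-bit store and load. The helper names store_dword and load_dword are made up for this example; they only mimic the byte order the CPU uses and are not part of any assembler or library.

```python
def store_dword(value: int) -> bytes:
    """Model a 32-bit store: the least-significant byte lands at the lowest address."""
    return value.to_bytes(4, byteorder="little")


def load_dword(memory: bytes) -> int:
    """Model a 32-bit load: the byte at the lowest address becomes the low byte."""
    return int.from_bytes(memory[:4], byteorder="little")


value = 0x11223344
memory = store_dword(value)

# Offset 0 holds the least-significant byte, offset 3 the most-significant one.
print([f"{b:02X}" for b in memory])   # ['44', '33', '22', '11']
print(hex(load_dword(memory)))        # 0x11223344
```

    The value in the "register" (the Python int) has no byte order of its own; the order only appears once it is turned into bytes.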

    So consider this program:

        .MODEL flat
    
        .DATA
    db_str  DB  "abcd"
    dd_str  DD  "abcd"
    num DD  1684234849
    
        .CODE
    _start: 
        mov eax, "abcd"
        mov ebx, DWORD PTR [db_str]
        mov ecx, DWORD PTR [dd_str]
        mov edx, 1684234849
        mov esi, [num]
        int 3
    
        END _start
    

    After assembling and linking, it gets converted into a sequence of bytes something like this:

    .text section:
      00401000: B8 64 63 62 61 8B 1D 00 30 40 00 8B 0D 04 30 40  ¸dcba...0@....0@
      00401010: 00 BA 61 62 63 64 8B 35 08 30 40 00 CC           .ºabcd.5.0@.Ì
      ...
    .data section:
      00403000: 61 62 63 64 64 63 62 61 61 62 63 64              abcddcbaabcd
    

    (On Windows the .data section normally gets placed after the .text section in memory.)

    DB and DD treat strings differently

    So we can see that the DB and DD directives, the ones labelled db_str and dd_str, generate two different sequences of bytes for the same string "abcd". In the first case, MASM generates the sequence of bytes that we would expect: 61h, 62h, 63h, and 64h, the ASCII values for a, b, c, and d respectively. For dd_str, though, the sequence of bytes is reversed. This is because the DD directive takes 32-bit integers as operands, so the string has to be converted to a 32-bit value, and MASM ends up reversing the order of characters in the string when the result of the conversion gets stored in memory.
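    The difference can be sketched in Python. This is only a model inferred from the dump above, not a description of MASM's internals: it assumes that MASM turns a string into an integer with the first character as the most-significant byte, and that DD then emits that integer in little-endian order. The function names are hypothetical.

```python
def db_bytes(text: str) -> bytes:
    """DB: one byte per character, in source order."""
    return text.encode("ascii")


def string_to_dword(text: str) -> int:
    """Assumed MASM conversion: first character becomes the most-significant byte."""
    return int.from_bytes(text.encode("ascii"), byteorder="big")


def dd_bytes(text: str) -> bytes:
    """DD: convert the string to a 32-bit integer, then emit it little-endian."""
    return string_to_dword(text).to_bytes(4, byteorder="little")


print(db_bytes("abcd").hex())        # 61626364
print(hex(string_to_dword("abcd")))  # 0x61626364
print(dd_bytes("abcd").hex())        # 64636261
```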

    In memory, strings and numbers are both just bytes

    You'll also notice that the DD directive labelled num generated the same sequence of bytes as the DB directive did. Indeed, without looking at the source there's no way to tell that the first four bytes are supposed to be a string while the last four bytes are supposed to be a number. They only become strings or numbers if the program uses them that way.

    (Less obvious is how the decimal value 1684234849 was converted into the same sequence of bytes as generated by the DB directive. It's already a 32-bit value; it just needs to be converted into a sequence of bytes by MASM. Unsurprisingly, the assembler does so using the same little-endian byte order that the CPU uses. That means the first byte is the least significant part of 1684234849, which happens to have the same value as the ASCII letter a (1684234849 % 256 = 97 = 61h). The last byte is the most significant part of the number, which happens to be the ASCII value of d (1684234849 / 256 / 256 / 256 = 100 = 64h, using integer division).)
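    The arithmetic in that parenthetical can be checked by peeling off one byte at a time with division and remainder. The sketch below is just that calculation written out in Python; little_endian_bytes is a name invented for the example.

```python
def little_endian_bytes(value: int, size: int = 4) -> list[int]:
    """Split a value into bytes, least-significant first, using only / and %."""
    result = []
    for _ in range(size):
        value, low = divmod(value, 256)  # low = value % 256, value = value // 256
        result.append(low)
    return result


num = 1684234849
parts = little_endian_bytes(num)

print(hex(num))                       # 0x64636261
print(parts)                          # [97, 98, 99, 100]
print(bytes(parts).decode("ascii"))   # abcd
```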

    Immediates treat strings like DD does

    Looking at the values in the .text section more closely with a disassembler, we can see how the sequence of bytes stored there will be interpreted as instructions when executed by the CPU:

      00401000: B8 64 63 62 61     mov         eax,61626364h
      00401005: 8B 1D 00 30 40 00  mov         ebx,dword ptr ds:[00403000h]
      0040100B: 8B 0D 04 30 40 00  mov         ecx,dword ptr ds:[00403004h]
      00401011: BA 61 62 63 64     mov         edx,64636261h
      00401016: 8B 35 08 30 40 00  mov         esi,dword ptr ds:[00403008h]
      0040101C: CC                 int         3
    

    What we can see here is that MASM stored the bytes that make up the immediate value in the instruction mov eax, "abcd" in the same order it did with the dd_str DD directive. The first byte of the immediate part of the instruction in memory is 64h, the ASCII value of d. The reason is that with a 32-bit destination register this MOV instruction uses a 32-bit immediate. That means that MASM needs to convert the string to a 32-bit integer, and it ends up reversing the order of bytes as it did with dd_str. MASM also handles the decimal number given as the immediate in mov edx, 1684234849 the same way it did with the DD directive that used the same number. The 32-bit value was converted to the same little-endian representation.

    In memory, instructions are also just bytes

    You'll also notice that the disassembler generated assembly instructions that use hexadecimal values for the immediates of these two instructions. Like the CPU, the disassembler has no way of knowing that the immediate values are supposed to be a string and a decimal number. They're just a sequence of bytes in the program. All it knows is that they're 32-bit immediate values (from the opcodes B8h and BAh), and so it displays them as 32-bit hexadecimal values for lack of any better alternative.
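    To make that concrete, here is a toy decoder for just this one instruction form, MOV r32, imm32 (opcodes B8h through BFh, where the low three bits select the register). It is a sketch that assumes 32-bit code with no prefixes, and decode_mov_imm32 is a hypothetical name, not a real disassembler API.

```python
# Register numbering used by the B8h+r encoding of MOV r32, imm32.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_mov_imm32(code: bytes) -> str:
    """Decode a 5-byte MOV r32, imm32 instruction into disassembler-style text."""
    if len(code) < 5 or not 0xB8 <= code[0] <= 0xBF:
        raise ValueError("not a MOV r32, imm32 encoding")
    register = REG32[code[0] - 0xB8]
    # The four immediate bytes are read as a little-endian 32-bit integer.
    immediate = int.from_bytes(code[1:5], byteorder="little")
    return f"mov {register},{immediate:08X}h"


print(decode_mov_imm32(bytes.fromhex("B8 64 63 62 61")))  # mov eax,61626364h
print(decode_mov_imm32(bytes.fromhex("BA 61 62 63 64")))  # mov edx,64636261h
```

    Nothing in the bytes says "string" or "decimal", so the decoder can only print a 32-bit number.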

    Values in registers reflect memory order

    By executing the program under a debugger and inspecting the registers after it reaches the breakpoint instruction (int 3) we can see what actually ended up in the registers:

    eax=61626364 ebx=64636261 ecx=61626364 edx=64636261 esi=64636261 edi=00000000
    eip=0040101c esp=0018ff8c ebp=0018ff94 iopl=0         nv up ei pl zr na pe nc
    cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000246
    image00000000_00400000+0x101c:
    0040101c cc              int     3
    

    Now we can see that the first and third instructions loaded a different value than the other instructions. These two instructions both involve cases where MASM converted the string to a 32-bit value and ended up reversing the order of the characters in memory. The register dump confirms that the reversed order of bytes in memory results in different values being loaded into the registers.

    But really, registers don't have a byte order

    Now you might be looking at that register dump above and thinking that only EAX and ECX are in the correct order, with the ASCII value for a, 61h, first and the ASCII value for d, 64h, last, and that MASM reversing the order of the strings in memory actually caused them to be loaded into registers in the correct order. But as I said before, there's no byte order in registers. The number 61626364 is just how the debugger represents the value when displaying it as a sequence of characters you can read. The characters 61 come first in the debugger's representation because our numbering system puts the most significant part of the number on the left, and we read left-to-right, so that makes it the first part. However, as I also said before, x86 CPUs are little-endian, which means the least significant part comes first in memory. That means the first byte in memory becomes the least significant part of the value in the register, which gets displayed as the rightmost two hexadecimal digits of the number by the debugger because that's where the least significant part of the number goes in our numbering system.

    In other words, because x86 CPUs are little-endian (least significant first) but our numbering system is big-endian (most significant first), hexadecimal numbers get displayed in a byte-wise reverse order to how they're actually stored in memory.
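    The three memory loads in the sample program can be replayed in Python from the .data bytes in the dump. This is a sketch: load_dword_at is a made-up helper, and the offsets 0, 4 and 8 correspond to db_str, dd_str and num as laid out in that dump.

```python
# The 12 bytes of the .data section, exactly as listed in the dump.
data = bytes.fromhex("61 62 63 64 64 63 62 61 61 62 63 64")


def load_dword_at(memory: bytes, offset: int) -> int:
    """Model a 32-bit little-endian load from memory[offset .. offset+3]."""
    return int.from_bytes(memory[offset:offset + 4], byteorder="little")


registers = {
    "ebx": load_dword_at(data, 0),  # mov ebx, DWORD PTR [db_str]
    "ecx": load_dword_at(data, 4),  # mov ecx, DWORD PTR [dd_str]
    "esi": load_dword_at(data, 8),  # mov esi, [num]
}

# Print the way a debugger would: most-significant hex digits on the left.
for name, value in registers.items():
    print(f"{name}={value:08x}")
# ebx=64636261
# ecx=61626364
# esi=64636261
```

    The printed digits run in the opposite direction to the bytes in memory, which matches the register dump.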

    Simply copying "strings" won't change their order

    It should hopefully also be clear by now that loading a string into a register is only something that happens conceptually. The string gets converted into a sequence of bytes by the assembler, which, when loaded into a 32-bit register, gets treated as a little-endian 32-bit integer in memory. When the 32-bit value in the register is stored to memory, the 32-bit value is converted into a sequence of bytes that represent the value in little-endian format. To the CPU your string is just a 32-bit integer it loaded and stored to and from memory.

    So that means that if the value loaded into EAX in the sample program is stored to memory with something like mov [mem], eax, then the 4 bytes stored at mem will be in the same order as they appeared in the bytes that made up the immediate of mov eax, "abcd". That is the same reversed order, 64h, 63h, 62h, 61h, that MASM put them in when it generated the bytes that make up the immediate.

    But why? I dunno, just don't do that

    Now as to why MASM is reversing the order of strings when converting them to 32-bit integers, I don't know, but the moral here is not to use strings as immediates or in any other context where they need to be converted to integers. Assemblers are inconsistent in how they convert string literals into integers. (A similar problem occurs in how C compilers convert character literals like 'abcd' into integers.)

    SCASD and MOVSD aren't special

    Nothing special happens with the SCASD or MOVSD instructions. SCASD treats the four bytes pointed to by EDI as a 32-bit little-endian value, loads it into an unnamed temporary register, compares the temporary register to EAX, and then adds or subtracts 4 from EDI depending on the DF flag. MOVSD loads a 32-bit value in memory pointed to by ESI into an unnamed temporary register, stores the temporary register to the 32-bit memory location pointed to by EDI, and then updates ESI and EDI according to the DF flag. (Byte order doesn't matter for MOVSD as the bytes are never used as a 32-bit value, but the order isn't changed.)
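    Those two descriptions can be turned into a small Python model and applied to the program from the question. Everything here is a sketch under stated assumptions: the functions movsd and scasd only imitate the steps described above, the memory layout (arLetters at offset 0, strLetters at offset 4) is invented for the example, and the value of EAX assumes MASM converts "dcba" with the first character as the most-significant byte, as it did for "abcd" in the sample program.

```python
def movsd(memory: bytearray, esi: int, edi: int, df: int = 0) -> tuple[int, int]:
    """Copy 4 bytes from [esi] to [edi] unchanged, then step both pointers."""
    memory[edi:edi + 4] = memory[esi:esi + 4]
    step = -4 if df else 4
    return esi + step, edi + step


def scasd(memory: bytearray, eax: int, edi: int, df: int = 0) -> tuple[bool, int]:
    """Compare EAX with the little-endian dword at [edi]; return (ZF, new EDI)."""
    temp = int.from_bytes(memory[edi:edi + 4], byteorder="little")
    return eax == temp, edi + (-4 if df else 4)


# Hypothetical layout: arLetters ("abcd") at 0, strLetters (two dwords) at 4.
AR_LETTERS, STR_LETTERS = 0, 4
memory = bytearray(b"abcd" + bytes(8))

movsd(memory, AR_LETTERS, STR_LETTERS)

# mov eax, "dcba" under the assumed MASM string-to-integer conversion.
eax = int.from_bytes(b"dcba", byteorder="big")
zf, edi = scasd(memory, eax, STR_LETTERS)

print(bytes(memory[4:8]))  # b'abcd'
print(hex(eax))            # 0x64636261
print(zf)                  # True
```

    Under these assumptions MOVSD leaves the bytes as "abcd" rather than "dcba", and the first SCASD comparison still matches, so the zero flag ends up set as the question observed.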

    I wouldn't try to think of SCASD or MOVSD as FIFO or LIFO because ultimately that depends on how you use them. MOVSD can just as easily be used as part of an implementation of a FIFO queue as of a LIFO stack. (Compare this to PUSH and POP, which in theory could independently be used as part of an implementation of either a FIFO or LIFO data structure, but together can only be used to implement a LIFO stack.)