assembly x86 att addressing-mode array-indexing

Assembly: What is the purpose of movl data_items(,%edi,4), %eax in this program?


This program (from Jonathan Bartlett's Programming From the Ground Up) cycles through all the numbers stored in memory with .long and puts the largest number in the EBX register for viewing when the program completes.

.section .data
data_items:
    .long 3, 67, 34, 222, 45, 75, 54, 34, 44, 33, 22, 11, 66, 0

.section .text
.globl _start

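_start:
    movl $0, %edi                    # %edi = index of the current item
    movl data_items (,%edi,4), %eax  # load the first item
    movl %eax, %ebx                  # it is the largest seen so far
start_loop:
    cmpl $0, %eax                    # a 0 marks the end of the data
    je loop_exit
    incl %edi                        # advance to the next item
    movl data_items (,%edi,4), %eax  # load it
    cmpl %ebx, %eax                  # compare it with the current largest
    jle start_loop                   # not larger: keep going
    movl %eax, %ebx                  # larger: remember it
    jmp start_loop
loop_exit:
    movl $1, %eax                    # 1 = exit system call number
    int $0x80                        # %ebx (the largest item) is the exit status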

I'm not certain about the purpose of (,%edi,4) in this program. I've read that the commas are for separation, and that the 4 is for reminding our computer that each number in data_items is 4 bytes long. Since we've already declared that each number is 4 bytes with .long, why do we need to do it again here? Also, could someone explain in more detail what purpose the two commas serve in this situation?


Solution

  • In AT&T syntax, memory operands have the following form [1]:

    displacement(base_register, index_register, scale_factor)
    

    The base, index and displacement components can be used in any combination, and each of them can be omitted, but the commas must be retained if you omit the base register; otherwise the assembler could not tell which component you are leaving out.

    All this data gets combined to calculate the address you are specifying, with the following formula:

    effective_address = displacement + base_register + index_register*scale_factor
    

    (which incidentally is almost exactly how you would specify this in Intel syntax).
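As a rough illustration of that formula, here is a minimal C sketch. The function name effective_address is hypothetical (it is not part of any assembler or library); it only models the arithmetic a 32-bit x86 CPU performs, assuming an omitted displacement, base or index is passed as 0 and an omitted scale as 1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration only): models how a
   32-bit x86 CPU combines the parts of displacement(base, index, scale). */
uint32_t effective_address(uint32_t displacement, uint32_t base,
                           uint32_t index, uint32_t scale)
{
    /* The instruction encoding only allows scale factors of 1, 2, 4 or 8. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    /* Omitted components count as zero; the sum wraps modulo 2^32. */
    return displacement + base + index * scale;
}
```

For instance, if data_items happened to live at the made-up address 0x1000 and %edi held 3, effective_address(0x1000, 0, 3, 4) would give 0x100C, which is 12 bytes past the label.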

    So, armed with this knowledge we can decode your instruction:

    movl data_items (,%edi,4), %eax
    

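    Matching the syntax above, you see that:

      • data_items is the displacement;
      • the base register is omitted, so it contributes nothing to the sum;
      • %edi is the index register;
      • 4 is the scale factor.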

    So, you are telling the CPU to move a long from the location data_items+%edi*4 to the register %eax.

    The *4 is necessary because each element of your array is 4-bytes wide, so to transform the index (in %edi) to an offset (in bytes) from the start of the array you have to multiply it by 4.
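To make the index-to-offset conversion concrete, the following C sketch mimics the program's loop while doing the byte arithmetic by hand. It is only a model under stated assumptions: load_item and find_max are hypothetical names, and the local variables are named after the registers purely for readability.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Same values as the .long directive in the question. */
static const int32_t data_items[] = {3, 67, 34, 222, 45, 75, 54,
                                     34, 44, 33, 22, 11, 66, 0};

/* Models movl data_items(,%edi,4), %eax: start at the label's address,
   step forward edi*4 bytes, then load the 4 bytes found there. */
static int32_t load_item(uint32_t edi)
{
    const unsigned char *label = (const unsigned char *)data_items;
    int32_t eax;

    /* Guard for the sketch only; the real program has no such check. */
    assert(edi < sizeof data_items / sizeof data_items[0]);
    memcpy(&eax, label + edi * 4, sizeof eax);
    return eax;
}

/* Models the whole loop and returns what the program leaves in %ebx. */
int32_t find_max(void)
{
    uint32_t edi = 0;
    int32_t eax = load_item(edi);
    int32_t ebx = eax;

    while (eax != 0) {      /* cmpl $0, %eax / je loop_exit */
        edi++;              /* incl %edi */
        eax = load_item(edi);
        if (eax > ebx)      /* cmpl %ebx, %eax / jle start_loop */
            ebx = eax;
    }
    return ebx;
}
```

In ordinary C you would simply write data_items[edi], and the compiler would insert the multiplication by 4 because it knows the element type; the sketch spells it out to mirror what the assembly has to do explicitly.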

    "Since we've already declared that each number is 4 bytes with .long, why do we need to do it again here?"

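    Assemblers are low-level tools that know nothing about types. The .long directive is not an array declaration; it just tells the assembler to emit the 32-bit representation of each of its arguments. Likewise, data_items is just a label for the address where those bytes start, so the instruction itself has to say how far apart the elements are.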
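A small C sketch can show what goes wrong if the scale is left out. The helper load_at_byte_offset is a hypothetical name; it treats the data the way the CPU does, as plain bytes starting at a label, with no notion of element size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* First few values from the question's .long directive. */
static const int32_t data_items[] = {3, 67, 34, 222};

/* Hypothetical helper: load 4 bytes starting byte_offset bytes past the
   label, with no knowledge of where one element ends and the next begins. */
int32_t load_at_byte_offset(uint32_t byte_offset)
{
    int32_t value;

    /* Guard for the sketch only: stay inside the data. */
    assert(byte_offset + sizeof value <= sizeof data_items);
    memcpy(&value, (const unsigned char *)data_items + byte_offset,
           sizeof value);
    return value;
}
```

With an index of 1, load_at_byte_offset(1 * 4) returns 67, the second element. load_at_byte_offset(1), which corresponds to writing data_items(,%edi,1), instead reads 4 bytes that straddle the first two elements, so it returns some other value that depends on the machine's byte order.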


    Notes

    1. Technically, there is also a segment specifier, but given that we are talking about 32-bit code on Linux I'll omit segments entirely, as they would only add confusion.