assembly, optimization, x86, micro-optimization, addressing-mode

Is it more efficient to multiply within the address displacement or outside it?


I'm writing an x86 assembly routine that takes as arguments:

  1. At [ESP+4]: The number of 32-bit integers following.
  2. Starting at [ESP+8]: A list of 32-bit integers to add up.

And it returns the sum of all the integers passed starting at [ESP+8]. So, basically, a C prototype of the function would be:

int addN(int numberofitems, ...);

I have the option of writing this x86 assembly routine in two ways:
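
Method 1 multiplies within the address displacement (a sketch of the idea rather than my exact listing):

    addN PROC
        mov ecx, dword ptr [esp+4]          ; number of integers
        xor eax, eax                        ; running sum
        xor edx, edx                        ; element index
    AdderLoop:
        add eax, dword ptr [esp+8+edx*4]    ; index scaled by 4 inside the addressing mode
        inc edx
        dec ecx
        jnz AdderLoop
        ret
    addN ENDP

Method 2 multiplies outside it, stepping a byte offset instead (again a sketch):

    addN PROC
        mov ecx, dword ptr [esp+4]          ; number of integers
        xor eax, eax                        ; running sum
        mov edx, 8                          ; byte offset of the first integer
    AdderLoop:
        add eax, dword ptr [esp+edx]        ; offset is already in bytes
        add edx, 4                          ; step by 4 bytes instead of scaling
        dec ecx
        jnz AdderLoop
        ret
    addN ENDP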

Is there any advantage, performance-wise or otherwise, to preferring one way over the other?


Solution

  • With modern CPUs it is very tricky to theorize about the performance of a particular piece of source code, but I will burn myself anyway.

    Especially as I haven't kept up with ASM performance coding in the last decade, most of my comments are based on tiny glimpses of things here and there rather than any thorough knowledge and experience.

    Step zero: figure out how you will profile your code. Without real-world profiling you will get nowhere, as everything I describe next can make the result either faster or slower, obviously on different target CPUs, but even on the same target machine, depending on how the rest of the executable lands around it, i.e. how the alignment turns out for the other functions and how the cache covers the other functions' code.

    First thing: use an align directive at the start of the loop (or align the procedure entry so that the first instruction of the loop ends up aligned). How much? 16 seems to generally speed things up on most current CPUs. This can have a real effect on performance, but not only a positive one, so using it only for frequently branched-to addresses is recommended.
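
    In MASM-like syntax that would look roughly like this (a sketch; NASM and GAS have equivalent align directives, and the +8 displacement comes from your argument layout):

        align 16                    ; pad with NOPs so the loop entry lands on a 16-byte boundary
    AdderLoop:
        add eax, dword ptr [esp+8+edx*4]
        inc edx
        dec ecx
        jnz AdderLoop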

    Subtleties:

    Let's test a few variants and see how they assemble into machine code:

    0:  8b 04 94                mov    eax,DWORD PTR [esp+edx*4]
    3:  8b 04 14                mov    eax,DWORD PTR [esp+edx*1]
    6:  8b 04 24                mov    eax,DWORD PTR [esp]
    9:  8b 44 95 00             mov    eax,DWORD PTR [ebp+edx*4+0x0]
    d:  8b 44 15 00             mov    eax,DWORD PTR [ebp+edx*1+0x0]
    11: 8b 45 00                mov    eax,DWORD PTR [ebp+0x0] 
    

    As you can see, the *4 and *1 variants have the same instruction length, and the performance will be equal, so you don't have to worry about that *4 in the addressing.

    So use whichever mode makes the rest of the code shorter/faster. inc edx is a 1-byte opcode, add edx,4 is 3 bytes long, so I would go for the first one, simply because in a complex executable the shorter machine code fits into the cache better; and there shouldn't be any performance difference between inc and add on modern x86, when considered in isolation from the rest of the code. When considered not in isolation: inc was evil on the Intel Pentium 4 CPUs a few years back (it updates only part of the flags register, which caused stalls there), but recent generations are OK again, so it should be as fast as add.

    (Now I did notice you use add eax,..., so here are the different addressing variants for that one once more, together with the inc/add pair):

    0:  42                      inc    edx
    1:  83 c2 04                add    edx,0x4
    4:  03 04 94                add    eax,DWORD PTR [esp+edx*4]
    7:  03 44 95 00             add    eax,DWORD PTR [ebp+edx*4+0x0]
    b:  03 04 14                add    eax,DWORD PTR [esp+edx*1]
    e:  03 44 15 00             add    eax,DWORD PTR [ebp+edx*1+0x0]
    12: 03 45 00                add    eax,DWORD PTR [ebp+0x0] 
    

    Now, I thought I had seen something about addressing through esp needing an additional byte. It does: esp as a base register requires a SIB byte, but that byte is already present in every variant above that uses an index register, so nothing extra shows up here. That's why I tested the ebp variants too, to get rid of esp. But as the esp forms have the shorter machine code (ebp as a base forces a +0x0 displacement byte), I would keep esp just as you are using it now.


    On some older CPUs it would probably be faster to interleave the dependent instructions:

    AdderLoop:
        add eax, dword ptr [esp+edx*4]
        dec ecx                     ; sets ZF when the counter reaches zero
        lea edx, [edx+1]            ; lea doesn't modify flags, so the ZF from dec survives
        jnz AdderLoop
    

    But the latest architectures use something called "macro-fusion" of instructions, and dec + jnz pairs should now be kept adjacent, so they can be fused into a single µop.
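
    So on recent CPUs the straightforward ordering is the safer default; a sketch of the same loop with the pair kept together:

    AdderLoop:
        add eax, dword ptr [esp+edx*4]
        inc edx
        dec ecx                     ; dec + jnz adjacent, eligible for macro-fusion
        jnz AdderLoop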


    And if you know the number of arguments will usually be considerably big (unlikely, as the sum would overflow the result value), you can of course unroll the loop for a few iterations (4, 8 or 16; I wouldn't go higher, due to the cache pollution from the larger code).
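
    For example, a minimal sketch unrolled by 4 (it assumes the count in ecx is a multiple of 4; real code would need a tail loop for the remaining 0 to 3 elements):

    AdderLoop:
        add eax, dword ptr [esp+8+edx*4]     ; element edx+0
        add eax, dword ptr [esp+12+edx*4]    ; element edx+1
        add eax, dword ptr [esp+16+edx*4]    ; element edx+2
        add eax, dword ptr [esp+20+edx*4]    ; element edx+3
        add edx, 4                           ; advance the index by four elements
        sub ecx, 4                           ; sub + jnz can also macro-fuse
        jnz AdderLoop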


    Then again, if the number of arguments were considerably high, you would probably spend most of the loop waiting for memory to load the values.

    At that point any of the code variants above would end up with the same performance, as a memory cache miss costs tens to hundreds of instructions, not the single-instruction nuance of an addressing mode.

    Did I warn you that it's very tricky? I did, in the first sentence.


    Conclusion:

    Don't bother with this, you are wasting your time.

    Simply write the simplest, most readable source that is correct (in your particular case I prefer the *4 variant, with its "index"-like source).

    Once you have your application done, profile it.

    Fix real bottlenecks.