I'm writing an x86 assembly routine that takes as arguments:
[ESP+4]: The number of 32-bit integers following.
[ESP+8]: A list of 32-bit integers to add up.

It returns the sum of all the integers passed starting at [ESP+8]. So, basically, a C prototype of the function would be:
int addN(int numberofitems, ...);
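For reference, the same contract in plain C (a sketch using stdarg varargs; this is just the behavior the assembly routine is meant to match, not the routine itself):

```c
#include <stdarg.h>

/* Reference implementation: first argument is the count,
   followed by that many 32-bit ints; returns their sum. */
int addN(int numberofitems, ...)
{
    va_list ap;
    int sum = 0;

    va_start(ap, numberofitems);
    for (int i = 0; i < numberofitems; i++)
        sum += va_arg(ap, int);
    va_end(ap);

    return sum;
}
```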
I have the option of writing this x86 assembly routine in two ways:
First way (multiplication by item size is within the address displacement):
addN PROC
mov ecx, dword ptr [esp+4] ; Dec ecx per argument processed
mov edx, 2 ; Offset into variable length argument list
xor eax, eax ; Accumulator
AdderLoop:
add eax, dword ptr [esp+edx*4]
inc edx
dec ecx
jnz AdderLoop
ret
addN ENDP
Second way (size of item is added to edx itself):
addN PROC
mov ecx, dword ptr [esp+4] ; Dec ecx per argument processed
mov edx, 8 ; Offset into variable length argument list
xor eax, eax ; Accumulator
AdderLoop:
add eax, dword ptr [esp+edx]
add edx, 4
dec ecx
jnz AdderLoop
ret
addN ENDP
Is there any advantage, performance-wise or otherwise, to preferring one way over the other?
With modern CPUs it is very tricky to theorize about the performance of a particular piece of source. But I will burn myself anyway.
Especially as I haven't bothered to learn anything about performance coding in assembly in the last decade, most of my comments are based on small glimpses of things here and there, not on any thorough knowledge and experience.
Step zero: figure out how you will profile your code. Without real-world profiling you will get nowhere, because everything I describe next can make the result either faster or slower, not only on different target CPUs but even on the same target machine: it depends on how the rest of the executable lands around it, and therefore on how the alignment turns out for other functions and how well the cache covers the other functions' code.
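Even a crude wall-clock harness beats theorizing. A minimal sketch in C (the addN3 stand-in and the iteration count are arbitrary placeholders for whatever routine you actually measure):

```c
#include <time.h>

/* Stand-in for the routine under test. */
static int addN3(int a, int b, int c) { return a + b + c; }

/* Call the routine `iters` times and return elapsed CPU seconds. */
double time_sum(long iters)
{
    volatile int sink = 0;   /* volatile keeps the calls from being optimized away */
    clock_t t0 = clock();
    for (long i = 0; i < iters; i++)
        sink += addN3(1, 2, 3);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

For serious measurement you would pin the CPU frequency, run many repetitions, and take the minimum, but even this rough shape will expose order-of-magnitude differences.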
First thing: use an align directive at the start of the loop (or align the procedure start in such a way that the first instruction of the loop is aligned). How much? 16 looks like it will generally speed things up on most current CPUs. This may have a real effect on performance, but not only a positive one, so using it only for frequently branched-to addresses is recommended.
Subtleties:
Let's test a few variants and see how they compile into machine code:
0: 8b 04 94 mov eax,DWORD PTR [esp+edx*4]
3: 8b 04 14 mov eax,DWORD PTR [esp+edx*1]
6: 8b 04 24 mov eax,DWORD PTR [esp]
9: 8b 44 95 00 mov eax,DWORD PTR [ebp+edx*4+0x0]
d: 8b 44 15 00 mov eax,DWORD PTR [ebp+edx*1+0x0]
11: 8b 45 00 mov eax,DWORD PTR [ebp+0x0]
As you can see, the *4 and *1 variants have the same instruction length, and the performance will be equal, so you don't have to worry about the *4 in the addressing.
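The two loops correspond to the classic C idioms of indexed access versus pointer bumping; compilers typically emit exactly these two addressing patterns for them (the sum_* names are mine, for illustration):

```c
/* Indexed form: tends to compile to the
   add eax, [base+edx*4] / inc edx shape. */
int sum_indexed(const int *v, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += v[i];
    return sum;
}

/* Pointer-bump form: tends to compile to the
   add eax, [edx] / add edx, 4 shape. */
int sum_bumped(const int *v, int n)
{
    int sum = 0;
    for (const int *end = v + n; v < end; v++)
        sum += *v;
    return sum;
}
```

Both return the same result; the difference is purely in how the address of the next element is formed.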
So use whichever mode makes the rest of the code shorter/faster. inc edx is a 1-byte opcode, while add edx,4 is 3 bytes, so I would go for the first one, simply because in a complex executable the shorter machine code will fit into the cache better; and there shouldn't be any performance difference on modern x86 between inc and add when considered in isolation from the rest of the code. When not considered in isolation: inc was evil on the Intel Pentium 4 CPUs a few years back (it updates only part of the flags register), but recent generations are OK again, so it should be as fast as add.
(Now I did notice you use add eax,..., so one more time, the different variants of addressing for that one:)
0: 42 inc edx
1: 83 c2 04 add edx,0x4
4: 03 04 94 add eax,DWORD PTR [esp+edx*4]
7: 03 44 95 00 add eax,DWORD PTR [ebp+edx*4+0x0]
b: 03 04 14 add eax,DWORD PTR [esp+edx*1]
e: 03 44 15 00 add eax,DWORD PTR [ebp+edx*1+0x0]
12: 03 45 00 add eax,DWORD PTR [ebp+0x0]
Now, I thought I had seen something about addressing through esp needing an additional byte, but I don't see it here, so maybe that was in 16-bit code? (The extra byte with esp as a base is the SIB byte, but since an index register is used here a SIB byte is required anyway, so it costs nothing extra.) That's why I tested the ebp variants too, to get rid of esp. But as the esp forms have shorter machine code (ebp as a base enforces a +0x0 displacement byte), I would keep it just as you are using it now.
On some older CPUs it would probably be faster to interleave the dependent instructions:
AdderLoop:
add eax, dword ptr [esp+edx*4]
dec ecx
lea edx, [edx+1] ; lea instead of inc: it leaves the flags untouched, so jnz still tests dec's result
jnz AdderLoop
But the latest architectures use something called "macro-fusion" of instructions, and dec + jnz pairs should now be kept adjacent so they can fuse into a single micro-op.
And if you know the number of arguments will usually be considerably big (unlikely, as that would overflow the result value), you can of course unroll the loop for a few iterations (4, 8 or 16; I wouldn't go higher due to cache pollution by the larger code).
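A 4-way unroll would look roughly like the sketch below (in C for brevity; the same shape applies to the assembly loop, with a scalar tail handling counts that are not a multiple of 4):

```c
/* Sum n ints, 4 per main-loop iteration, with a scalar tail.
   Four separate accumulators let the adds run independently
   instead of forming one long dependency chain. */
int sum_unrolled4(const int *v, int n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;

    for (; i + 4 <= n; i += 4) {   /* main body: 4 adds per iteration */
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }

    int sum = s0 + s1 + s2 + s3;
    for (; i < n; i++)             /* tail: remaining 0..3 elements */
        sum += v[i];
    return sum;
}
```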
Then again, if the number of arguments were considerably high, you would probably spend most of the loop waiting for memory to load the values.
At that point any of the code variants above would end up with the same performance, as a memory cache miss costs tens to hundreds of instructions, dwarfing a single-instruction nuance in the addressing mode.
Did I warn you, that it's very tricky? I did, in the first sentence.
Don't bother with this, you are wasting your time.
Simply write the simplest, most readable source which is correct (in your particular case I prefer the *4 variant with the "index"-like source).
Once you have your application done, profile it.
Fix real bottlenecks.