I need to profile an application which performs a lot of array copies, so I ended up profiling this very simple function:
typedef unsigned char UChar;

void copy_mem(UChar *src, UChar *dst, unsigned int len) {
    UChar *end = src + len;
    while (src < end)
        *dst++ = *src++;
}
I'm using Intel VTune to do the actual profiling, and from there I've seen that there are dramatic differences between the version compiled with gcc -O3 and the one compiled with "plain" gcc (4.4).
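For anyone who wants to reproduce the numbers, a minimal driver like the one below is enough; the buffer size and iteration count are arbitrary values I picked, not anything meaningful. I build it once without flags and once with -O3 (and pass -S instead to dump the assembly listings).

#include <stdlib.h>

typedef unsigned char UChar;

void copy_mem(UChar *src, UChar *dst, unsigned int len) {
    UChar *end = src + len;
    while (src < end)
        *dst++ = *src++;
}

int main(void) {
    unsigned int n = 1u << 20;   /* arbitrary 1 MiB buffers */
    int i;
    UChar *src = malloc(n);
    UChar *dst = malloc(n);
    if (!src || !dst)
        return 1;
    /* Repeat the copy so the profiler gets enough samples on copy_mem itself. */
    for (i = 0; i < 10000; ++i)
        copy_mem(src, dst, n);
    free(src);
    free(dst);
    return 0;
}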
To understand the why and how, I grabbed the assembly output of both compilations.
The unoptimized version is this one:
.L3:
        movl    8(%ebp), %eax
        movzbl  (%eax), %edx
        movl    12(%ebp), %eax
        movb    %dl, (%eax)
        addl    $1, 12(%ebp)
        addl    $1, 8(%ebp)
.L2:
        movl    8(%ebp), %eax
        cmpl    -4(%ebp), %eax
        jb      .L3
        leave
So I see that it loads the src pointer from the stack, zero-extends the byte at *src into %edx (movzbl), loads the dst pointer, stores the byte into *dst, and then increments both pointers in their stack slots: simple enough.
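Translating that back into C-like code (my own reading of the listing, so I may be off), the unoptimized loop keeps src, dst and end in stack slots and reloads them on every access instead of holding them in registers:

typedef unsigned char UChar;

/* My reading of what the -O0 loop does; the comments map each statement
   to the instructions I see in the listing. */
void copy_mem_o0_reading(UChar *src, UChar *dst, unsigned int len) {
    UChar *end = src + len;        /* kept at -4(%ebp)                        */
    while (src < end) {            /* movl 8(%ebp),%eax ; cmpl -4(%ebp),%eax ; jb */
        UChar b = *src;            /* movl 8(%ebp),%eax ; movzbl (%eax),%edx  */
        *dst = b;                  /* movl 12(%ebp),%eax ; movb %dl,(%eax)    */
        dst = dst + 1;             /* addl $1, 12(%ebp)                       */
        src = src + 1;             /* addl $1, 8(%ebp)                        */
    }
}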
Then I saw the optimized version, and I couldn't understand it at all.
EDIT: here is the optimized assembly.
My question, therefore, is: what kinds of optimizations can gcc apply to this function?
That optimized code is quite a mess, but I can spot 3 loops (near .L6, .L13 and .L12). I think gcc does what @GJ suggested (I upvoted him). The loop near .L6 moves 4 bytes at a time, while loop #2 moves only one byte and runs only sometimes after loop #1. I still can't figure out loop #3, since it looks identical to loop #2.
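To check whether I understand @GJ's idea, this is the rough C equivalent I have in mind for what -O3 might be generating; the names and structure are my guess, not what gcc literally emits. If it looks like this, one byte loop would be the epilogue for the leftover bytes after the word copy, and another, identical-looking byte loop would be the fallback used when the two pointers can't be brought to the same alignment.

#include <stdint.h>
#include <string.h>

typedef unsigned char UChar;

/* My guess at the structure of the -O3 code: NOT the emitted code,
   just a hand-written equivalent to check my understanding. */
void copy_mem_guess(UChar *src, UChar *dst, unsigned int len) {
    UChar *end = src + len;

    /* Fallback: if src and dst can't reach the same 4-byte alignment,
       a plain byte loop is the only option (maybe loop #3?). */
    if (((uintptr_t)src & 3) != ((uintptr_t)dst & 3)) {
        while (src < end)
            *dst++ = *src++;
        return;
    }

    /* Prologue: copy single bytes until src (and therefore dst) is aligned. */
    while (src < end && ((uintptr_t)src & 3) != 0)
        *dst++ = *src++;

    /* Main loop: move 4 bytes (one 32-bit word) per iteration. */
    while (end - src >= 4) {
        uint32_t w;
        memcpy(&w, src, 4);   /* one word load            */
        memcpy(dst, &w, 4);   /* one word store           */
        src += 4;
        dst += 4;
    }

    /* Epilogue: the 0-3 leftover bytes go through a byte loop again
       (this would be loop #2, which runs only sometimes after loop #1). */
    while (src < end)
        *dst++ = *src++;
}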