c, assembly, optimization

gcc optimizations while copying an array


I need to profile an application which performs a lot of array copies, so I ended up profiling this very simple function:

typedef unsigned char UChar;
void copy_mem(UChar *src, UChar *dst, unsigned int len) {
        UChar *end = src + len;
        while (src < end)
                *dst++ = *src++;
}

I'm using Intel VTune for the actual profiling, and it shows dramatic differences between compiling with gcc -O3 and with "plain" gcc, i.e. no optimization flags (gcc 4.4).

To understand the why and how, I generated the assembly output of both compilations.

The unoptimized version is this one:

.L3:
        movl    8(%ebp), %eax
        movzbl  (%eax), %edx
        movl    12(%ebp), %eax
        movb    %dl, (%eax)
        addl    $1, 12(%ebp)
        addl    $1, 8(%ebp)
.L2:
        movl    8(%ebp), %eax
        cmpl    -4(%ebp), %eax
        jb      .L3
        leave

So I see that it first loads the src pointer from the stack, reads one byte from *src zero-extended into edx (movzbl), then stores the low byte (dl) into *dst and increments both pointers in memory: simple enough.

Then I saw the optimized version, and I didn't understand anything.

EDIT: here is the optimized assembly.

My question therefore is: what kind of optimizations can gcc do in this function?


Solution

  • That optimized code is quite a mess, but I can spot 3 loops (near labels L6, L13 and L12). I think gcc does what @GJ suggested (I upvoted him). The loop near L6 moves 4 bytes per iteration, while loop #2 moves only one byte per iteration and runs only sometimes after loop #1, presumably to copy the bytes left over. I still don't understand the purpose of loop #3, since it looks identical to loop #2.
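To make the three-loop shape easier to picture, here is a rough C sketch of that structure. It is a hand-written approximation, not gcc's actual output: the name copy_mem_wide and the exact conditions are assumptions for illustration. It also shows one plausible reason (again an assumption, not something confirmed by the posted assembly) for a second, identical byte loop: a fallback path taken when the copy is too short or when src and dst are so close together that a 4-byte move would not behave like the original byte-by-byte copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef unsigned char UChar;

/* Hypothetical name; approximates the loop structure described above. */
void copy_mem_wide(const UChar *src, UChar *dst, unsigned int len) {
    unsigned int i = 0;

    /* Distance between the two buffers, in bytes. */
    uintptr_t s = (uintptr_t)src, d = (uintptr_t)dst;
    uintptr_t gap = s > d ? s - d : d - s;

    /* Fallback byte loop (assumed role of one of the two identical loops):
       used for very short copies or buffers less than 4 bytes apart. */
    if (len < 4 || gap < 4) {
        for (; i < len; i++)
            dst[i] = src[i];
        return;
    }

    /* Main loop: move 4 bytes per iteration.
       memcpy of a fixed 4-byte size avoids unaligned-access problems. */
    for (; len - i >= 4; i += 4) {
        uint32_t w;
        memcpy(&w, src + i, sizeof w);
        memcpy(dst + i, &w, sizeof w);
    }

    /* Tail byte loop: copy the 0 to 3 bytes left over after the main loop. */
    for (; i < len; i++)
        dst[i] = src[i];
}
```

With len = 11 and non-overlapping buffers, the main loop runs twice (8 bytes) and the tail loop copies the remaining 3 bytes. With dst = src + 1, the fallback loop runs instead, so the result matches what the original copy_mem would produce.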