cassemblygcccpu-architecturemicro-optimization

Why is this reordering of sub and mul instructions helpful?


This is an example code from Computer Systems: A Programmer's Perspective.

short foo(short a, short b) {
    short result;
    result = b;
    while(b > 0) {
        result *= a;
        b -= a;
    }
    return result;
}

Compiling this with gcc14.2 using -O1 and -O3 gives the following assembly (godbolt)

O1

foo:
        movl    %esi, %eax
        movl    %esi, %ecx
        testw   %si, %si
        jle     .L1
.L3:
        imull   %edi, %eax
        subl    %edi, %ecx
        testw   %cx, %cx
        jg      .L3
.L1:
        ret

O3

foo:
        movl    %esi, %eax
        movl    %esi, %edx
        testw   %si, %si
        jle     .L1
.L3:
        subl    %edi, %edx
        imull   %edi, %eax
        testw   %dx, %dx
        jg      .L3
.L1:
        ret

With O3, sub comes before imul - why is this reordering helpful?

  1. I am assuming this has to do with instruction-level parallelism, but cannot get the exact reason. Is it that performing subtraction first will allow parts of multiplication to also get done with some idle ALUs?
  2. With respect to pipelining, is there any difference due to this reordering?

Solution

  • Just as you note, subl and imul can be executed in parallel (neither depends on the other), but testw does depend on subl's result (and theoretically, whether or not to use the result of the next imul depends on testw) As such, you want to complete, and therefore start, subl as soon as possible so that testw can start whether or not imul has completed.