cgccx86compiler-optimizationalloca

Strange assembly code for c alloca function with optimization disabled - gcc uses DIV and IMUL by a constant 16, and shifts?


I have this simple code in c

#include <stdio.h>
#include <alloca.h>

int main()
{
    char* buffer = (char*)alloca(600);
    snprintf(buffer, 600, "Hello %d %d %d\n", 1, 2, 3);

    return 0;
}

I would expect that generated assembly code for alloca function would just decrement stack pointer(one sub instruction), and maybe do some alignments (one and instruction), but the resulting assembly code is very complicated and even more inefficient than you'd expect.

This is the output of objdump -d main.o, on the output of gcc -c (with no optimization, so the default -O0)

    0000000000400596 <main>:
  400596:   55                      push   %rbp
  400597:   48 89 e5                mov    %rsp,%rbp
  40059a:   48 83 ec 10             sub    $0x10,%rsp
  40059e:   b8 10 00 00 00          mov    $0x10,%eax
  4005a3:   48 83 e8 01             sub    $0x1,%rax
  4005a7:   48 05 60 02 00 00       add    $0x260,%rax
  4005ad:   b9 10 00 00 00          mov    $0x10,%ecx
  4005b2:   ba 00 00 00 00          mov    $0x0,%edx
  4005b7:   48 f7 f1                div    %rcx
  4005ba:   48 6b c0 10             imul   $0x10,%rax,%rax
  4005be:   48 29 c4                sub    %rax,%rsp
  4005c1:   48 89 e0                mov    %rsp,%rax
  4005c4:   48 83 c0 0f             add    $0xf,%rax
  4005c8:   48 c1 e8 04             shr    $0x4,%rax
  4005cc:   48 c1 e0 04             shl    $0x4,%rax
  4005d0:   48 89 45 f8             mov    %rax,-0x8(%rbp)
  4005d4:   48 8b 45 f8             mov    -0x8(%rbp),%rax
  4005d8:   41 b9 03 00 00 00       mov    $0x3,%r9d
  4005de:   41 b8 02 00 00 00       mov    $0x2,%r8d
  4005e4:   b9 01 00 00 00          mov    $0x1,%ecx
  4005e9:   ba a8 06 40 00          mov    $0x4006a8,%edx
  4005ee:   be 58 02 00 00          mov    $0x258,%esi
  4005f3:   48 89 c7                mov    %rax,%rdi
  4005f6:   b8 00 00 00 00          mov    $0x0,%eax
  4005fb:   e8 a0 fe ff ff          callq  4004a0 <snprintf@plt>
  400600:   b8 00 00 00 00          mov    $0x0,%eax
  400605:   c9                      leaveq 
  400606:   c3                      retq   
  400607:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  40060e:   00 00 

Any idea what is the aim of this generated assembly code? I'm using gcc 8.3.1.


Solution

  • There is of course the usual debug-mode / anti-optimized behaviour of compiling each C statement to a separate block, with non-register variables actually in memory. (Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?).

    But yes, this goes beyond "not optimized". No sane person would expect GCC's canned sequence of instructions (or GIMPLE or RTL logic, whatever stage it's expanded) for alloca logic to involve a div by compile-time-constant power of 2, instead of a shift or just an AND. x /= 16; doesn't compile to a div if you write that yourself in C source, even with gcc -O0.

    Normally GCC does compile-time evaluation of constant expressions as much as possible, like x = 5 * 6 won't use imul at runtime. But the point at which it expands its alloca logic must be after that point, probably pretty late (after most other passes) to explain all those missed optimizations. So it doesn't benefit from the same passes that operate on your C source logic.

    It's doing 2 things:

    Then of course it has to store the pointer into char* buffer.

    And in the block of asm for the next statement, reload it as an arg for sprintf (inefficiently into RAX instead of directly into RDI, typical for gcc -O0), along with setting up the register args.


    So this sucks a lot, but is very plausibly explained by late expansion of the canned logic for alloca, after most transformation ("optimization") passes have already run. Note that -O0 doesn't literally mean "no optimization", it just means "compile fast, and give consistent debugging".


    Related: