c++, g++, compiler-optimization, integer-division, constexpr-function

Why does gcc 12.2 not optimise divisions into shifts in this constexpr function called from main()?


I have been playing around with the Godbolt Compiler Explorer and typed this code:

constexpr int func(int x)
{
    return x > 3 ? x * 2 : (x < -4 ? x - 4 : x / 2);
}

int main(int argc)
{
    return func(argc);
}

The code is fairly straightforward. The important part here is the final division by 2 inside func(int x). Since x is an integer, basically any compiler would simplify this to a shift to avoid a division instruction.
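(For a signed int the "shift" is really an arithmetic shift plus a small rounding fixup so the result still truncates towards zero; a rough sketch of the equivalent, assuming 32-bit int and arithmetic right shifts for negative values, which mainstream targets do and C++20 guarantees:)

int div2(int x)
{
    int fixup = static_cast<unsigned>(x) >> 31;  // 1 if x is negative, else 0
    return (x + fixup) >> 1;                     // arithmetic shift after the rounding fixup
}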

The assembly from x86-64 gcc 12.2 -O3 (for Linux, thus System V ABI) looks like this:

main:
        cmp     edi, 3
        jle     .L2
        lea     eax, [rdi+rdi]
        ret
.L2:
        cmp     edi, -4
        jge     .L4
        lea     eax, [rdi-4]
        ret
.L4:
        mov     eax, edi
        mov     ecx, 2
        cdq
        idiv    ecx
        ret

You can see the final idiv ecx instruction, which is an actual division by 2 rather than a shift. I also tested clang, and it does reduce this to a shift:

main:                                   # @main
        mov     eax, edi
        cmp     edi, 4
        jl      .LBB0_2
        add     eax, eax
        ret
.LBB0_2:
        cmp     eax, -5
        jg      .LBB0_4
        add     eax, -4
        ret
.LBB0_4:
        mov     ecx, eax
        shr     cl, 7
        add     cl, al
        sar     cl
        movsx   eax, cl
        ret

Could this be due to inlining, maybe? I am very curious about what's going on here.


Solution

  • GCC treats main specially: implicit __attribute__((cold))

    So main gets optimized less aggressively (favouring size over speed), as it's usually only called once in most programs. __attribute__((cold)) isn't quite the same as -Os (optimize for size), but it's a step in that direction, and it sometimes gets the cost heuristics to pick a naive division instruction.

    As GCC dev Marc Glisse commented, don't put your code in a function called main if you're benchmarking it or looking at how it optimizes. (There can be other special things besides cold: e.g. MinGW GCC puts in an extra call to an init function, and gcc -m32 adds code to align the stack by 16, all of which is noise you don't want in the code you're looking at. See also How to remove "noise" from GCC/clang assembly output?)
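    For example, a minimal way to look at the codegen of the function itself (a sketch, not part of the original answer) is to give it external linkage outside of main, so GCC has to emit a standalone definition that isn't subject to the cold-main heuristic. Note that constexpr is dropped here, since constexpr implies inline and an unused inline function may not be emitted at all:

    // func.cpp: compile with g++ -O3 -S (or view on Godbolt) and read func's own asm,
    // instead of whatever gets inlined into a cold main.
    int func(int x)   // external linkage: GCC must emit a standalone definition
    {
        return x > 3 ? x * 2 : (x < -4 ? x - 4 : x / 2);
    }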

    Another Q&A shows GCC putting main in the .text.startup section, along with other presumed "cold" functions. (This is good for TLB and paging locality; hopefully a whole page of init functions can get evicted after process startup. The idea is that code in main probably only runs once, with real work happening inside some function it calls. This may not be true if real work inlines into main, or for simple programs.)

    This is a bad heuristic for toy programs with all their code in main, but it's what GCC does. Most real programs people run regularly aren't toys and have enough code in some other function that it doesn't all inline into main. It would be nice, though, if the heuristic were a little smarter and dropped the cold when it turned out that the whole program, or all the functions in a loop, did optimize into main, since some real programs are pretty simple.

    You can override the heuristic with a GNU C function attribute.
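    For instance, and this is an assumption on my part rather than something stated in the answer: marking main as hot with __attribute__((hot)) should counteract the implicit cold; check the generated asm on Godbolt to confirm the idiv turns into a shift.

    constexpr int func(int x)
    {
        return x > 3 ? x * 2 : (x < -4 ? x - 4 : x / 2);
    }

    // Hypothetical override: GNU attribute marking main as hot, so the cost model
    // no longer favours small code for it (verify the effect on the generated asm).
    __attribute__((hot)) int main(int argc, char **)
    {
        return func(argc);
    }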


    int main(int x, char **y){ return x/2; } does still use shifts with gcc -O2, so main being cold doesn't always have that effect (unlike -Os).

    But perhaps with your division already being conditional, GCC guesses that the basic block doesn't even run every time, so that's all the more reason to make it small instead of fast.


    Insanely, GCC -Os for x86-64 (Godbolt) does use idiv for signed division by a constant 2, not just for arbitrary constants (where GCC normally uses a multiplicative inverse even at -O0). It doesn't save much, if any, code size vs. an arithmetic right shift with a fixup to round towards zero (instead of -inf), and it can be much slower, especially for 64-bit integers on Intel before Ice Lake. Same for AArch64, where it's 2 fixed-size instructions either way, with sdiv almost certainly being much slower.
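    (To spell out the fixup: C++ integer division truncates towards zero, while a plain arithmetic right shift rounds towards -inf, so negative odd values need a correction. A small illustration, assuming C++17 static_assert and arithmetic right shifts for signed values, which C++20 guarantees:)

    static_assert(-3 / 2 == -1);           // integer division truncates towards zero
    static_assert((-3 >> 1) == -2);        // arithmetic right shift rounds towards -inf
    static_assert(((-3 + 1) >> 1) == -1);  // the fixup: add 1 to a negative value before shifting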

    sdiv does save some code size on AArch64 for higher powers of 2 (Godbolt), but it's still so much slower that it's probably not a good tradeoff for -Os. idiv doesn't save instructions on x86-64 (since cdq or cqo into RDX is required), although maybe a couple of bytes of code size. So it's probably only good for -Oz, where it would also use push 2 / pop rcx to get a small constant into a register in 3 bytes of x86-64 machine code instead of 5.