I have been playing around with the Godbolt Compiler and typed this code:
constexpr int func(int x)
{
return x > 3 ? x * 2 : (x < -4 ? x - 4 : x / 2);
}
int main(int argc)
{
return func(argc);
}
The code is somewhat straight forward. The important part here is the final division by 2 inside func(int x)
. Since x is an integer, basically any compiler would simplify this to a shift to avoid division instructions.
The assembly from x86-64 gcc 12.2 -O3
(for Linux, thus System V ABI) looks like this:
main:
cmp edi, 3
jle .L2
lea eax, [rdi+rdi]
ret
.L2:
cmp edi, -4
jge .L4
lea eax, [rdi-4]
ret
.L4:
mov eax, edi
mov ecx, 2
cdq
idiv ecx
ret
You can see the final idiv ecx
command which is not a shift but an actual division by 2. I also tested clang and clang does actually reduce this to a shift.
main: # @main
mov eax, edi
cmp edi, 4
jl .LBB0_2
add eax, eax
ret
.LBB0_2:
cmp eax, -5
jg .LBB0_4
add eax, -4
ret
.LBB0_4:
mov ecx, eax
shr cl, 7
add cl, al
sar cl
movsx eax, cl
ret
Could this be due to inlining maybe? I am very curious about whats going on here.
main
specially: implicit __attribute__((cold))
So main
gets optimized less (or favouring size over speed), as it's usually only called once in most programs. __attribute__((cold))
isn't quite the same as -Os
(optimize for size), but it's a step in that direction which sometimes gets the cost heuristics to pick a naive division instruction.
As GCC dev Marc Glisse commented, don't put your code in a function called main
if you're benchmarking it or looking at how it optimizes. (There can be other special things besides cold
, e.g. MinGW GCC puts in an extra call
to an init function, and gcc -m32
adds code to align the stack by 16. All of which are noise you don't want for the code you're looking at. See also How to remove "noise" from GCC/clang assembly output?)
Another Q&A shows GCC putting main
in the .text.startup
section, along with other presumed "cold" functions. (This is good for TLB and paging locality; hopefully a whole page of init functions can get evicted after process startup. The idea is that code in main
probably only runs once, with real work happening inside some function it calls. This may not be true if real work inlines into main, or for simple programs.)
This is a bad heuristic for toy programs with all their code in main
, but it's what GCC does. Most real programs people run regularly aren't toys and have enough code in some other function that it doesn't inline into main
. Although it would be nice if the heuristic was a little smarter and removed the cold
if it turned out that the whole program or all the functions in a loop did optimize into main
, since some real programs are pretty simple.
You can override the heuristic with a GNU C function attribute.
__attribute__((hot)) int main(){ ...
optimizes the way you expect__attribute__((cold))
on a function not called main
produces idiv
.__attribute__((optimize("O3")))
doesn't help.int main(int x, char **y){ return x/2; }
does still use shifts with gcc -O2
, so main being cold
doesn't always have that effect (unlike -Os
).
But perhaps with your division already being conditional, GCC guesses that basic block doesn't even run every time so that's more reason to make it small instead of fast.
Insanely, GCC -Os
for x86-64 (Godbolt) does use idiv
for signed division a constant 2, not just for arbitrary constants (where GCC normally uses a multiplicative inverse even at -O0
). It doesn't save much if any code size vs. an arithmetic right shift with a fixup to round towards zero (instead of -inf), and can be much slower, especially for 64-bit integers on Intel before Ice Lake. Same for AArch64, where it's 2 fixed-size instructions either way, with sdiv
almost certainly being much slower.
sdiv
does save some code size on AArch64 for higher powers of 2 (Godbolt), but still so much slower that it's probably not a good tradeoff for -Os
. idiv
doesn't save instructions on x86-64 (as cdq
or cqo
into RDX is required), although maybe a couple bytes of code size. So probably only good for -Oz
where it would also use push 2
/ pop rcx
to get a small constant into a register in 3 bytes of x86-64 machine code instead of 5.