I noticed that Clang does an interesting division optimization trick for the following snippet
int64_t s2(int64_t a, int64_t b)
{
return a/b;
}
Below is the assembly output if specifying march
as Sandy Bridge or above
mov rax, rdi
mov rcx, rdi
or rcx, rsi
shr rcx, 32
je .LBB1_1
cqo
idiv rsi
ret
.LBB1_1:
xor edx, edx
div esi
ret
Here are the Godbolt links for the signed version and the unsigned version
From what I understand it checks whether the high bits of the two operands are zero, and does a 32-bit division if that's true
I checked this table and see that the latencies for 32/64-bit division on Core2 and Nehalem are 40/116 and 26/89 respectively. Hence if the operands are indeed often not wide then the savings by doing a 32-bit division instead of a 64-bit one may be just as worth as on SnB
So why is it enabled only for SnB and later microarchitectures? Why don't other compilers like GCC or ICC do it?
I'm guessing that clang devs tested which uarches it was good on, and found it was only SnB-family.
That sounds right, because of a funky stall on P6-family, and AMD's different dividers.
Using the flag result from a shift imm8 (not a shift-by-implicit-1) on P6-family causes the front-end to stall before issuing the flag-reading instruction until the shift is retired. (Because the P6 decoders don't check for the imm8=0 case for leaving flags unmodified, while SnB does). INC instruction vs ADD 1: Does it matter?. That might be why clang doesn't use it for P6-family.
Probably a different way of checking the relevant condition that didn't cause this stall (like a test rcx,rcx
before the je
, would make it worth it on Core2/Nehalem). But if clang devs didn't realize the reason it was slow on P6-family, they wouldn't have thought to fix it, and just left it not done for pre-SnB targets. (Nobody added me to a patch review or bug CC list about this one, unfortunately; this is the first I've seen of clang doing this optimization. Although I think I might have mentioned shift flag stalls in comments on some other LLVM review or bug. Anyway, might be fun to try adding a test
and see if that makes it worthwhile on Nehalem.)
AMD's dividers have the same best-case div performance regardless of operand-size, presumably depending only on the actual magnitude of the inputs, according to Agner Fog. Only the worst-case grows with operand-size. So I think it's harmless to run idiv r64
with small inputs sign-extended to 128 / 64-bit on AMD. (div/idiv on AMD is 2 uops for all operand sizes (except 8-bit where it's one because it only has to write one output register: AH and AL = AX. Unlike Intel's microcoded integer division.)
Intel before Ice Lake is very different: idiv r32
is 9 uops, vs. idiv r64
being 59 uops, with a best-case throughput that's 3x worse, on Haswell. Other members of SnB-family up to Skylake and its derivatives (like Cascade Lake) are similar.
Ice Lake added a new integer-division unit that handles idiv r64
in the same number of uops as idiv r32
, so probably the same performance for the same input numbers on that and newer Intel, at least for the P-cores in hybrid CPUs like Alder Lake. (For idiv r64
with inputs that could have fit in i64 dividend and i32 divisor).
Related:
Why don't other compilers like GCC or ICC do it?
Probably because Clang developers thought of it, and GCC/ICC haven't copied them yet. If you've watched Chandler Carruth's talks about perf
, one example he used was playing around with a branch to skip a div
. I'm guessing this optimization was his idea. Looks nifty. :)