assemblyx86x86-64cpu-architecturemicro-architecture

uops for integer DIV instruction


I was looking the Agner Fog's instruction tables here, specifically I was looking at the sandy bridge case, and there is one thing that has caught my attention. If you look DIV instructions you can see that, for example, r64 DIV instruction can be decoded up to 56 uops! My question is: is it true or have I made a missinterpretation?

This is something that doesn't even get into my head. I've always thougt that an integer division of 2 registers was decoded in only 1 uop. And thought that that uop was dispatched to Port 0 (for example in Sandy Bridge).

What I thought that happenned here is: The uop is dispatched to Port0 and it finishes some cycles later. But, thanks to the pipelining, 1 div uop (or another uop that needs port0) can be sent to that port on each cycle. But this has completely broken my schemes: 56 different uops which need to be dispatched in 56 different cycles and occuping 56 ROB entries to ONLY do 1 integer division?


Solution

  • Not all of those uops run on the actual divider unit on port 0. It seems only signed idiv is that many uops on Skylake, div r64 is "only" 33 uops. Perhaps signed idiv r64 is taking absolute values to do extended-precision division using a narrower HW divider unit, like you'd do for software extended-precision? (Why is __int128_t faster than long long on x86-64 GCC?)

    And idiv/div r32 is "only" 10 uops, probably only 1 or 2 of them needing the actual divide unit on port 0, the others doing IDK what on other ports. Note the counts for arith.divider_active shown in Skylake profile results on Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux - div r64 with small inputs barely keeps the actual port 0 divider active for longer than div r32, but the other overhead makes it much slower.

    FP division is actually single-uop because FP div performance is important in some real-world algorithms. (Especially effect of one divpd on front-end throughput of surrounding code). See Floating point division vs floating point multiplication

    See also Do FP and integer division compete for the same throughput resources on x86 CPUs? - Ice Lake improves the divider HW.


    See also discussion in comments clearing up other misconceptions.

    Related:

    I think I've read that modern divider units are typically built with an iterative not-fully-pipelined part, and then 2 Newton Raphson steps which are pipelined. So that's how division can be partially pipelined on modern CPUs: the next one can start as soon as the current one can move into the Newton-Raphson pipelined part of the execution unit.