
Is fdiv faster with a dword or qword argument?


I can choose between the following 80387 instructions:

fdiv dword ptr a

and

fdiv qword ptr b

The numbers a and b are equivalent; both of them represent the value exactly.

Is there a reason why I should choose the qword version? Speed is the only benefit I can think of. Is there a difference on modern processors? Was there a difference on the 80387 or 80487?


Solution

  • dword size won't ever be slower. (Except due to some secondary effect like a narrower memory location here leading to some other data being misaligned.)

    On most CPUs, I think qword will be the same speed when it's the same value, other than data transfer time.

    On some CPUs, a qword divisor whose low mantissa bits aren't all zero (i.e. one with more significant bits than would fit in a dword float) might be slower, unless the "round number" early-out cases only apply to numbers with far fewer significant bits.


    Before P5 Pentium, it would take more cycles to transfer a qword than a dword, since it wasn't until P5 Pentium that data paths were widened to 64-bit. (Unless 486DX could transfer 64 bits between the FPU and its internal cache? 64-bit atomicity guarantees were new with P5.)

    I think memory-source FP math operations are equivalent to widening (see footnote 1) to 80-bit first as fld does, in terms of what data the FPU itself sees, so a qword float64 representing the exact same value as your dword float32 will result in the same input to the actual FPU work once it finishes loading the memory operand.

    For fadd/fsub/fmul instructions, performance doesn't depend on the input data or the precision setting in the x87 control word (how many mantissa bits it has to produce for the output), at least not on P5 or later.

    But fdiv and fsqrt do depend on the precision setting. Agner Fog's instruction tables (https://agner.org/optimize/) go back only as far as P5. On P5, fdiv takes 19/33/39 cycles for output mantissa precision settings of 24, 53, or 64 bits, respectively. Lowering precision speeds up fdiv/fsqrt, but hurts precision for everything. Fun fact: Direct3D by default would lower FPU precision to the minimum 24-bit, probably because 3D geometry does a lot of sqrt and division for magnitudes of vectors.
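
As a rough illustration of the precision setting mentioned above: the x87 control word holds a 2-bit precision-control (PC) field in bits 8 and 9, where 00 selects a 24-bit mantissa, 10 selects 53-bit, and 11 selects 64-bit. The sketch below only computes the 16-bit value you would hand to fldcw; it does not touch the real FPU. The helper name with_precision is made up for this example, and the cycle counts are just the P5 numbers quoted above.

```python
# Sketch only: computes x87 control-word values, does not load them.
# PC field encodings (bits 8-9): 00 = 24-bit, 10 = 53-bit, 11 = 64-bit.
PRECISION_CONTROL = {24: 0b00, 53: 0b10, 64: 0b11}
PC_SHIFT = 8
PC_MASK = 0b11 << PC_SHIFT

DEFAULT_CW = 0x037F  # control word after finit: all exceptions masked, 64-bit precision

# P5 Pentium fdiv cycle counts per precision setting, as quoted in the text.
P5_FDIV_CYCLES = {24: 19, 53: 33, 64: 39}


def with_precision(control_word: int, mantissa_bits: int) -> int:
    """Return control_word with its PC field replaced (hypothetical helper)."""
    pc = PRECISION_CONTROL[mantissa_bits]
    return (control_word & ~PC_MASK & 0xFFFF) | (pc << PC_SHIFT)


for bits in (24, 53, 64):
    cw = with_precision(DEFAULT_CW, bits)
    print(f"{bits}-bit mantissa: control word {cw:#06x}, P5 fdiv cycles {P5_FDIV_CYCLES[bits]}")
```

In real code the value would be read with fnstcw, modified, and written back with fldcw, so that the other control-word fields (rounding mode, exception masks) are preserved.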

    On some CPUs, fdiv performance depends on the actual data. For many CPUs, Agner Fog's instruction tables include a note on fdiv timings. The first such note, for AMD K7, says "Low values [clock-cycle counts] are for round divisors, e.g. powers of 2." Powers of 2 are the most round, with an all-zero mantissa, but the phrasing implies that other values can also be somewhat round and take fewer cycles.

    So I'm guessing it's not just powers of 2 that are fast. A qword float64 that's also an exact float32 is somewhat round: at most 23 non-zero stored mantissa bits, with the low 29 being all zero. But that's still potentially a lot of non-zero mantissa bits; maybe too many for any special cases to apply, IDK.
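
The "low 29 bits are zero" claim is easy to check on any machine. This is a minimal sketch using Python's struct module to look at the raw float64 encoding; the helper names are made up for this example. It says nothing about whether a given divider actually takes an early-out for such values.

```python
import struct

LOW_29 = (1 << 29) - 1  # the float64 mantissa bits a float32 cannot hold


def float64_mantissa_bits(x: float) -> int:
    """Return the 52-bit stored mantissa field of x encoded as IEEE binary64."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits & ((1 << 52) - 1)


def is_exact_float32(x: float) -> bool:
    """True if x survives a round trip through IEEE binary32 unchanged.

    Assumes x is within float32 range; struct raises OverflowError otherwise.
    """
    return struct.unpack("<f", struct.pack("<f", x))[0] == x


for value in (2.0, 1.5, 0.1):
    mant = float64_mantissa_bits(value)
    print(f"{value!r}: exact float32 = {is_exact_float32(value)}, "
          f"low 29 bits zero = {(mant & LOW_29) == 0}, mantissa = {mant:052b}")
```

2.0 has an all-zero mantissa (a power of 2), 1.5 has a single set bit, and 0.1 is not exactly representable in either format, so its float64 encoding uses the low bits that a float32 would have rounded away.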

    Later notes are more terse, just mentioning "round divisors", but presumably he means the same thing, not just powers of 2 as the only special case.

    Pentium-M / Core Solo/Duo has fdiv throughput of 8 to 37 cycles (and similar for divsd), with Agner Fog's note saying "High values are typical, low values are for low precision or round divisors." I think "low precision" is referring to the x87 control-word setting. I don't know if "round divisors" only means powers of 2 (all-zero mantissa), or if there's a sliding scale of how round it is, like how few significant mantissa bits there are. For integer div / idiv on P-M / Core 1, the note starts the same but adds "Core Solo/Duo is more efficient than Pentium M in cases with round values that allow an early-out algorithm."

    CPUs with such notes:


    Footnote 1: Widening is fairly trivial: pad the mantissa with zeros at the bottom, and adjust the biased exponent field so it represents the same power of 2. The 80-bit format also uses an explicit instead of implicit leading 1 (or 0 for subnormal) in the mantissa, so that bit is decoded from the exponent field.

    The decoding process should be the same amount of work for dword or qword loads; P5 Pentium ran fld m32 / m64 in a single clock cycle.
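
The widening described in footnote 1 can be modelled in a few lines. This is a sketch of the bit manipulation only, not of how any FPU implements it; the function names are invented for this example, and it ignores the quieting of signaling NaNs that a real fld performs. It returns the three fields of the 80-bit format: sign, 15-bit biased exponent (bias 16383), and 64-bit significand with the explicit integer bit at bit 63.

```python
import struct

X87_BIAS = 16383  # exponent bias of the 80-bit extended format


def widen_to_x87(bits: int, exp_bits: int, frac_bits: int):
    """Widen a binary32/binary64 bit pattern to (sign, exponent15, significand64)."""
    bias = (1 << (exp_bits - 1)) - 1
    sign = bits >> (exp_bits + frac_bits)
    exp = (bits >> frac_bits) & ((1 << exp_bits) - 1)
    frac = bits & ((1 << frac_bits) - 1)

    if exp == (1 << exp_bits) - 1:
        # Infinity or NaN: maximum exponent, payload padded with zeros at the bottom.
        return sign, 0x7FFF, (1 << 63) | (frac << (63 - frac_bits))
    if exp == 0:
        if frac == 0:
            return sign, 0, 0  # signed zero
        # Source subnormal: representable as a normal 80-bit value, so
        # shift the top set bit up into the explicit integer-bit position.
        top = frac.bit_length() - 1
        return sign, top + 1 - bias - frac_bits + X87_BIAS, frac << (63 - top)
    # Normal number: re-bias the exponent, make the leading 1 explicit,
    # and pad the mantissa with zeros at the bottom.
    return sign, exp - bias + X87_BIAS, (1 << 63) | (frac << (63 - frac_bits))


def from_float32(x: float):
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return widen_to_x87(bits, 8, 23)


def from_float64(x: float):
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return widen_to_x87(bits, 11, 52)


for value in (1.5, -2.0, 0.1):
    s32, e32, m32 = from_float32(value)
    s64, e64, m64 = from_float64(value)
    print(f"{value!r}: m32 -> ({s32}, {e32}, {m32:#018x})  m64 -> ({s64}, {e64}, {m64:#018x})")
```

For 1.5 and -2.0 the dword and qword encodings widen to identical 80-bit values, which is the situation in the question. For 0.1 they differ, because the dword has already been rounded to 24 bits.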