I can choose between the following 80387 instructions:
fdiv dword ptr a
and
fdiv qword ptr b
The numbers a and b are equivalent; both of them are 100% accurate.
Is there a reason why I should choose the qword version? I can only think about speed as a benefit. Is there a difference on modern processors? Was there a difference on the 80387 or 80487?
dword size won't ever be slower. (Except due to some secondary effect, like a narrower memory location here leading to some other data being misaligned.)
On most CPUs, I think qword will be the same speed when it's the same value, other than data transfer time.
On some CPUs, if the low bits of the mantissa aren't all zero (so a divisor with more significant bits than would fit in a dword float), that might be slower. Unless the "round number" early-out cases only apply for numbers with many fewer significant bits.
Before P5 Pentium, it would take more cycles to transfer a qword than a dword, since it wasn't until P5 Pentium that data paths were widened to 64-bit. (Unless 486DX could transfer 64 bits between the FPU and its internal cache? 64-bit atomicity guarantees were new with P5.)
I think memory-source FP math operations are equivalent to widening (see footnote 1) to 80-bit first, as fld does, in terms of what data the FPU itself sees. So a qword float64 representing the exact same value as your dword float32 will result in the same input to the actual FPU work once it finishes loading the memory operand.
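As a rough way to check the premise that a dword and a qword operand hold "the exact same value", the sketch below widens both to long double and compares. The helper name same_fpu_input is hypothetical, and this only models the value the FPU would see, not any timing.

```c
/* Hypothetical helper: nonzero if a dword float and a qword double
 * represent exactly the same real number. */
static int same_fpu_input(float a, double b)
{
    /* float -> long double and double -> long double are both exact
     * widenings, so the comparison is on the unrounded values. */
    return (long double)a == (long double)b;
}
```

For example, 0.5f and 0.5 compare equal, but 0.1f and 0.1 do not: neither is exactly one tenth, and the two formats round it differently.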
For fadd / fsub / fmul instructions, performance doesn't depend on the input data or the precision setting in the x87 control word (how many mantissa bits it has to produce for the output), at least not on P5 or later.
But fdiv and fsqrt do depend on the precision setting. The instruction tables at https://agner.org/optimize/ go back only as far as P5. On P5, fdiv takes 19 / 33 / 39 cycles for output mantissa precision settings of 24, 53, or 64 bits, respectively. Lowering precision speeds up fdiv / fsqrt, but hurts precision for everything. Fun fact: Direct3D by default would lower FPU precision to the minimum 24-bit, probably because 3D geometry does a lot of sqrt and division for magnitudes of vectors.
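To make the precision setting concrete: it lives in the precision-control field, bits 8-9 of the x87 control word (00 = 24-bit, 10 = 53-bit, 11 = 64-bit, 01 reserved). The sketch below only decodes a control-word value in portable C; it doesn't read or write the real register (that would need fnstcw / fldcw). The function names are hypothetical, and the cycle counts are just the P5 figures quoted above.

```c
#include <stdint.h>

/* Mantissa precision in bits selected by the precision-control field
 * (bits 8-9) of an x87 control word, or 0 for the reserved encoding. */
static int x87_precision_bits(uint16_t control_word)
{
    switch ((control_word >> 8) & 0x3) {
    case 0x0: return 24;  /* single precision */
    case 0x2: return 53;  /* double precision */
    case 0x3: return 64;  /* extended precision, the finit default */
    default:  return 0;   /* 01b is reserved */
    }
}

/* P5 Pentium fdiv cycle count for a given control word, using the
 * 19 / 33 / 39 figures from the instruction tables; -1 if reserved. */
static int p5_fdiv_cycles(uint16_t control_word)
{
    switch (x87_precision_bits(control_word)) {
    case 24: return 19;
    case 53: return 33;
    case 64: return 39;
    default: return -1;
    }
}
```

With the finit default control word 0x037F this gives 64-bit precision and 39 cycles; clearing bits 8-9 (0x007F) gives 24-bit precision and 19 cycles.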
On some CPUs, fdiv performance depends on the actual data. For many CPUs, Agner Fog's instruction tables include a note on fdiv timings. The first such note, for AMD K7, says "Low values [clock-cycle counts] are for round divisors, e.g. powers of 2." Powers of 2 are the most round, with an all-zero mantissa, but the phrasing implies that other values can also be somewhat round and take fewer cycles.
So I'm guessing it's not just powers of 2 that are fast. A qword float64 that's also an exact float32 is somewhat round: only 23 non-zero mantissa bits, with the low 29 being all zero. But that's still potentially a lot of non-zero mantissa bits; maybe too many for any special cases to apply, IDK.
Later notes are more terse, just mentioning "round divisors", but presumably he means the same thing, not just powers of 2 as the only special case.
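To see how "round" a given qword divisor is in the sense discussed above, one can count the trailing zero bits of its stored 52-bit mantissa field. This is only a sketch with a hypothetical helper name, assuming double is IEEE binary64; whether any CPU's early-out logic actually looks at this count is exactly the open question.

```c
#include <stdint.h>
#include <string.h>

/* Number of trailing zero bits in the 52-bit stored mantissa field of a
 * double. Returns 52 when the field is all zero (e.g. a power of 2). */
static int trailing_zero_mantissa_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);          /* well-defined type pun */
    uint64_t frac = bits & ((1ULL << 52) - 1);
    if (frac == 0)
        return 52;
    int n = 0;
    while ((frac & 1) == 0) {
        frac >>= 1;
        n++;
    }
    return n;
}
```

A double that was widened from a normal float32 always reports at least 29 here, since the low 52 - 23 = 29 bits are zero padding, while a power of 2 reports 52 and a typical double like 0.1 reports a small number.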
Pentium-M / Core Solo/Duo has fdiv throughput of 8 to 37 cycles (and similar for divsd), with Agner Fog's note saying "High values are typical, low values are for low precision or round divisors." I think "low precision" is referring to the x87 control-word setting. I don't know if "round divisors" only means powers of 2 (all-zero mantissa), or if there's a sliding scale of how round it is, like how few significant mantissa bits there are. For integer div / idiv on P-M / Core 1, the note starts the same but adds "Core Solo/Duo is more efficient than Pentium M in cases with round values that allow an early-out algorithm."
CPUs with such notes:
K7, K8: round divisors are faster
K10: the note mentions that idiv speed depends on the number of significant bits in the absolute value of the dividend, and says to see AMD's optimization manual. But there's no fdiv note: fixed latency / throughput on K10. Same for Bobcat / Jaguar.
Bulldozer-family: variable latency and throughput for fdiv (and partially pipelined, with throughput better than latency), but no note explaining when.
Zen 1: variable latency and throughput, no note.
Zen 2 through Zen 4: fdiv latency = 15 cycles, throughput = one per 6 cycles.
P5 Pentium: FDIV takes 19, 33, or 39 clock cycles for 24, 53, and 64 bit precision respectively. FIDIV takes 3 clocks more. The precision is defined by bits 8-9 of the floating point control word.
P6 Pentium II / III: FDIV latency depends on precision specified in control word: 64 bits precision gives latency 38, 53 bits precision gives latency 32, 24 bits precision gives latency 18. Division by a power of 2 takes 9 clocks. Reciprocal throughput is 1/(latency-1). (No dependence on data values mentioned, other than the power-of-2 case.)
Pentium M / Core (1) Duo/Solo: High values are typical, low values are for low precision or round divisors.
Core 2 Merom and Wolfdale, and Nehalem: Round divisors or low precision give low values.
No notes on Sandybridge or later: fdiv timings are given as lat = 10-24c, recip throughput = 10-24c for SnB. IvB / Haswell slightly pipeline it (throughput a few cycles better than latency); Broadwell is when throughput becomes significantly better than latency.
Pentium 4: Latency and reciprocal throughput depend on the precision setting in the F.P. control word. Single precision: 23, double precision: 38, long double precision (default): 43. Also: Throughput of FP-MUL unit is reduced during the use of the FP-DIV unit. (Despite fdiv being a single uop issue/dispatch.)
Atom / Silvermont / Goldmont (plus) / Tremont: fdiv latency and throughput are fixed, with not even a mention of the precision setting helping.
Via Nano 2000 / 3000: fdiv latency and throughput are 15-42 (Nano 2000) or 14-23 cycles (Nano 3000). No note, so maybe just precision control.
Footnote 1: Widening is fairly trivial: pad the mantissa with zeros at the bottom, and adjust the biased exponent field so it represents the same power of 2. The 80-bit format also uses an explicit instead of implicit leading 1 (or 0 for subnormal) in the mantissa, so that bit is decoded from the exponent field.
The decoding process should be the same amount of work for dword or qword loads; P5 Pentium ran fld m32 / m64 in a single clock cycle.
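The widening described in the footnote can be sketched in software for the dword case. This is an illustration of the bit manipulation, not a claim about how any FPU implements it; the ext80_fields struct and widen_float32 name are hypothetical. It assumes float is IEEE binary32 and ignores the fact that a real fld would raise an exception for, and quiet, a signaling NaN. A float32 subnormal gets normalized, because the 80-bit format has enough exponent range to represent it as a normal number.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the fields of an x87 80-bit value. */
typedef struct {
    uint16_t sign;      /* 0 or 1 */
    uint16_t exponent;  /* 15-bit biased exponent, bias 16383 */
    uint64_t mantissa;  /* bit 63 is the explicit integer bit */
} ext80_fields;

static ext80_fields widen_float32(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    ext80_fields r;
    r.sign = (uint16_t)(bits >> 31);
    uint32_t exp  = (bits >> 23) & 0xFF;   /* 8-bit biased exponent */
    uint64_t frac = bits & 0x7FFFFF;       /* 23 stored mantissa bits */

    if (exp == 0xFF) {                     /* infinity or NaN */
        r.exponent = 0x7FFF;
        r.mantissa = (1ULL << 63) | (frac << 40);
    } else if (exp != 0) {                 /* normal: rebias, pad with zeros */
        r.exponent = (uint16_t)(exp - 127 + 16383);
        r.mantissa = (1ULL << 63) | (frac << 40);
    } else if (frac == 0) {                /* +0 or -0 */
        r.exponent = 0;
        r.mantissa = 0;
    } else {                               /* float32 subnormal: normalize */
        int e = -126;                      /* value = 0.frac * 2^-126 */
        uint64_t m = frac << 40;           /* integer bit (bit 63) is 0 */
        while ((m >> 63) == 0) {
            m <<= 1;
            e--;
        }
        r.exponent = (uint16_t)(e + 16383);
        r.mantissa = m;
    }
    return r;
}
```

For 1.0f this yields exponent 0x3FFF and mantissa 0x8000000000000000: the integer bit set and everything below it zero. The qword case would look the same, with an 11-bit exponent (bias 1023) and a 52-bit field shifted left by 11.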