To compute the inverse of the square root of a double float in C, the most natural code would be the following: `1 / sqrt(x)` (`sqrt` being a math.h function). At high optimization levels, I expected that to compile down to a square root instruction and a double division instruction. Indeed, the assembly includes the `sqrtsd` and `divsd` instructions, but also maintains a call to the `sqrt` function, which confuses me: why both the `sqrtsd` instruction and the function call?
(My benchmark does `sum += 1 / sqrt(i);` in a loop with `int i` from 1 to 1G, using the result so it won't optimize away.)
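Roughly, the loop looks like this (a minimal sketch, not my exact harness):

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    double sum = 0.0;
    // i from 1 to 1G; printing sum keeps the loop from being optimized away
    for (int i = 1; i <= 1000000000; i++)
        sum += 1 / sqrt(i);
    printf("%f\n", sum);
    return 0;
}
```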
Nevertheless, I expect the cost of the expression to be dominated by the sequential application of the square root and double division instructions. Since the division uses the result of the square root, the two instructions cannot be parallelized, and so the overall latency could be approximated with `latency of sqrt + latency of double division`.
Then I thought about the following expression: `(1 / x) * sqrt(x)`. It's mathematically equivalent; I'm not sure about the effects on precision, but let's focus on performance. This time we have a double division, a square root, and a double multiplication; however, the double division does not have to wait for the result of the square root, theoretically. I would assume that this means that on a modern CPU the two instructions can be executed mostly in parallel thanks to pipelining. This means that the overall latency shouldn't be far from `max(latency of sqrt, latency of double division) + latency of multiplication`.
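For concreteness, here are the two candidates as standalone functions (a sketch of the expressions under discussion; the function names are mine):

```c
#include <math.h>

// Expression 1: the divide must wait for the square root's result.
double inv_sqrt_div(double x) { return 1 / sqrt(x); }

// Expression 2: the divide and the square root are independent;
// only the final multiply joins the two dependency chains.
double inv_sqrt_mul(double x) { return (1 / x) * sqrt(x); }
```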
This is all very approximate; however, from what I've found, on modern Intel CPUs the square root and FP division have latencies of around 10-20+ cycles, while an FP multiplication takes just ~5 cycles. Considering that there shouldn't be substantial differences between the two expressions outside of the additional multiplication and the read-after-write dependence, I expected those to be the factors that determine the relative performance, and so I thought that the multiplication approach would fare better. After all, mul is quicker than either div or sqrt.
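To put illustrative numbers on that reasoning (rough figures, not measurements from my machine): with, say, 16 cycles of sqrt latency, 14 cycles of division latency, and 4 cycles of multiplication latency, expression 1's dependency chain would cost about 16 + 14 = 30 cycles, while expression 2's would cost about max(16, 14) + 4 = 20 cycles.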
However, I have consistently obtained better results with the first expression in my (extremely amateurish) benchmarking, comparing the execution times of expressions 1 and 2 on my laptop with an Intel i5 CPU. Unfortunately, I struggle with understanding assembly in detail and I don't know how to profile it properly, so I'm stuck.
I would like to ask: why is the first expression faster in practice? Are there circumstances under which the second would perform better, or was my idea entirely misguided?
> I would assume that this means that on a modern CPU the two instructions can be executed mostly in parallel thanks to pipelining.
FP square root and divide compete for the same execution unit. And with out-of-order exec, parallelism between loop iterations can already keep the div/sqrt unit fully occupied with work to do. An extra multiply probably steals cycles on port 0 sometimes, delaying a sqrt or div from starting in the first possible cycle, so the throughput of the div/sqrt unit isn't fully utilized.
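To put rough numbers on that (illustrative figures; check uops.info for your exact core): if the divider can only accept a new `divsd` or `sqrtsd` every ~4-6 cycles, out-of-order execution across independent iterations already keeps it saturated, so the per-result cost is that throughput figure rather than the ~30-cycle latency chain, and an extra `mulsd` competing for the same port can only hurt.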
If your loop bottleneck was latency, e.g. `x = 1/sqrt(x)` where the next sqrt's input depends on the result of the previous iteration's div, unlike your throughput benchmark, then you'd have something to gain from your transformation, because on modern CPUs div and sqrt are partially pipelined. (Not 1/cycle even for single-precision float `1.0f / sqrtf(x)`, but throughput better than latency.) See the Godbolt link below for how this compiles.
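A sketch of that kind of latency-bound loop (a hypothetical function, just to illustrate the difference from the throughput benchmark):

```c
#include <math.h>

// Each iteration's sqrt consumes the previous iteration's divide
// result, so out-of-order exec can't overlap iterations. Here the
// (1 / x) * sqrt(x) rewrite could actually shorten the critical path.
double chained(double x, long n) {
    for (long i = 0; i < n; i++)
        x = 1 / sqrt(x);
    return x;
}
```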
See https://uops.info/ for performance details.
> why both the `sqrtsd` instruction and the function call?
Because you didn't compile with `-fno-math-errno`, so it calls the function for invalid args to set errno. It speculatively inlines `sqrtsd` for the common case, with a branch (`ucomisd`/`ja`). With that option, it can even vectorize with `sqrtpd`/`divpd`, and SIMD integer increment + packed conversion to double (Godbolt). Or the AVX versions if you also use `-march=x86-64-v3`, or `-march=native` on a machine with at least AVX1. But without `-ffast-math` it can't re-associate the `sum +=` to vectorize that part with `addpd`; it has to shuffle and do scalar adds, so the bottleneck can become `addsd` latency.
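For example, a compile command combining those options (these are standard GCC flags; adjust `-march` for your machine, and exact codegen varies by GCC version):

```
gcc -O3 -fno-math-errno -march=x86-64-v3 bench.c -lm -o bench
```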
Unfortunately GCC isn't smart enough to see that `i = [1, 1000000000)` is always a valid (finite, non-negative) arg for sqrt, so it could omit the check.
Related:
sqrtsd/pd