In the documentation for the ARM instruction frsqrts, it says:
This instruction multiplies corresponding floating-point values in the vectors of the two source SIMD and FP registers, subtracts each of the products from 3.0, divides these results by 2.0, places the results into a vector, and writes the vector to the destination SIMD and FP register.
I interpret this as yₙ₊₁ = (3 - xyₙ)/2-and indeed the following code justifies this interpretation:
.global _main
.align 2
_main:
fmov d0, #2.0 // Goal: Compute 1/sqrt(2)
fmov d1, #0.5 // initial guess
frsqrts d2, d0, d1 // first approx
mov x0, 0
mov x16, #1 // '1' = terminate syscall
svc #0x80 // "supervisor call"
However, reading about the Newton iterate for the inverse square root, I see that the iteration is not yₙ₊₁ = (3 - xyₙ)/2, but rather yₙ₊₁ = yₙ(3 - xyₙ²)/2. Now, obviously I can use frsqrt
in combination with other instructions to get this:
fmov d0, #2.0 // Goal: Compute 1/sqrt(2)
fmov d1, #0.5 // initial guess
fmul d2, d1, d1 // initial guess squared
frsqrts d3, d0, d2 // (3-r*r*x)/2
fmul d4, d1, d3 // d4 = r*(3-r*r*x)/2
But is seems weird to introduce a custom instruction which only get your halfway to your goal. Am I misusing this instruction?
This represents an entirely conventional partitioning of the Newton-Raphson iteration for the reciprocal square root into simple RISC-like instructions.
For example, in AMD's 3dNow! instruction set extension for x86, this is the functionality of the instruction PFRSQIT1
(full disclosure: this is something I designed [1]). This functionality does not even require underlying FMA capability: It can be implemented via a slight modification to an existing multiplier since the result is close to 1.0 when it is used as intended, that is, as part of the NR-iteration for the reciprocal square root.
As the asker deduced, frsqrts
needs to receive the square of the current estimate for the reciprocal square root as one of its source operands. Since the frsqrte
instruction delivers an estimate to 1/sqrt(x) accurate to about 8 bits, computing a single-precision reciprocal square root will require two Newton-Raphson iterations. Conceptually:
frsqrte est0, x // initial approximation, accurate to about 8 bits
fmul est0_sq, est0, est0 // first NR iteration for reciprocal square root
frsqrts tmp, est0_sq, x
fmul est1, tmp, est0
fmul est1_sq, est1, est1 // second NR iteration for reciprocal square root
frsqrts tmp, est1_sq, x
fmul res, tmp, est1
This instruction sequence maps directly to a sequence of corresponding intrinsics: vrsqrte_f32()
, vmul_f32()
, and vrsqrts_f32()
.
[1] S. Oberman, F. Weber, N. Juffa, and G. Favor, "AMD 3DNow!TM Technology and the K6-2 Microprocessor." In HotChips 10, August 16-18, 1998 (online)