assemblyarmnewtons-methodsqrtsquare-root

Does ARM frsqrts need to be used with extra fmul instructions for a Newton iteration?


In the documentation for the ARM instruction frsqrts, it says:

This instruction multiplies corresponding floating-point values in the vectors of the two source SIMD and FP registers, subtracts each of the products from 3.0, divides these results by 2.0, places the results into a vector, and writes the vector to the destination SIMD and FP register.

I interpret this as yₙ₊₁ = (3 - xyₙ)/2-and indeed the following code justifies this interpretation:

.global _main
.align 2
_main:
    fmov d0, #2.0 // Goal: Compute 1/sqrt(2)
    fmov d1, #0.5 // initial guess
    frsqrts d2, d0, d1 // first approx

    mov x0, 0
    mov x16, #1 // '1' = terminate syscall
    svc #0x80   // "supervisor call"

However, reading about the Newton iterate for the inverse square root, I see that the iteration is not yₙ₊₁ = (3 - xyₙ)/2, but rather yₙ₊₁ = yₙ(3 - xyₙ²)/2. Now, obviously I can use frsqrt in combination with other instructions to get this:

    fmov d0, #2.0 // Goal: Compute 1/sqrt(2)
    fmov d1, #0.5 // initial guess
    fmul d2, d1, d1 // initial guess squared
    frsqrts d3, d0, d2 // (3-r*r*x)/2
    fmul d4, d1, d3 // d4 = r*(3-r*r*x)/2

But is seems weird to introduce a custom instruction which only get your halfway to your goal. Am I misusing this instruction?


Solution

  • This represents an entirely conventional partitioning of the Newton-Raphson iteration for the reciprocal square root into simple RISC-like instructions.

    For example, in AMD's 3dNow! instruction set extension for x86, this is the functionality of the instruction PFRSQIT1 (full disclosure: this is something I designed [1]). This functionality does not even require underlying FMA capability: It can be implemented via a slight modification to an existing multiplier since the result is close to 1.0 when it is used as intended, that is, as part of the NR-iteration for the reciprocal square root.

    As the asker deduced, frsqrts needs to receive the square of the current estimate for the reciprocal square root as one of its source operands. Since the frsqrte instruction delivers an estimate to 1/sqrt(x) accurate to about 8 bits, computing a single-precision reciprocal square root will require two Newton-Raphson iterations. Conceptually:

         frsqrte  est0, x             // initial approximation, accurate to about 8 bits
    
         fmul     est0_sq, est0, est0 // first NR iteration for reciprocal square root
         frsqrts  tmp, est0_sq, x
         fmul     est1, tmp, est0     
    
         fmul     est1_sq, est1, est1 // second NR iteration for reciprocal square root
         frsqrts  tmp, est1_sq, x
         fmul     res, tmp, est1
    

    This instruction sequence maps directly to a sequence of corresponding intrinsics: vrsqrte_f32(), vmul_f32(), and vrsqrts_f32().


    [1] S. Oberman, F. Weber, N. Juffa, and G. Favor, "AMD 3DNow!TM Technology and the K6-2 Microprocessor." In HotChips 10, August 16-18, 1998 (online)