Tags: floating-point, precision, ieee-754

Exactly replicating repeated addition of floating-point values


Is there a sub-O(n) way [1] of replicating repeated summation of floating-point values in IEEE 754 arithmetic?

That is, is there a faster way of exactly [2] replicating the result of the following pseudocode:

f64 notQuiteFMA(f64 acc, f64 f, bigint n) {
    for (bigint i = 0; i < n; i++) {
        acc = acc + f;
    }
    return acc;
}

? [3]

In real arithmetic the answer is straightforward: return acc + f*n suffices. In finite precision, however, the answer is far more complex. As one example of this, note that notQuiteFMA(0, 1, 1 << 256) is somewhere around [4] 2^53, not the 2^256 one would expect in infinite precision. (This example also shows why speeding this calculation up could be desirable: counting to 2^256 one step at a time is rather impractical.)

  1. I am aware that this entire thing could technically be done in O(1) time [5] with a 2^128-entry lookup table, each entry holding an array of 2^64 elements; kindly ignore such galactic algorithms. (Pigeonhole principle: repeated addition cannot progress more than 2^64 steps before entering a loop, so 'just' store the entire prefix and the number of steps in the final loop for every initial value of acc and f.)
  2. Bit-exact given the current rounding mode (ignoring the can of worms that is non-signaling NaNs at least).
  3. Treat this as pseudocode - i.e. the sequence of double-precision floating-point additions described. I am aware there are cases in certain programming languages where floating-point calculations can be done in higher precision / reordered to make things more SIMD-friendly / etc / etc; I am not interested in these cases for this question.
  4. 64-bit floats have a 52-bit mantissa; x + 1 == x first occurs around 2^53, when the addition is insufficient to increment the mantissa and the result is rounded down. The exact value depends on the rounding mode, I believe.
  5. Assuming a computational model where modulo of a bigint by a constant-bounded value is O(1), at least.

Solution

  • Conclusion

    We can implement an O(log n) solution by stepping through binades. This is enabled by the fact that all additions within a binade, except possibly the first, change the accumulating sum by the same amount, and therefore we can calculate exactly what sum repeated additions would produce up to the end of the binade.

    Preliminaries

    A binade is an interval [2^e, 2^(e+1)) or (−2^(e+1), −2^e] for some integer e. For purposes of this answer, the entire minimum exponent range (including the minimum-exponent normals and all the subnormals) will be considered a single binade.

    For a floating-point number x (a number representable in the floating-point format), define E(x) to be the exponent used in the representation of x. Given the definition of binade used in this answer, E(x) identifies the binade x is in, except for its sign.

    Expressions in code format represent floating-point arithmetic. Written in ordinary text, x+y is the real-number-arithmetic result of adding x and y; written in code format, `x+y` is the floating-point result. Similarly, x and `x` denote the same value in the two notations, as appropriate.

    Floating-point addition is weakly monotonic for finite values: y < z implies x+y <= x+z, in any rounding mode, except when x is an infinity and y or z is an infinity of the opposite sign, resulting in a NaN.

    Discussion

    Let A0, A1, and A2 be the values of A before, between, and after two executions of A += f. Consider the case when E(A0) ≥ E(A1) = E(A2). Let g be A2 − A1. g is representable, either by Sterbenz' Lemma if A1 is normal or, if A1 is subnormal, because A1, A2, and f are all subnormal (or zero) and there are no bits below the ULP of A2 to be lost in the addition. (Since g is representable, the floating-point subtraction `A2-A1` calculates it with no rounding error.)

    We will show that all subsequent additions of f that produce a value of A with E(A) = E(A1) increase A by the same amount, g.

    Let u be the ULP of A1. Let r be the remainder of |f| modulo u. Let s be +1 if f is positive or −1 if f is negative. (The case when f is zero is trivial.) Using a directed rounding mode (toward −∞, toward zero, or toward +∞), the rounding error in any subsequent addition that does not change E(A) is either always −r or always u − r, according to the rounding method and the sign of f. Using round to nearest with ties to away, the rounding error is always −sr if r < ½u and always s(u − r) otherwise. With round to nearest with ties to even, if r ≠ ½u, the rounding error is again always −sr or always s(u − r), as with ties to away.

    That leaves round to nearest with ties to even with r = ½u. Let b be the bit of f in the u position. In this case, the low bit of A1 must be even, because it was the result of adding f, which has remainder modulo u of ½u, to A0 (which is necessarily 0 modulo u since E(A0) ≥ E(A1)) to produce A1. In that addition, the real-number result was exactly midway between two representable values, so it was rounded to an even low digit (a multiple of 2u).

    Given the low bit of A1 is even, we have two cases:

    b is 0, so the addition of f has a real-number-arithmetic result that is 0u + ½u modulo 2u ≡ ½u, and it is rounded to an even low digit, so the rounding error is −½u if the result is positive or +½u if it is negative. This then repeats for subsequent additions producing an A with E(A) = E(A1).

    b is 1, so the addition of f has a real-number-arithmetic result that is 1u + ½u modulo 2u ≡ 1½u, and it is rounded to an even low digit, so the rounding error is +½u if the result is positive and −½u if the result is negative. This then repeats for subsequent additions producing an A with E(A) = E(A1).

    (To see why we go to the second addition within a binade before asserting all subsequent additions add the same amount, observe the addition that produced A0 may have involved bits from a prior A with a lower ULP, in which case the real-number-arithmetic result might not have been exactly midway between representable values, so A0 could have an odd low digit even using round to nearest with ties to even, and A2 − A1 would not equal A1 − A0.)

    Implementation

    Given the above, we can calculate how many additions of g, say t, can be performed until just before the next binade is reached. Then we can calculate the exact value of A just before it exits the current binade, A + t·g. Then a single addition gives us the first A in the next binade, and another addition either takes us into yet another binade or gives us the information needed to perform the calculation for the current binade. We repeat this for each binade until the desired number of additions is reached.

    Below is code demonstrating the concept. This code needs some improvement, per comments. I expect to work on it when time permits, which may be more than a few weeks from now. The E function in the code takes the exponent from frexp, not clamping it to the minimum exponent as described in the discussion above. This may slow the code but should not make it incorrect. I expect the code handles subnormals. Infinities and NaNs could be added via ordinary tests.

    Calculation of t, the number of additions that can be performed before exiting the current binade, could use more discussion about the precision required and how rounding might affect it.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    
    static float Sum0(float A, float f, int n)
    {
        if (n < 0) { f = -f, n = -n; }
        while (n--) A += f;
        return A;
    }
    
    
    //  Return the exponent of x, designating which binade it is in.
    static int E(float x)
    {
        int e;
        frexp(x, &e);
        return e;
    }
    
    
    static float Sum1(float A, float f, int n)
    {
        if (n < 0) { f = -f, n = -n; }
    
        if (n == 0) return A;
    
        while (1)
        {
            //  Record initial exponent.
            int e0 = E(A);
    
            A += f;
            if (--n == 0) return A;
    
            //  If exponent changed, continue to do another step.
            if (E(A) != e0) continue;
    
            /*  Note it is possible that A crossed zero here and is in the "same"
                binade with a different sign.  With rounding modes toward zero or
                round to nearest with ties to away, the rounding caused by adding f
                may differ.  However, if we have crossed zero, the next addition
                will put A in a new binade, and the loop will exit this iteration
                and start a new step.
            */
    
            float A0 = A;
            A += f;
            if (--n == 0) return A;
    
            //  If exponent changed, continue to do another step.
            if (E(A) != e0) continue;
    
            //  Calculate exact change from one addition.
            float g = A - A0;
    
            //  If g is zero, no future additions will change the sum.
            if (g == 0) return A;
    
            /*  Calculate how many additions can be done until just before the
                next binade.
    
                If A and f have the same sign, A is increasing in magnitude, and
                the next binade is at the next higher exponent.  Otherwise, the
                next binade is at the bottom of the current binade.
    
                We use copysign to get the right sign for the target.  .5 is used
                with ldexpf because C's specifications for frexp and ldexpf
                normalize the significand to [.5, 1) instead of IEEE-754's [1, 2).
            */
            int e1 = e0 + (signbit(A) == signbit(f));
    
            double t = floor((ldexpf(copysignf(.5, A), e1) - A) / (double) g);
    
        /*  Temporary workaround to avoid incorrectly stepping to the edge of
                a lower binade:
            */
            if (t == 0) continue;
            --t;
    
            if (n <= t) return A + n*g;
    
            A += t*g;
            n -= t;
        }
    }
    
    
    static void Test(float A, float f, int n)
    {
        float A0 = Sum0(A, f, n);
        float A1 = Sum1(A, f, n);
        if (A0 != A1)
        {
            printf("Adding %a = %.99g to %a = %.99g %d times:\n", f, f, A, A, n);
            printf("\tSum0 -> %a = %.99g.\n", A0, A0);
            printf("\tSum1 -> %a = %.99g.\n", A1, A1);
            exit(EXIT_FAILURE);
        }
    }
    
    
    int main(void)
    {
        srand(1);
    
        for (int i = 0; i < 1000000; ++i)
        {
            if (i % 10000 == 0) printf("Tested %d cases.\n", i);
            Test(rand() - RAND_MAX/2, rand() - RAND_MAX/2, rand() % (1<<20));
        }
    }