cfloating-point-precisionepsilon

epsilon for various float values


There is FLT_MIN constant that is nearest to zero. How to get nearest to some number value?

As an example:

float nearest_to_1000 = 1000.0f + epsilon;
// epsilon must be the smallest value satisfying condition:
// nearest_to_1000 > 1000.0f

I would prefer numeric formula without using special functions.


Solution

  • C provides a function for this, in the <math.h> header. nextafterf(x, INFINITY) is the next representable value after x, in the direction toward INFINITY.

    However, if you'd prefer to do it yourself:

    The following returns the epsilon you seek, for single precision (float), assuming IEEE 754. See notes at the bottom about using library routines.

    #include <float.h>
    #include <math.h>
    
    
    /*  Return the ULP of q.
    
        This was inspired by Algorithm 3.5 in Siegfried M. Rump, Takeshi Ogita, and
        Shin'ichi Oishi, "Accurate Floating-Point Summation", _Technical Report
        05.12_, Faculty for Information and Communication Sciences, Hamburg
        University of Technology, November 13, 2005.
    */
    float ULP(float q)
    {
        // SmallestPositive is the smallest positive floating-point number.
        static const float SmallestPositive = FLT_EPSILON * FLT_MIN;
    
        /*  Scale is .75 ULP, so multiplying it by any significand in [1, 2) yields
            something in [.75 ULP, 1.5 ULP) (even with rounding).
        */
        static const float Scale = 0.75f * FLT_EPSILON;
    
        q = fabsf(q);
    
        /*  In fmaf(q, -Scale, q), we subtract q*Scale from q, and q*Scale is
            something more than .5 ULP but less than 1.5 ULP.  That must produce q
            - 1 ULP.  Then we subtract that from q, so we get 1 ULP.
    
            The significand 1 is of particular interest.  We subtract .75 ULP from
            q, which is midway between the greatest two floating-point numbers less
            than q.  Since we round to even, the lesser one is selected, which is
            less than q by 1 ULP of q, although 2 ULP of itself.
        */
        return fmaxf(SmallestPositive, q - fmaf(q, -Scale, q));
    }
    

    The following returns the next value representable in float after the value it is passed (treating −0 and +0 as the same).

    #include <float.h>
    #include <math.h>
    
    
    /*  Return the next floating-point value after the finite value q.
    
        This was inspired by Algorithm 3.5 in Siegfried M. Rump, Takeshi Ogita, and
        Shin'ichi Oishi, "Accurate Floating-Point Summation", _Technical Report
        05.12_, Faculty for Information and Communication Sciences, Hamburg
        University of Technology, November 13, 2005.
    */
    float NextAfterf(float q)
    {
        /*  Scale is .625 ULP, so multiplying it by any significand in [1, 2)
            yields something in [.625 ULP, 1.25 ULP].
        */
        static const float Scale = 0.625f * FLT_EPSILON;
    
        /*  Either of the following may be used, according to preference and
            performance characteristics.  In either case, use a fused multiply-add
            (fmaf) to add to q a number that is in [.625 ULP, 1.25 ULP].  When this
            is rounded to the floating-point format, it must produce the next
            number after q.
        */
    #if 0
        // SmallestPositive is the smallest positive floating-point number.
        static const float SmallestPositive = FLT_EPSILON * FLT_MIN;
    
        if (fabsf(q) < 2*FLT_MIN)
            return q + SmallestPositive;
    
        return fmaf(fabsf(q), Scale, q);
    #else
        return fmaf(fmaxf(fabsf(q), FLT_MIN), Scale, q);
    #endif
    }
    

    Library routines are used, but fmaxf (maximum of its arguments) and fabsf (absolute value) are easily replaced. fmaf should compile to a hardware instruction on architectures with fused multiply-add. Failing that, fmaf(a, b, c) in this use can be replaced by (double) a * b + c. (IEEE-754 binary64 has sufficient range and precision to replaced fmaf. Other choices for double might not.)

    Another alternative to the fused-multiply add would be to add some tests for cases where q * Scale would be subnormal and handle those separately. For other cases, the multiplication and addition can be performed separately with ordinary * and + operators.