floating-pointieee-754

How many values can be represented in a range when using 64-bit floating point type in the most efficient manner


Given a 64-bit floating point type (ieee-754), and a closed range [R0,R1] assuming both R0 and R1 are within the representable range and are not NaN or +/-inf etc.

How does one calculate the number of representable values with in the range that is better than simply looping over the range?

std::uint64_t count = 0;
while (r0 <= r1) 
{
    count++;
    r0 = std::nextafterf(r0,std::numeric_limits<double>infinity());
}

Solution

  • How many values can be represented in a range when using floating point type

    A cheat assumes the double matches the common IEEE 64-bit layout, the endian and size are like (un)signed long long or (u)int64_t.

    Mapping (via a union) each successive double from -INF to +INF then every increasing integer value we have the below. Note that +0 and -0 are considered the same value there. If you want +0 and -0 to differ for this count change if (y.ll < 0) { y.ll = LLONG_MIN - y.ll; } to if (signbit(y.ll)) { y.ll = LLONG_MIN - y.ll - 1; } (or something like that).

    The below also returns 0 on ULP_diff(x,x), so you might want to add 1 to capture both end points.

    #include <assert.h>
    #include <limits.h>
    
    unsigned long long ULP_diff(double x, double y) {
      assert(sizeof(double) == sizeof(long long));
      union {
        double d;
        long long ll;
        unsigned long long ull;
      } ux, uy;
      ux.d = x;
      uy.d = y;
      if (ux.ll < 0) {
        ux.ll = LLONG_MIN - ux.ll;
      }
      if (uy.ll < 0) {
        uy.ll = LLONG_MIN - uy.ll;
      }
      unsigned long long diff;
      if (ux.ll >= uy.ll) {
        diff = ux.ull - uy.ull;
      } else {
        diff = uy.ull - ux.ull;
      }
      return diff;
    }
    

    The above function is very useful in assessing a rating on how close 2 double values are to each other.


    Some additional comments

    It is not a numeric conversion in the usual sense. It is more like a one-to-one bit-mapping of double to 64-bit signed integer. The 2^52 has nothing to do with it. With 64-bit double there are about 2^64 bit finite values. The union here maps the ascending double values to the int64_t values from a very negative value to up near INT64_MAX. Think of 0.0 --> 0, next_after(0, 1) --> 1 and so on up to DBL_MAX --> a integer value near INT64_MAX. and the negatives double map to negative int64_t.

    The positive double encoding maps cleanly to positive int64_t. Negative double are more like sign-magnitude encoding and so need a x.ll = LLONG_MIN - x.ll; to flip around for int64_t 2's complement encoding.

    .d (in code)  .d (decimal FP) .d (hex FP)             .ull               .ll
    -INFINITY     -inf           -inf                     0x8010000000000000 -9218868437227405312
    -DBL_MAX      -1.79769e+308  -0x1.fffffffffffffp+1023 0x8010000000000001 -9218868437227405311
    -1.0          -1             -0x1p+0                  0xC010000000000000 -4607182418800017408
    -DBL_MIN      -2.22507e-308  -0x1p-1022               0xFFF0000000000000    -4503599627370496
    -DBL_TRUE_MIN -4.94066e-324  -0x1p-1074               0xFFFFFFFFFFFFFFFF                   -1
    -0.0          -0             -0x0p+0                  0x0000000000000000                    0
    +0.0           0              0x0p+0                  0x0000000000000000                    0
     DBL_TRUE_MIN  4.94066e-324   0x1p-1074               0x0000000000000001                    1
     DBL_MIN       2.22507e-308   0x1p-1022               0x0010000000000000     4503599627370496
     1.0           1              0x1p+0                  0x3FF0000000000000  4607182418800017408
     DBL_MAX       1.79769e+308   0x1.fffffffffffffp+1023 0x7FEFFFFFFFFFFFFF  9218868437227405311
     INFINITY      inf            inf                     0x7FF0000000000000  9218868437227405312