Given a 64-bit floating point type (ieee-754), and a closed range [R0,R1] assuming both R0 and R1 are within the representable range and are not NaN or +/-inf etc.
How does one calculate the number of representable values with in the range that is better than simply looping over the range?
std::uint64_t count = 0;
while (r0 <= r1)
{
count++;
r0 = std::nextafterf(r0,std::numeric_limits<double>infinity());
}
How many values can be represented in a range when using floating point type
A cheat assumes the double
matches the common IEEE 64-bit layout, the endian and size are like (un)signed long long
or (u)int64_t
.
Mapping (via a union
) each successive double
from -INF to +INF then every increasing integer value we have the below. Note that +0 and -0 are considered the same value there. If you want +0 and -0 to differ for this count change if (y.ll < 0) { y.ll = LLONG_MIN - y.ll; }
to if (y.ll < 0) { y.ll = LLONG_MIN - y.ll - 1; }
(or something like that).
The below also returns 0 on ULP_diff(x,x)
, so you might want to add 1 to capture both end points.
#include <assert.h>
#include <limits.h>
unsigned long long ULP_diff(double x, double y) {
assert(sizeof(double) == sizeof(long long));
union {
double d;
long long ll;
unsigned long long ull;
} ux, uy;
ux.d = x;
uy.d = y;
if (ux.ll < 0) {
ux.ll = LLONG_MIN - ux.ll;
}
if (uy.ll < 0) {
uy.ll = LLONG_MIN - uy.ll;
}
unsigned long long diff;
if (ux.ll >= uy.ll) {
diff = ux.ull - uy.ull;
} else {
diff = uy.ull - ux.ull;
}
return diff;
}
The above function is very useful in assessing a rating on how close 2 double
values are to each other.
Some additional comments
It is not a numeric conversion in the usual sense. It is more like a one-to-one bit-mapping of double to 64-bit signed integer. The 2^52 has nothing to do with it. With 64-bit double there are about 2^64 bit finite values. The union here maps the ascending double values to the int64_t values from a very negative value to up near INT64_MAX
. Think of 0.0 --> 0, next_after(0, 1)
--> 1 and so on up to DBL_MAX
--> a integer value near INT64_MAX. and the negatives double map to negative int64_t.
The positive double encoding maps cleanly to positive int64_t. Negative double are more like sign-magnitude encoding and so need a x.ll = LLONG_MIN - x.ll;
to flip around for int64_t 2's complement encoding.
.d (in code) .d (decimal FP) .d (hex FP) .ull .ll
-INFINITY -inf -inf 0x8010000000000000 -9218868437227405312
-DBL_MAX -1.79769e+308 -0x1.fffffffffffffp+1023 0x8010000000000001 -9218868437227405311
-1.0 -1 -0x1p+0 0xC010000000000000 -4607182418800017408
-DBL_MIN -2.22507e-308 -0x1p-1022 0xFFF0000000000000 -4503599627370496
-DBL_TRUE_MIN -4.94066e-324 -0x1p-1074 0xFFFFFFFFFFFFFFFF -1
-0.0 -0 -0x0p+0 0x0000000000000000 0
+0.0 0 0x0p+0 0x0000000000000000 0
DBL_TRUE_MIN 4.94066e-324 0x1p-1074 0x0000000000000001 1
DBL_MIN 2.22507e-308 0x1p-1022 0x0010000000000000 4503599627370496
1.0 1 0x1p+0 0x3FF0000000000000 4607182418800017408
DBL_MAX 1.79769e+308 0x1.fffffffffffffp+1023 0x7FEFFFFFFFFFFFFF 9218868437227405311
INFINITY inf inf 0x7FF0000000000000 9218868437227405312