Tags: c++, x86-64, long-double, numeric-limits, 128-bit

C++23 numeric_limits<long double>::max() use case


Following the discussion in c++ long double (128-bit) precision, what's the use case of numeric_limits<long double>::max()? How can I trust that constant, for example in checking for range / overflow error?

If FE_INEXACT is raised way before reaching numeric_limits<long double>::max(), then how relevant is that constant?

    feclearexcept(FE_INEXACT);
    long double ld = 1.0L;
    for (size_t i = 2; ld < numeric_limits<long double>::max(); i++)
    {
        long double ld1 = ld * i;
        if (fetestexcept(FE_INEXACT))
        {
            cout << "Inexact with i = " << i << endl;
            break;
        }
        ld = ld1;
    }
    cout << "ld: " << ld << endl;

Output:

Inexact with i = 26
ld: 15511210043330985984000000.000000

To be relevant, and not misleading at all, shouldn't it be defined as the maximum value of the 64-bit significand on a system where long double is the 80-bit extended-precision format, i.e. a value that can be reached without the FE_INEXACT side effect?

The following code snippet validates Peter Cordes' statement:

(1) 1 / 3 expecting FE_INEXACT:

    feclearexcept(FE_INEXACT);
    volatile long double ld2 = 1.0L, ld3 = 3.0L;
    volatile long double ld4 = ld2 / ld3;
    assert(fetestexcept(FE_INEXACT));

Caveats and pitfalls on Windows with MSVC:

The range check between size_t (src_T) and long double (dst_T) fails on Windows with MSVC. Both types are 64 bits wide, but their ranges differ enormously, so the cast used in the check goes out of range: numeric_limits<size_t>::max() = 18446744073709551615, while static_cast<size_t>(numeric_limits<long double>::max()) yields 9223372036854775808 (that cast is out of range, so the value is not meaningful). numeric_limits<long double>::max() is far bigger than the maximum of size_t:

numeric_limits<size_t>::max(): 18446744073709551615
numeric_limits<long double>::max(): 179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.000000
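
To see what actually goes wrong, here is a minimal sketch (output depends on the implementation; on MSVC, long double is IEEE binary64, so SIZE_MAX is not exactly representable and rounds up to 2^64 on conversion, while with the 80-bit x87 format the conversion is exact):

    #include <cstddef>
    #include <cstdio>
    #include <limits>

    int main() {
        // 53 significand bits when long double is just binary64 (MSVC),
        // 64 bits for the x87 80-bit extended-precision format (GCC on x86-64).
        std::printf("long double significand bits: %d\n",
                    std::numeric_limits<long double>::digits);

        long double converted = static_cast<long double>(std::numeric_limits<std::size_t>::max());
        // 2^64 is exactly representable in binary64 and in the 80-bit format.
        const long double two_pow_64 = 18446744073709551616.0L;
        std::printf("SIZE_MAX as long double: %.0Lf\n", converted);
        std::printf("rounded up to 2^64:      %s\n", converted >= two_pow_64 ? "yes" : "no");
    }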

Solution

  • numeric_limits<T>::max() is the largest finite value for that type.
    If that's not what you want to know about a type, use something else.
    Relevant use-cases for max() for an FP type include lowest_seen = std::numeric_limits<T>::max(); before looping over an array to find the minimum.
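
    A minimal sketch of that idiom (the helper name and the use of std::vector are just for illustration):

        #include <limits>
        #include <vector>

        long double find_min(const std::vector<long double>& v) {
            // Start at the largest finite value so any element compares smaller.
            long double lowest_seen = std::numeric_limits<long double>::max();
            for (long double x : v)
                if (x < lowest_seen)
                    lowest_seen = x;
            return lowest_seen;   // still max() if v is empty
        }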

    1 ulp (one unit in the last place of the mantissa, i.e. the LSB) is much larger than 1.0 at the max exponent, so the distance between representable floats is a large power of 2. But you still have 64 bits of relative precision. (Assuming -mlong-double-80). For numbers that large, +-1 is literally a rounding error. Every finite long double has the same number of significant mantissa bits and thus the same precision (except for subnormals aka denormals).

    The whole point of floating-point as opposed to fixed-point or integer is constant relative precision over the whole range; absolute precision (distance between representable values) scales with magnitude.
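
    One way to see that scaling directly is to measure the spacing with std::nextafter (a sketch; the exact values depend on which format long double is in your build):

        #include <cmath>
        #include <cstdio>
        #include <limits>

        int main() {
            long double m = std::numeric_limits<long double>::max();
            // Distance to the next representable value below max(): one ulp at the top exponent.
            std::printf("ulp near max(): %Lg\n", m - std::nextafter(m, 0.0L));
            // The same measurement near 1.0 gives epsilon.
            std::printf("ulp near 1.0:   %Lg\n", std::nextafter(1.0L, 2.0L) - 1.0L);
            std::printf("epsilon:        %Lg\n", std::numeric_limits<long double>::epsilon());
        }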

    If exact integer math is important to you, perhaps you should be using extended-precision integer types, like C23 _BitInt(1024) (available in C++ as an extension in some compilers, especially Clang++. IIRC, only Clang/LLVM has bit-widths greater than 128, and then only on some ISAs like x86-64). Or GNU C/C++ __int128 (Is there a 128 bit integer in gcc?), available in GCC/ICC as well as everything LLVM-derived like Clang. But not of course MSVC.
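
    For example, a sketch using GNU unsigned __int128 (GCC/Clang only; the loop count 26 matches where the question's loop saw FE_INEXACT), which keeps the product exact where long double has to round:

        #include <cstdio>

        int main() {
            unsigned __int128 exact = 1;           // GNU extension, not available with MSVC
            long double approx = 1.0L;
            for (unsigned i = 2; i <= 26; ++i) {
                exact *= i;
                approx *= i;
            }
            // 26! needs more than 64 significant bits (even after its trailing zero bits),
            // so a 64-bit significand has to round; the 128-bit integer stays exact (26! < 2^128).
            std::printf("26! as long double:      %.0Lf\n", approx);
            std::printf("26! exact, rounded once: %.0Lf\n", (long double)exact);
        }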

    Is there an easy way like this constexpr to find out the maximum value of the type without FE_INEXACT side effect?

    The point where you hit FE_INEXACT with a factorial is specific to that use case. For example, just multiplying powers of 2 you'd never see FE_INEXACT until the product overflows to infinity. Or with 1.0 / 3.0 you'd get FE_INEXACT with a number that's not big at all. Rounding is normal for most floating-point use-cases. If you want to detect it, that's what feclearexcept(); / math / fetestexcept(FE_INEXACT); is for.
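
    A sketch of the powers-of-2 case (depending on the compiler, strict FP-exception semantics may need something like #pragma STDC FENV_ACCESS ON or -frounding-math):

        #include <cfenv>
        #include <cstdio>
        #include <limits>

        int main() {
            std::feclearexcept(FE_ALL_EXCEPT);
            volatile long double x = 1.0L;
            while (x < std::numeric_limits<long double>::max() / 2.0L)
                x = x * 2.0L;                 // doubling an exact power of 2 never rounds
            std::printf("inexact before overflow: %s\n",
                        std::fetestexcept(FE_INEXACT) ? "yes" : "no");   // expect no
            x = x * 2.0L;                     // this one overflows, saturating to +infinity
            std::printf("overflow after one more: %s\n",
                        std::fetestexcept(FE_OVERFLOW) ? "yes" : "no");  // expect yes
        }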

    The value above which some integers aren't representable

    The largest value such that every integer between it and zero is exactly representable is pow(radix, digits) (I think).
    That's 2^24 for float (IEEE binary32), 2^53 for double (IEEE binary64), and 2^64 for the 80-bit x87 format. numeric_limits<T>::digits includes the leading mantissa bit that's implied by the exponent, or in the case of the 80-bit x87 format, stored explicitly. (https://en.cppreference.com/w/cpp/types/numeric_limits/digits)
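
    A sketch of computing that threshold portably for binary formats (assuming radix == 2, std::ldexp(1, digits) gives the same value as pow(radix, digits), but exactly and without raising FE_INEXACT):

        #include <cmath>
        #include <cstdio>
        #include <limits>

        int main() {
            std::printf("float:       2^%d = %.0f\n",
                        std::numeric_limits<float>::digits,
                        std::ldexp(1.0,  std::numeric_limits<float>::digits));
            std::printf("double:      2^%d = %.0f\n",
                        std::numeric_limits<double>::digits,
                        std::ldexp(1.0,  std::numeric_limits<double>::digits));
            std::printf("long double: 2^%d = %.0Lf\n",
                        std::numeric_limits<long double>::digits,
                        std::ldexp(1.0L, std::numeric_limits<long double>::digits));
        }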

    That's the largest value you could reach by adding 0.99 repeatedly, starting from 0.0. Eventually x + 0.99 == x and the increments will be lost to rounding error once the distance to the next representable value is 2. Before that, the rounding error is at most 0.01, rounding up to the next integer. (Adding 1.0 could round-to-even and get through the range of floats where the nearest representable values are 2.0 apart.) (0.99 isn't exactly representable; 1 - epsilon would be ideal. And of course, unless you have decades to wait for a computer to actually do it, this isn't practical except for confirming the cutoff is where you think it is.)
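
    At float scale (2^24 instead of 2^64) the cutoff is quick to confirm directly; a small sketch, assuming SSE math so float arithmetic really happens in float:

        #include <cstdio>

        int main() {
            float below = 16777214.0f;   // 2^24 - 2: spacing between floats here is 1.0
            float b2 = below + 0.99f;    // rounds up to 16777215, the increment survives
            float at = 16777216.0f;      // 2^24: spacing here is 2.0
            float a2 = at + 0.99f;       // the 0.99 is lost to rounding
            std::printf("%.2f -> %.2f\n", below, b2);
            std::printf("%.2f -> %.2f (unchanged: %s)\n", at, a2, a2 == at ? "yes" : "no");
        }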

    So for floating-point types, you could check is_iec559 (IEEE 754) and know that your assumptions are valid. pow(radix, digits) also assumes that the exponent range is large enough for 1 ulp = 1.0, but that's true for all mainstream FP formats you'll find on real hardware. I guess if that wasn't true, pow(radix, digits) would overflow to +Infinity if you do it in the FP type you're working with, but it's only plausible for narrow FP types with only a very few exponent bits in which case you could do the pow in float or double. Or integer with 1<<digits.

    How can I trust that constant, for example in checking for range / overflow error?

    It's not useful for checking for errors. I guess you could convert its value to another type for a compare before a conversion, to see if the value will fit. e.g. from an extended-precision integer type with many thousands of bits, like C23 _BitInt(20000).

    Other than that, max() isn't useful for such checks. Detect overflow by checking isfinite() on the result of whatever you were doing. Floating point math saturates to +-Infinity on overflow. In C++, I'd have to double check the rules for out-of-range conversions from other types to see if that's undefined behaviour, though. Comments welcome.
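
    A minimal sketch of that style of check:

        #include <cmath>
        #include <cstdio>
        #include <limits>

        int main() {
            long double big = std::numeric_limits<long double>::max();
            long double r = big * 2.0L;        // overflows and saturates to +infinity
            if (!std::isfinite(r))
                std::printf("overflow detected, result = %Lg\n", r);   // prints inf
        }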

    The format of long double can vary depending on compiler and options (e.g. gcc -mlong-double-64/80/128 selects between IEEE binary64 same as double, or the 80-bit x87 extended precision type, or a 128-bit type.) But whatever it is, numeric_limits<long double>::max() should be the largest finite value that type can represent, else the compiler's buggy. So its numeric value can change with compiler settings, but you can trust that it matches the range of long double in the same build of your code.
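
    A quick way to see which format a given build actually uses (a sketch; the numbers are whatever your compiler and flags produce):

        #include <cstdio>
        #include <limits>

        int main() {
            std::printf("sizeof(long double): %zu\n", sizeof(long double));
            std::printf("significand bits:    %d\n",  std::numeric_limits<long double>::digits);
            std::printf("max exponent:        %d\n",  std::numeric_limits<long double>::max_exponent);
            std::printf("max():               %Lg\n", std::numeric_limits<long double>::max());
        }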


    Range-check before casting an FP to integer

    See Handling overflow when casting doubles to integers in C for ways to deal with it. my_double <= SIZE_MAX is not safe, because conversion of SIZE_MAX to double or long double might not be exact. It can round up, allowing 2^64 to pass the check when it's not actually in range for size_t = uint64_t.
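
    A hedged sketch of a check that sidesteps that pitfall: compare against 2^64 itself, which is exactly representable, with a strict less-than (the function name is just for illustration, and it assumes a 64-bit size_t):

        #include <cstddef>

        // 18446744073709551616.0 is 2^64, exactly representable as double (and long double).
        // Anything strictly below it truncates to a value <= SIZE_MAX, so the cast is in range.
        bool fits_in_size_t(double d) {
            // NaN fails the first comparison, so it is rejected as well.
            return d >= 0.0 && d < 18446744073709551616.0;
        }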