c++ floating-point type-conversion ieee-754

Lossy conversion between long double and double


I was caught by surprise that the following code prints false with gcc 13 and clang 18. Why does this happen? Isn't the number 8.1 representable in both formats?

#include <iostream>
#include <iomanip>
int main()
{
    const long double val = 8.1L;
    const double val2 = static_cast<double>(val);
    const long double val3 = static_cast<long double>(val2);
    std::cout << std::boolalpha << (val == val3) << '\n';

    return 0;
}

Solution

  • Isn't the number 8.1 representable in both formats?

    No, 8.1 is not representable in any floating-point format that uses two for its base.

    If a number is representable in a floating-point format, it is representable in a form ±M•b^e, where b is the base used for the format and M and e are integers. There are bounds on M and e that depend on the format, and the M portion is often written as a fixed-point number instead of an integer, but this is just a scale change. Representation with an integer is mathematically equivalent.

    With base two, 8.125 can be represented as +65•2^−3, because 8.125•2^3 = 65. However, there is no power of two we can multiply 8.1 by to get an integer. 8.1•2 = 16.2. 16.2•2 = 32.4. 32.4•2 = 64.8. 64.8•2 = 129.6. 129.6•2 = 259.2. 259.2•2 = 518.4. You can see the digit after the decimal point is looping: .2, .4, .8, .6, .2, .4, .8, .6, .2,… It will never go away.

    8.1 in binary is 1000.00011001100110011001100110011001100110011001100110011…₂.

    The format commonly used for double, IEEE-754 binary64, has 53 bits for its significand. When 8.1 is converted to this format it will be rounded to 53 bits. (The result of this rounding is 8.0999999999999996447286321199499070644378662109375.)

    I do not know what format your implementation is using for long double, but it likely has considerably more than 53 bits, say p bits. When 8.1 is rounded to this format, it is rounded to p bits. Your long double variable val gets this rounded value.

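    When static_cast<double>(val) is evaluated, this value is rounded to 53 bits, and val2 gets that value. It is a different value from val. Converting val2 back to long double is exact, because every double value is representable in long double, so val3 equals val2 rather than val, and val == val3 is false.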