
C++ Portable Floating-Point Bit Representation?


Is there a C++ Standard-compliant way to determine the structure of a 'float', 'double', and 'long double' at compile time (or at run time, as an alternative)?

If I assume std::numeric_limits< T >::is_iec559 == true and std::numeric_limits< T >::radix == 2, I suspect this is possible by the following rules:

with the following expressions vaguely like:

except I know

Background: I'm trying to overcome two issues portably:


Solution

  • In short, no. If std::numeric_limits<T>::is_iec559 is true, then you know the format of T, more or less: you still have to determine the byte order. For anything else, all bets are off. (The other formats I know of that are still in use aren't even base 2: IBM mainframes use base 16, for example.) The "standard" arrangement of an IEC floating-point value has the sign in the high-order bit, then the exponent, and the mantissa in the low-order bits. If you can successfully view it as a uint64_t, for example (via memcpy, reinterpret_cast or a union; only memcpy is guaranteed to work, since the other two are formally undefined behavior in C++, and compilers generally optimize a small fixed-size memcpy into a plain move), then:

    for double:

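    // needs <cstdint> and <cstring>; valid for normal (non-zero, finite) values only
    uint64_t tmp;
    memcpy( &tmp, &theDouble, sizeof( double ) );
    bool     isNeg = (tmp & 0x8000000000000000ULL) != 0;
    // value == mant * 2^exp, with mant a 53-bit integer (implicit leading 1 restored)
    int      exp   = (int)( (tmp & 0x7FF0000000000000ULL) >> 52 ) - 1022 - 53;
    uint64_t mant  = (tmp & 0x000FFFFFFFFFFFFFULL) | 0x0010000000000000ULL;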
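
As a fuller illustration of the same decoding, here is a minimal sketch that wraps it in a function and adds the cases the snippet above skips (zero and subnormals have no implicit leading 1). The names DoubleParts, decompose and recompose are hypothetical, not from any library, and the code assumes double is IEEE 754 binary64 with the same byte order as uint64_t.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper type: value == (negative ? -1 : 1) * mantissa * 2^exponent.
struct DoubleParts {
    bool          negative;
    int           exponent;  // power of two applied to the integer mantissa
    std::uint64_t mantissa;  // 53-bit integer for normal values
};

// Splits a finite double into sign, integer mantissa and binary exponent.
// Assumption: double is IEEE 754 binary64 and shares uint64_t's byte order.
inline DoubleParts decompose(double value) {
    static_assert(std::numeric_limits<double>::is_iec559, "IEEE 754 double required");
    static_assert(sizeof(double) == sizeof(std::uint64_t), "64-bit double required");

    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);

    const int rawExponent = static_cast<int>((bits >> 52) & 0x7FF);
    const std::uint64_t fraction = bits & 0x000FFFFFFFFFFFFFULL;
    assert(rawExponent != 0x7FF);  // infinities and NaNs are not handled here

    DoubleParts parts;
    parts.negative = (bits >> 63) != 0;
    if (rawExponent == 0) {
        // Zero or subnormal: no implicit leading 1, exponent is fixed.
        parts.mantissa = fraction;
        parts.exponent = 1 - 1023 - 52;
    } else {
        // Normal value: restore the implicit leading 1 (bit 52).
        parts.mantissa = fraction | 0x0010000000000000ULL;
        parts.exponent = rawExponent - 1023 - 52;
    }
    return parts;
}

// Rebuilds the double from its parts; exact because the mantissa fits in 53 bits.
inline double recompose(const DoubleParts& parts) {
    const double magnitude =
        std::ldexp(static_cast<double>(parts.mantissa), parts.exponent);
    return parts.negative ? -magnitude : magnitude;
}
```

Round-tripping a value through decompose and recompose is a cheap way to check that the masks and the bias are right on a given platform.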
    

    for float:

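    // needs <cstdint> and <cstring>; valid for normal (non-zero, finite) values only
    uint32_t tmp;
    memcpy( &tmp, &theFloat, sizeof( float ) );
    bool     isNeg = (tmp & 0x80000000) != 0;
    // value == mant * 2^exp, with mant a 24-bit integer (implicit leading 1 restored)
    int      exp   = (int)( (tmp & 0x7F800000) >> 23 ) - 126 - 24;
    uint32_t mant  = (tmp & 0x007FFFFF) | 0x00800000;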
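
The shift counts and biases in the float and double snippets need not be hard-coded: once is_iec559 holds, they can be derived from std::numeric_limits at compile time. The sketch below is one way to do that; the IeeeLayout name is hypothetical, and it assumes an interchange format with an implicit leading mantissa bit and no padding bits (true for binary32 and binary64, not for the x87 long double).

```cpp
#include <cassert>
#include <climits>
#include <limits>

// Hypothetical traits class: field widths of T derived from numeric_limits.
// Assumption: T is an IEEE 754 interchange format (sign bit, exponent field,
// mantissa field with an implicit leading 1) that fills all sizeof(T) bytes.
template <typename T>
struct IeeeLayout {
    static_assert(std::numeric_limits<T>::is_iec559 &&
                      std::numeric_limits<T>::radix == 2,
                  "binary IEEE 754 type required");

    static constexpr int totalBits = static_cast<int>(sizeof(T) * CHAR_BIT);
    // digits counts the implicit leading bit, which is not stored.
    static constexpr int mantissaBits = std::numeric_limits<T>::digits - 1;
    // Whatever is left after the sign bit and the mantissa is the exponent.
    static constexpr int exponentBits = totalBits - 1 - mantissaBits;
    // max_exponent is one more than the largest unbiased exponent.
    static constexpr int exponentBias = std::numeric_limits<T>::max_exponent - 1;
};
```

For double this yields 52 mantissa bits, 11 exponent bits and a bias of 1023, which matches the masks used above (1022 + 53 is the bias plus the 52-bit mantissa width).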
    

    With regard to long double, it's worse, because different compilers treat it differently, even on the same machine. On x86, it's nominally ten bytes (the 80-bit extended format), but for alignment reasons, sizeof may in fact report 12 or 16. Or it may just be a synonym for double. If it's more than 10 bytes, I think you can count on the value being packed into the first 10 bytes, so that &myLongDouble gives the address of the 10-byte value. But generally speaking, I'd avoid long double.
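
To see which variant of long double a given compiler uses, the mantissa width reported by std::numeric_limits is usually enough. The following is a rough sketch with hypothetical function names; the list of widths covers the formats I am aware of and is not exhaustive.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <string>

// Hypothetical helper: names the long double format from its mantissa width.
inline std::string describeLongDouble() {
    switch (std::numeric_limits<long double>::digits) {
        case 53:  return "same precision as double (binary64)";
        case 64:  return "x87 80-bit extended";
        case 106: return "double-double";
        case 113: return "IEEE 754 binary128";
        default:  return "unrecognized format";
    }
}

// Hypothetical helper: storage bytes beyond the 10 the x87 format needs.
// Returns 0 for every other format, where the question does not apply.
inline std::size_t longDoublePaddingBytes() {
    const bool isX87Extended = std::numeric_limits<long double>::digits == 64;
    return isX87Extended ? sizeof(long double) - 10 : 0;
}
```

With the x87 format, a padding count of 2 or 6 corresponds to the 12-byte and 16-byte sizes mentioned above.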