I stumbled upon a failing unit test that involves converting doubles to integers. The actual number being converted is 1.234 * 1000, and the code basically boils down to:
#include <iostream>
#include <cstdint>

int64_t deltatime(double numSeconds) {
    return (int64_t) (numSeconds * 1000.0);
}

int main() {
    double s = 1.234;
    int64_t ms = deltatime(s);
    std::cout << s << "sec -> " << ms << "ms\n";
    return 0;
}
Now, compiling this for x86-64 with optimization disabled gives me:
$ g++ test.cpp && ./test
1.234000s -> 1234ms
Whereas compiling this for x86-32 (either using the -m32 flag, or natively in an x86-32 chroot/VM) gives me:
$ g++ -m32 test.cpp && ./test
1.234000s -> 1233ms
This is with g++ --version:
g++ (Debian 14.2.0-7) 14.2.0
I didn't use any -O optimization options, and none of the expressions have more than one constant, so GCC won't do compile-time evaluation (constant folding).
Now, I understand that the number 1.234 cannot be represented exactly in IEEE-754, e.g. in single-precision float it really is 1.2339999675750732421875, and similarly in double-precision it is 1.2339999999999999857891452848.
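(For reference, a quick way to see those exact stored values is to print them with more digits than the types can round-trip, for example:)

#include <iomanip>
#include <iostream>

int main() {
    float f = 1.234f;
    double d = 1.234;
    // Ask for far more digits than float/double can round-trip,
    // to expose the exact binary values that were actually stored.
    std::cout << std::setprecision(40) << f << "\n" << d << "\n";
    return 0;
}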
Now, multiplying the actual values by 1000.0 (which can be represented exactly) should always give me 1233.9999..., and casting this to int64_t would actually give 1233 (rather than the naively expected 1234).
But why, oh why, do I get 1234 on x86-64 (without specifying any rounding options for the compiler)?
Is the proper solution around this just adding 0.5 to the sum (before casting to int64_t)?
int64_t deltatime(double numSeconds) {
    return (int64_t) (numSeconds * 1000.0 + 0.5);
}
GCC targeting i386 defaults to -mfpmath=387, with FLT_EVAL_METHOD == 2 (the excess precision for temporaries being the 80-bit x87 type, with 64-bit mantissas).
GCC targeting x86-64 uses -mfpmath=sse with FLT_EVAL_METHOD == 0 (doubles are evaluated as double, rounding each temporary result to double precision).
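If you want to check what a given compiler/target combination does, FLT_EVAL_METHOD is exposed as a macro in <cfloat>; a minimal check looks something like:

#include <cfloat>
#include <iostream>

int main() {
    // 0: float/double are evaluated at their own precision (x86-64 SSE default)
    // 2: evaluation happens at long double precision (i386 x87 default)
    std::cout << "FLT_EVAL_METHOD = " << FLT_EVAL_METHOD << "\n";
    return 0;
}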
With -std=gnu++20, the default is -fexcess-precision=fast. With -std=c++20, that would set -fexcess-precision=standard, which is somewhat like -ffloat-store for runtime variables.

For single-precision float, https://www.h-schmidt.net/FloatConverter/IEEE754.html shows 1233.9999675750732421875 rounding up to 1234.0 if you enter that value. (That's the exact product which the FP multiplier has to round to fit in the output, producing some representable float. I got that input by manually moving the decimal place in your float value.) Presumably the same thing happens with double.
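It does: the correctly rounded double result of 1.234 * 1000.0 is exactly 1234.0, which is why the SSE version truncates to 1234. A quick way to check (the volatile is only there to keep the multiply from being constant-folded):

#include <iomanip>
#include <iostream>

int main() {
    volatile double s = 1.234;   // volatile so the multiply happens at run time
    double prod = s * 1000.0;    // rounded to the nearest double
    std::cout << std::setprecision(20) << prod << "\n";  // prints 1234 when rounded to double
    std::cout << (prod == 1234.0) << "\n";               // 1 on x86-64 / -mfpmath=sse
    return 0;
}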
(1000.0 isn't a power of 2, so multiplying by it changes the mantissa and thus requires rounding.)
But that doesn't happen with double(1.234) promoted to 80-bit: there are trailing zeros at the bottom of the mantissa of that 80-bit input, so multiplying by an exact 1000.0 doesn't round up all the way to 1234.0, so it still truncates to 1233. In GDB, I can see the product is st0 = 1233.99999999999998579 (raw 0x40099a3fffffffffff80).
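You can reproduce roughly the same thing on an SSE target by forcing the multiply to happen in long double (the 80-bit x87 type on x86 GCC targets); the 80-bit product stays just below 1234, so truncation gives 1233. A minimal sketch (deltatime_x87_like is just an illustrative name):

#include <cstdint>
#include <iostream>

int64_t deltatime_x87_like(double numSeconds) {
    // Promote to the 80-bit x87 type before multiplying,
    // mimicking what FLT_EVAL_METHOD == 2 does implicitly.
    return (int64_t)((long double)numSeconds * 1000.0L);
}

int main() {
    std::cout << deltatime_x87_like(1.234) << "\n";  // 1233 where long double is 80-bit
    return 0;
}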
If you'd started with auto s = (long double)1234.0 / 1000 and the function took a long double, that might also round up.
Since the cast to int64_t is in the same expression as the multiply, even strict ISO C++ rules allow FLT_EVAL_METHOD == 2 to keep excess precision there. (And even gcc -O0 still has the value in an x87 register, not storing/reloading like it would if you assigned it to a separate variable.)
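A sketch of that difference: assigning the product to a named double first makes gcc -O0 store and reload it, rounding the x87 temporary to double (exactly 1234.0) before the cast, so even -m32 then returns 1234 (deltatime_split is just an illustrative name):

#include <cstdint>

int64_t deltatime_split(double numSeconds) {
    // At -O0 this assignment stores/reloads as a 64-bit double,
    // rounding the x87 temporary to exactly 1234.0 before the cast.
    double ms = numSeconds * 1000.0;
    return (int64_t)ms;
}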
As we can see from the asm output, g++ -m32 does fmulp, then (after changing the x87 rounding mode) fistp, which rounds to integer using the current rounding mode. It doesn't store/reload as a qword (64-bit = double) between the multiply and the truncation.
GCC by default keeps excess precision even across statements (when optimization is enabled of course), not just within statements.
Is the proper solution around this just adding 0.5 to the sum (before casting to int64_t)?
What do you consider "proper"? What do you want your real code to do in your real use case? If you want it to truncate toward 0, adding 0.5 doesn't do that. It's closer to round-to-nearest, but it differs for the largest representable double below 0.5, which rounds up to 1.0 after adding 0.5 (because the distance between consecutive doubles grows with larger exponents), and it also behaves differently for negative values.
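A couple of quick illustrations (with made-up inputs) of how x + 0.5 then truncating differs from both truncation and round-to-nearest:

#include <cmath>
#include <cstdint>
#include <iostream>

int main() {
    // Largest double below 0.5: the exact sum 0.99999999999999994...
    // isn't representable and rounds up to exactly 1.0, so truncation gives 1.
    double almost_half = std::nextafter(0.5, 0.0);
    std::cout << (int64_t)(almost_half + 0.5) << "\n";  // 1, not 0

    // Negative values round the wrong way: -0.6 should round to -1,
    // but -0.6 + 0.5 is about -0.09999999999999998, which truncates to 0.
    std::cout << (int64_t)(-0.6 + 0.5) << "\n";         // 0, not -1
    return 0;
}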
If you want rounding, there are functions like long lrint(double) or long long llrint(double) which can (hopefully) compile to an SSE2 cvtsd2si, although you might need -fno-math-errno for GCC to inline them. Basically nobody checks errno after calling math functions like sqrt or nearbyint.
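A minimal sketch of the rounding approach with std::llrint (it rounds according to the current rounding mode, round-to-nearest-even by default):

#include <cmath>
#include <cstdint>

int64_t deltatime_rounded(double numSeconds) {
    // Round to nearest (ties to even in the default rounding mode)
    // instead of truncating toward zero.
    return std::llrint(numSeconds * 1000.0);
}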
Whatever you choose, you need to accept the fact that tiny differences in rounding can push a value across a threshold and make it round to a different integer. Different choices only change where the cutoffs are. If you don't like binary floating-point math, don't use it, e.g. use a decimal floating-point or fixed-point type or something.