I have this piece of C code that does stochastic rounding of a binary64 value to a binary32. Problem is I don't quite fully understand the code. I know it operates directly on the bits of a floating-point number, but I can't grasp what's happening. Could you please share some insight with me?
```c
float function(double x){
    uint64_t temp = *(uint64_t*)&x;
    uint32_t r = (rand() * (0xFFFFFFFF/RAND_MAX)) % 0x1FFFFFFF;
    temp += r;
    temp = temp & 0xFFFFFFFFE0000000;
    return (float)*(double *)&temp;
}
```
What do the bitmasks represent? (My intuition tells me it has to do with how the exponent and mantissa are represented in binary format, but I can't visualise it.)
Why is the random variable r calculated like that?
What would an iteration through the code look like?
`uint64_t temp = *(uint64_t*)&x;`
This is a bad attempt to get the bits that represent the double `x`. It is bad because it violates C's aliasing rule (C 2018 6.5 7). Correct code would be `uint64_t temp; memcpy(&temp, &x, sizeof temp);` or `uint64_t temp = (union { double d; uint64_t u; }) { x } .u;`. The former copies the bytes of `x` into `temp`, and the latter uses a compound literal to create a temporary object that is a union, which it uses to reinterpret the bits. Both of these are supported by the C standard.
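For example, the two forms can be checked against each other (this little self-check is my own illustration, not part of the original code):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: the two standard-sanctioned ways to read a double's bits. */
static uint64_t bits_via_memcpy(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);   /* copy the object representation */
    return u;
}

static uint64_t bits_via_union(double x)
{
    /* The compound literal creates a temporary union; reading .u
       reinterprets the bytes that were written through .d. */
    return (union { double d; uint64_t u; }) { x } .u;
}
```

Both return the same bit pattern for any `double`, and neither runs afoul of the aliasing rule.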
`uint32_t r = (rand() * (0xFFFFFFFF/RAND_MAX)) % 0x1FFFFFFF;`

The multiplication `* (0xFFFFFFFF/RAND_MAX)` attempts to scale the result of `rand` to the interval [0, 0xFFFFFFFF]. It may do so imperfectly. Then `% 0x1FFFFFFF` reduces this to the interval [0, 0x1FFFFFFF). Note the closing `)` versus `]`: this is a half-open interval that does not include 0x1FFFFFFF. There are some issues here:

- `& 0x1FFFFFFF` would have cleanly extracted the low 29 bits, producing a result in the fully closed interval [0, 0x1FFFFFFF]. Using `%` has a different result without apparent mathematical purpose and forces a time-consuming division.
- With either `%` or `&`, there is no apparent reason for scaling to 0xFFFFFFFF first; one might have gone directly for the desired final interval.
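One portable way to go directly for 29 random bits, without the scaling detour, is to accumulate `rand` output in 15-bit chunks (a sketch of mine; it assumes only the standard's guarantee that `RAND_MAX` is at least 32767):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch: draw 29 uniform random bits from rand().
   Assumes only RAND_MAX >= 32767, which the C standard guarantees,
   so each rand() call reliably supplies 15 bits. */
static uint32_t random29(void)
{
    uint32_t r = 0;
    for (int bits = 0; bits < 29; bits += 15)
        r = (r << 15) | ((uint32_t)rand() & 0x7FFF);
    return r & 0x1FFFFFFF;   /* keep exactly the low 29 bits */
}
```

This yields the fully closed interval [0, 0x1FFFFFFF] with no division, though the quality is still only as good as `rand` itself.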
temp += r;
This adds the random number to the low bits of the `double`. Some of the time, it will cause a carry into the upper bits. (If the upper bits are all 1s, this can also carry into the exponent field.)
temp = temp & 0xFFFFFFFFE0000000;
This clears the low 29 bits. In the IEEE-754 binary32 and binary64 formats commonly used for `float` and `double`, the `float` significand has 24 bits (with 23 encoded in the main significand field), and the `double` significand has 53 bits (52 encoded in the main significand field), so the difference is 29. Thus, clearing the low 29 bits in the encoding of a `double` will produce a number exactly representable as a `float`, if the exponent is within `float` range.
The goal of clearing these bits is likely to prevent a second rounding upward during the conversion to `float`, coming below. The add in the previous line, `temp += r;`, may have caused a carry into the upper bits of the significand, so the intent is likely to ensure the number is only ever increased by one unit, not two.
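To see that 53 − 24 = 29 bits is the right amount to clear, here is a small self-check (my own illustration, under the same IEEE-754 binary32/binary64 assumption the text makes):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: after clearing the low 29 bits of a double's encoding, at
   most 24 significand bits remain (23 explicit + the implicit bit),
   so the conversion to float is exact for in-range exponents.
   Assumes IEEE-754 binary32/binary64. */
static int low29_cleared_is_exact(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFFFFFE0000000u;       /* clear the low 29 bits      */

    double truncated;
    memcpy(&truncated, &bits, sizeof truncated);

    float f = (float)truncated;        /* should round-trip exactly  */
    return (double)f == truncated;
}
```

For any finite `double` whose exponent is within `float` range, this returns 1, confirming that the mask leaves nothing for the `float` conversion to round away.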
return (float)*(double *)&temp;
As with the initial line above, this is a bad attempt to reinterpret the bits, here as a `double`. (After which it is cast to `float`, which is unnecessary from standard C alone, since the operand of a `return` statement is automatically converted to the return type of the function, but, if strict code checking is used, it could silence a warning about a narrowing conversion.) Correct code would be `memcpy(&x, &temp, sizeof x); return x;` or `return (union { uint64_t u; double d; }) { temp } .d;`.
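Putting the corrections together, a sketch of the whole routine might look like this. The name `sr_double_to_float` is mine, and the single-call random draw assumes `RAND_MAX` is at least 0x1FFFFFFF (true with glibc, but not guaranteed by the standard):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the routine with the aliasing violations repaired via
   memcpy. The stochastic-rounding logic is unchanged: add 29 random
   bits, then truncate the low 29 bits, so the value rounds up with
   probability proportional to the discarded fraction.
   Assumes RAND_MAX >= 0x1FFFFFFF (an assumption, not portable). */
float sr_double_to_float(double x)
{
    uint64_t temp;
    memcpy(&temp, &x, sizeof temp);          /* get the bits safely  */

    uint32_t r = (uint32_t)rand() & 0x1FFFFFFF;  /* low 29 random bits */
    temp += r;                               /* may carry upward      */
    temp &= 0xFFFFFFFFE0000000u;             /* truncate low 29 bits  */

    double rounded;
    memcpy(&rounded, &temp, sizeof rounded);
    return (float)rounded;                   /* conversion is exact   */
}
```

Each call returns one of the two `float` values bracketing `x` (or `x` itself when it is exactly representable), which is the point of stochastic rounding.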