I have this piece of C code that does stochastic rounding of a binary64 value to a binary32. Problem is I don't quite fully understand the code. I know it operates directly on the bits of a floating-point number, but I can't grasp what's happening. Could you please share some insight with me?
```c
float function(double x){
    uint64_t temp = *(uint64_t*)&x;
    uint32_t r = (rand() * (0xFFFFFFFF/RAND_MAX)) % 0x1FFFFFFF;
    temp += r;
    temp = temp & 0xFFFFFFFFE0000000;
    return (float)*(double *)&temp;
}
```
What do the bitmasks represent? (My intuition tells me it has to do with how the exponent and mantissa are represented in binary format, but I can't visualise it.)
Why is the random variable r calculated like that?
What would an iteration through the code look like?
`uint64_t temp = *(uint64_t*)&x;`
This is a bad attempt to get the bits that represent the double `x`. It is bad because it violates C's aliasing rule (C 2018 6.5 7). Correct code would be `uint64_t temp; memcpy(&temp, &x, sizeof temp);` or `uint64_t temp = (union { double d; uint64_t u; }) { x } .u;`. The former copies the bytes of `x` into `temp`, and the latter uses a compound literal to create a temporary object that is a union, which it uses to reinterpret the bits. Both of these are supported by the C standard.
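For example, the two forms can be checked against each other (this little self-check is my own illustration, not part of the original code):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: the two standard-sanctioned ways to read a double's bits. */
static uint64_t bits_via_memcpy(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);   /* copy the object representation */
    return u;
}

static uint64_t bits_via_union(double x)
{
    /* The compound literal creates a temporary union; reading .u
       reinterprets the bytes that were written through .d. */
    return (union { double d; uint64_t u; }) { x } .u;
}
```

Both return the same bit pattern for any `double`, and neither runs afoul of the aliasing rule.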
`uint32_t r = (rand() * (0xFFFFFFFF/RAND_MAX)) % 0x1FFFFFFF;`

The multiplication `* (0xFFFFFFFF/RAND_MAX)` attempts to scale the result of `rand` to the interval [0, 0xFFFFFFFF]. It may do so imperfectly. Then `% 0x1FFFFFFF` reduces this to the interval [0, 0x1FFFFFFF). Note the closing `)` versus `]`: this is a half-open interval that does not include 0x1FFFFFFF. There are some issues here:

- `& 0x1FFFFFFF` would have cleanly extracted the low 29 bits, producing a result in the fully closed interval [0, 0x1FFFFFFF]. Using `%` has a different result without apparent mathematical purpose and forces a time-consuming division.
- With either `%` or `&`, there is no apparent reason for scaling to 0xFFFFFFFF first; one might have gone directly for the desired final interval.
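One portable way to go directly for 29 random bits, without the scaling detour, is to accumulate `rand` output in 15-bit chunks (a sketch of mine; it assumes only the standard's guarantee that `RAND_MAX` is at least 32767):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sketch: draw 29 uniform random bits from rand().
   Assumes only RAND_MAX >= 32767, which the C standard guarantees,
   so each rand() call reliably supplies 15 bits. */
static uint32_t random29(void)
{
    uint32_t r = 0;
    for (int bits = 0; bits < 29; bits += 15)
        r = (r << 15) | ((uint32_t)rand() & 0x7FFF);
    return r & 0x1FFFFFFF;   /* keep exactly the low 29 bits */
}
```

This yields the fully closed interval [0, 0x1FFFFFFF] with no division, though the quality is still only as good as `rand` itself.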
temp += r;
This adds the random number to the low bits of the `double`. Some of the time, it will cause a carry into the upper bits. (If the upper bits are all 1s, this can also carry into the exponent field.)
temp = temp & 0xFFFFFFFFE0000000;
This clears the low 29 bits. In the IEEE-754 binary32 and binary64 formats commonly used for `float` and `double`, the `float` significand has 24 bits (with 23 encoded in the main significand field), and the `double` significand has 53 bits (52 encoded in the main significand field), so the difference is 29. Thus, clearing the low 29 bits in the encoding of a `double` will produce a number exactly representable as a `float`, if the exponent is within `float` range.
The goal of clearing these bits is likely to prevent a second rounding upward during the conversion to `float`, coming below. The add in the previous line, `temp += r;`, may have caused a carry into the upper bits of the significand, so the intent is likely to ensure the number is only ever increased by one unit, not two.
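To see that 53 − 24 = 29 bits is the right amount to clear, here is a small self-check (my own illustration, under the same IEEE-754 binary32/binary64 assumption the text makes):

```c
#include <stdint.h>
#include <string.h>

/* Sketch: after clearing the low 29 bits of a double's encoding, at
   most 24 significand bits remain (23 explicit + the implicit bit),
   so the conversion to float is exact for in-range exponents.
   Assumes IEEE-754 binary32/binary64. */
static int low29_cleared_is_exact(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFFFFFE0000000u;       /* clear the low 29 bits      */

    double truncated;
    memcpy(&truncated, &bits, sizeof truncated);

    float f = (float)truncated;        /* should round-trip exactly  */
    return (double)f == truncated;
}
```

For any finite `double` whose exponent is within `float` range, this returns 1, confirming that the mask leaves nothing for the `float` conversion to round away.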
return (float)*(double *)&temp;
As with the initial line above, this is a bad attempt to reinterpret the bits, here as a `double`. (After which it is cast to `float`, which is unnecessary from standard C alone, since the operand of a `return` statement is automatically converted to the return type of the function, but, if strict code checking is used, it could silence a warning about a narrowing conversion.) Correct code would be `memcpy(&x, &temp, sizeof x); return x;` or `return (union { uint64_t u; double d; }) { temp } .d;`.
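Putting the corrections together, a sketch of the whole routine might look like this. The name `sr_double_to_float` is mine, and the single-call random draw assumes `RAND_MAX` is at least 0x1FFFFFFF (true with glibc, but not guaranteed by the standard):

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the routine with the aliasing violations repaired via
   memcpy. The stochastic-rounding logic is unchanged: add 29 random
   bits, then truncate the low 29 bits, so the value rounds up with
   probability proportional to the discarded fraction.
   Assumes RAND_MAX >= 0x1FFFFFFF (an assumption, not portable). */
float sr_double_to_float(double x)
{
    uint64_t temp;
    memcpy(&temp, &x, sizeof temp);          /* get the bits safely  */

    uint32_t r = (uint32_t)rand() & 0x1FFFFFFF;  /* low 29 random bits */
    temp += r;                               /* may carry upward      */
    temp &= 0xFFFFFFFFE0000000u;             /* truncate low 29 bits  */

    double rounded;
    memcpy(&rounded, &temp, sizeof rounded);
    return (float)rounded;                   /* conversion is exact   */
}
```

Each call returns one of the two `float` values bracketing `x` (or `x` itself when it is exactly representable), which is the point of stochastic rounding.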