I'm trying to understand how NumPy implements round-to-nearest-even when converting to a lower-precision format, in this case float32 to float16, and specifically the case where the number is normal in float32 but rounds to a subnormal in float16.
Link to the code: https://github.com/numpy/numpy/blob/13a5c4e569269aa4da6784e2ba83107b53f73bc9/numpy/core/src/npymath/halffloat.c#L244-L365
My understanding is as follows. In float32, the number has these bits:
31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
s | e0 | e1 | e2 | e3 | e4 | e5 | e6 | e7 | m0 | m1 | m2 | m3 | m4 | m5 | m6 | m7 | m8 | m9 | m10 | m11 | m12 | m13 | m14 | m15 | m16 | m17 | m18 | m19 | m20 | m21 | m22 |
/*
* If the last bit in the half significand is 0 (already even), and
* the remaining bit pattern is 1000...0, then we do not add one
* to the bit after the half significand. However, the (113 - f_exp)
* shift can lose up to 11 bits, so the || checks them in the original.
* In all other cases, we can just add one.
*/
if (((f_sig&0x00003fffu) != 0x00001000u) || (f&0x000007ffu)) {
    f_sig += 0x00001000u;
}
The above code is used when breaking ties to nearest even. I don't understand why, in the second part of the logical OR, we bitwise AND against `0x0000'07ffu` (bits m12-m22) and not `0x0000'0fffu` (bits m11-m22).
Once we've aligned the mantissa bits to be in the subnormal format for float16 (which is what the bit-shifting before this piece of code does), in the float32 number representation above we'd have m10-m22 deciding which direction to round.
My understanding is that the second part of the OR checks whether the number is larger than the halfway point and, if it is, adds a one to the half-significand bit. But with the original number, isn't it only checking a subset of the numbers that are above the halfway point? In the float16 number, m9 would be the last bit of precision that remains. So we'll round up if:
- m9 is 1, m10 is 1, and m11-m22 are all 0 (the first part of the OR)
- m10 is 1 and at least one of m11-m22 is 1 (which puts the number above the halfway point)

This can be simplified to adding 1 to m10 if any of m11-m22 is 1: if m10 was already 1, the addition carries into m9; otherwise m9 is unaffected. But in the NumPy code, the bits checked are m12-m22.
I'm not sure what I'm missing here. Is this a special case scenario?
I was expecting bits m11-m22 to be the ones that decide whether to add 1, not m12-m22.
`f_sig` contains a significand-in-preparation for the binary16 result. (binary16 is the IEEE-754 name for what some people call a “half precision” floating-point format.) At this point, the code needs the significand bits in bits 22:13, because it is later going to shift them by 13 more bits, putting them in 9:0. In preparation for this, it shifted the bits according to the exponent. That shifted some bits out of `f_sig`.
Now it wants to test whether the low bit of the new significand (now in bit 13) is 0, the highest of the bits below the significand (in bit 12) is 1, and all the remaining bits are 0. Some of those remaining bits are in bits 11:0 of `f_sig`. But some of them may be gone. The shift according to the exponent shifted some of them out. So, to test whether those bits are 0, we look at them in the original significand in `f`.
Since the exponent shift shifted out at most 11 bits, we only have to look at the low 11 bits of `f`. The other bits of the original significand are still present in `f_sig`.
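To make the "at most 11 bits" point concrete, here is a minimal sketch of which bits the exponent shift discards. The helper name `bits_lost_by_shift` is made up for illustration and is not part of NumPy; the sketch assumes the biased float32 exponent is in the range 102 to 112, which is the range that reaches this branch with a shift of 1 to 11.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not NumPy code): returns the significand bits that
 * the exponent-dependent shift throws away, for a float32 bit pattern f
 * whose biased exponent is in 102..112. */
static uint32_t bits_lost_by_shift(uint32_t f)
{
    uint32_t f_exp = (f & 0x7f800000u) >> 23;
    assert(f_exp >= 102u && f_exp <= 112u);  /* precondition of this sketch */

    uint32_t shift = 113u - f_exp;           /* between 1 and 11 */
    uint32_t full_sig = 0x00800000u + (f & 0x007fffffu);

    /* The low `shift` bits fall off the end of f_sig. */
    return full_sig & ((1u << shift) - 1u);
}
```

Because the shift never exceeds 11, the returned value always fits inside the mask `0x000007ffu`. Bit 11 of `f` is never lost: with the largest shift it lands in bit 0 of `f_sig`, where the left operand of the `||` still sees it.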
So, in `((f_sig&0x00003fffu) != 0x00001000u) || (f&0x000007ffu)`, the left operand of `||` tests the original significand bits that are in `f_sig`, and the right operand tests the original significand bits that are in `f`. There may be some overlap; the latter may test some bits that are also in `f_sig`, but that does not matter.
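The following is a sketch of just the subnormal branch, modeled on the snippet quoted in the question, so the two operands can be tried on concrete bit patterns. The function name `float_bits_to_half_subnormal` is hypothetical, and the sketch assumes the input's biased exponent is in 102 to 112; it omits the other branches and the floating-point status flags of the real routine.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the subnormal branch only; the name is hypothetical.
 * f is a float32 bit pattern whose biased exponent is in 102..112. */
static uint16_t float_bits_to_half_subnormal(uint32_t f)
{
    uint16_t h_sgn = (uint16_t)((f & 0x80000000u) >> 16);
    uint32_t f_exp = (f & 0x7f800000u) >> 23;
    assert(f_exp >= 102u && f_exp <= 112u);  /* precondition of this sketch */

    /* Restore the implicit leading 1, then align for a subnormal result. */
    uint32_t f_sig = 0x00800000u + (f & 0x007fffffu);
    f_sig >>= (113u - f_exp);

    /* Left operand: bits 13:0 that are still in f_sig.
     * Right operand: the low 11 bits of f, which the shift above may
     * have discarded. */
    if (((f_sig & 0x00003fffu) != 0x00001000u) || (f & 0x000007ffu)) {
        f_sig += 0x00001000u;
    }
    return (uint16_t)(h_sgn + (f_sig >> 13));
}
```

For example, `0x33000000u` (exactly 2^-25, halfway between zero and the smallest binary16 subnormal) yields `0x0000`, while `0x33000001u` yields `0x0001`. In the second case the extra bit has already been shifted out of `f_sig`, so only the right operand can detect it.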
> My understanding is that the second part of the OR checks whether the number is larger than the halfway point and, if it is, adds a one to the half-significand bit.
No, it is not checking that. The test is true if and only if the trailing portion is not exactly ½ the least significant bit (LSB) of the new significand or the least significant bit is 1.
The reasoning is this:
`f_sig += 0x00001000u;` adds ½ the LSB, and the significand is later truncated at the LSB (`f_sig >> 13`). This provides the desired rounding in most cases: Adding ½ to trailing portions less than ½ does not carry, and adding ½ to trailing portions more than ½ does carry.
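The add-half-then-truncate idea can be isolated in a small sketch. The helper `round_shift13_ties_even` and its `sticky` parameter are names invented here for illustration; `sticky` stands in for "some lower-order bits were discarded before `sig` was formed", which is the role `f&0x000007ffu` plays in the NumPy condition.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: divide sig by 2^13, rounding to nearest with ties
 * to even. `sticky` is nonzero if nonzero bits below sig were dropped
 * earlier. */
static uint32_t round_shift13_ties_even(uint32_t sig, uint32_t sticky)
{
    assert(sig <= 0x00ffffffu);  /* 24-bit significand, so the add cannot overflow */

    /* Exact tie with an even LSB: bit 13 is 0, bit 12 is 1, bits 11:0 are 0,
     * and nothing nonzero was discarded below bit 0. */
    int exact_tie_and_even = ((sig & 0x00003fffu) == 0x00001000u) && !sticky;

    if (!exact_tie_and_even) {
        sig += 0x00001000u;      /* add half an LSB */
    }
    return sig >> 13;            /* truncate at the LSB */
}
```

A tail below ½ LSB does not carry (`0x0fff` gives 0), a tail above ½ LSB does (`0x1001` gives 1), and an exact tie goes to the even neighbor: `0x1000` stays at 0, while `0x3000` carries up to 2.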