
Why do we need both a round bit and a sticky bit in IEEE 754 floating point implementations?


In my university lecture we just learnt about IEEE 754 arithmetic using the following table:

Guard  Round  Sticky  Result
0      x      x       Round down (do nothing to significand)
1      1      x       Round up
1      0      1       Round up
1      0      0       Round significand to even low digit
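For reference, the table can be written as a small decision function. This is only a sketch of the table as given in the lecture; the function name `round_up` and the `lsb` parameter (the lowest kept significand bit, needed for the round-to-even case) are illustrative assumptions, not part of IEEE 754.

```python
def round_up(lsb: int, guard: int, round_bit: int, sticky: int) -> bool:
    """Return True if the significand should be incremented (table above).

    lsb is the lowest kept bit of the significand (hypothetical parameter,
    used only to resolve the exact-tie row toward an even low digit).
    """
    if guard == 0:
        # Discarded part is below half an LSB: round down.
        return False
    if round_bit or sticky:
        # Discarded part is above half an LSB: round up.
        return True
    # Exact tie: round so that the low digit becomes even.
    return lsb == 1


print(round_up(lsb=0, guard=1, round_bit=0, sticky=1))
```

Note that `round_bit` and `sticky` only ever appear OR-ed together here, which is the observation behind the question.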

As the table shows, the round bit and the sticky bit could simply be merged into a single bit (set whenever either of the two is set), which would yield the same results.

So my question is: why do we need both?


Solution

  • Three bits are needed because a normalization in subtraction can cause a shift left, leaving only two bits to indicate rounding. Note that guard, round, and sticky bits are a feature of implementations of addition and subtraction; they are not specified in IEEE 754.

    Consider this subtraction in a format with four-bit significands:

     1.000×2^5
    −1.001×2^1
    

    Our addition/subtraction hardware has three extra bits, to be the guard, round, and sticky bits:

           GRS
     1.000 000×2^5
    −1.001 000×2^1
    

    We start by shifting the second operand right to have the same exponent as the first. Bits shift through the guard and round positions normally, but, once any 1 bit shifts into the sticky position, that position stays 1 for any further right shifts. So we have:

           GRS
     1.000 000×2^5
    −0.000 101×2^5
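The alignment step can be sketched in a few lines. This assumes the operand is held as a plain integer whose lowest three bits are the guard, round, and sticky positions; the helper name `shift_right_sticky` is made up for illustration.

```python
def shift_right_sticky(value: int, shift: int) -> int:
    """Shift right one bit at a time, keeping the lowest bit sticky.

    The lowest bit of value is the sticky position: once a 1 reaches it,
    it is OR-ed back in on every further shift and so never disappears.
    """
    for _ in range(shift):
        sticky = value & 1
        value = (value >> 1) | sticky
    return value


# 1.001 000 shifted right by 4 places (exponent 1 aligned to exponent 5).
aligned = shift_right_sticky(0b1001000, 4)
print(format(aligned, "07b"))  # 0000101, i.e. 0.000 101
```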
    

    Then we subtract:

           GRS
     1.000 000×2^5
    −0.000 101×2^5
    --------------
     0.111 011×2^5

    This result is not normalized (it does not start with 1), so we need to shift it left, giving:

     1.110 11 ×2^4
    

    That shift left is what the guard bit was guarding against. Now the remaining two bits tell us to round up. If we did not have the third bit, we would have only the single 1 bit after the significand, which would represent exactly ½ the LSB and be insufficient to distinguish between the less than ½, exactly ½, and greater than ½ cases.
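To see the difference concretely, here is a rough sketch of the whole subtraction with a configurable number of extra bits. It is a toy model under several assumptions (positive operands, the first larger than the second, round-to-nearest-even, significands stored as integers such as `0b1001` for 1.001); the function name and parameters are hypothetical and it is not meant to describe any particular hardware.

```python
def subtract_and_round(sig_a, exp_a, sig_b, exp_b, extra_bits=3, sig_bits=4):
    """Compute a - b for a > b > 0 with exp_a >= exp_b.

    Returns (significand, exponent) after rounding to nearest, ties to even.
    """
    a = sig_a << extra_bits
    b = sig_b << extra_bits

    # Align b to a's exponent; the lowest extra bit behaves as the sticky bit.
    for _ in range(exp_a - exp_b):
        b = (b >> 1) | (b & 1)

    diff = a - b
    exp = exp_a
    if diff == 0:
        return 0, 0

    # Normalize: shift left until the leading 1 is back in the top position.
    top = 1 << (sig_bits + extra_bits - 1)
    while diff < top:
        diff <<= 1
        exp -= 1

    sig = diff >> extra_bits                 # kept significand bits
    rest = diff & ((1 << extra_bits) - 1)    # bits below the LSB
    half = 1 << (extra_bits - 1)             # exactly half an LSB

    if rest > half or (rest == half and sig & 1):
        sig += 1
        if sig == 1 << sig_bits:             # rounding carried out of the top
            sig >>= 1
            exp += 1
    return sig, exp


# Three extra bits: 1.111 x 2^4 (30), the nearest value to the exact 29.75.
print(subtract_and_round(0b1000, 5, 0b1001, 1, extra_bits=3))
# Two extra bits: the leftover looks like an exact tie, giving 1.110 x 2^4 (28).
print(subtract_and_round(0b1000, 5, 0b1001, 1, extra_bits=2))
```

With only two extra bits the single remaining 1 is indistinguishable from an exact tie, so the sketch rounds to even and lands on the wrong neighbour.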

    Note a subtraction can require more than one bit of left shift, as in:

           GRS
     1.000 000×2^5
    −0.111 100×2^5
    --------------
     0.000 100×2^5

    However, this occurs only if the exponents of the two operands differ by at most one. In that case at most one bit was shifted into the guard, round, and sticky positions, so all further bits are known to be zero and we do not need additional hardware to record them.
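That claim can be checked by brute force for the four-bit format used here. The sketch below works with exact integers rather than guard, round, and sticky bits, and reports the largest exponent difference for which the result needs a left shift of two or more places; the function name and the `max_diff` bound are arbitrary choices made for illustration.

```python
def max_exp_diff_needing_big_shift(sig_bits=4, max_diff=8):
    """Largest exponent difference at which a - b needs >= 2 left shifts.

    Tries every pair of normalized sig_bits-bit significands and every
    exponent difference from 0 to max_diff, using exact integer arithmetic.
    Returns -1 if no case needs two or more left shifts.
    """
    lo, hi = 1 << (sig_bits - 1), 1 << sig_bits
    worst = -1
    for d in range(max_diff + 1):
        for sig_a in range(lo, hi):
            for sig_b in range(lo, hi):
                a = sig_a << d          # a expressed at b's scale, exactly
                diff = a - sig_b
                if diff <= 0:
                    continue            # only consider a > b
                left_shifts = a.bit_length() - diff.bit_length()
                if left_shifts >= 2:
                    worst = max(worst, d)
    return worst


print(max_exp_diff_needing_big_shift())  # 1
```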

    (I adapted this example from this course handout by David A. Wood and Ramkumar Ravikumar.)