ieee-754

IEEE Floating-Point Number Bound for (b-a)+a, where 0=<a<=b


It is a specific niche question, hence stated fully in the title:

Question

Given two non-negative numbers a and b, where a is less or equal to b, I care in whether y as per the following algorithm is less or equal to b.

Algorithm:

x = b-a;
y = x+a;

Is y<=b in general so long as 0<=a<=b ?

Remarks


Solution

  • Notation

    a and b are the values of a and b, respectively, and similarly for x and y. b is the next value representable in the floating-point format less than b, and b+ is the next value representable that is greater than b.

    Mathematics in plain font represents real-number mathematics. ba is the unrounded real-number result of subtraction. Mathematics in code font represents floating-point operations. b-a is the result of performing floating-point subtraction, equal to ba rounded to a representable value according to a selected rounding method.

    False With Round Upward

    yb is clearly false if round upward (round toward +∞) is used:

    Let b be 1 and a be ½(bb). (I assume the exponent range of the format suffices to represent a.) Then ba is not representable and is between b and b. With rounding upward, x = b-a produces b. Then x+a is between b and b+, so, with rounding upward, y = x+a produces b+. This violates yb.

    True With Round Downward or Toward Zero

    With round downward (toward −∞) or toward zero, we consider two cases: a is zero, and a is not zero (0 < ab).

    With a = 0, x = b-a produces b, and y = x+a produces b, and yb is satisfied.

    With a not zero, x = b-a produces some value less than or equal to ba, due to the rounding direction. Then y = x+a produces some value less than or equal to x+aba+a = b, so yb holds.

    False With Round To Nearest

    Using IEEE-754 binary32 for float with round-to-nearest, ties-to-even, this code produces a y greater than b:

    float b = 0x.FFFFFFp0;
    float a = 0x.0000018p0;
    float x = b-a;
    float y = x+a;
    

    Explanation: Let b be 1−2−24. (This is the representable value immediately preceding 1, equal to 0.FFFFFF16.) Let a be 1½•2−24 (equivalently 1.5•2−24, 3•2−25, or 0.000001816).

    ba = 1−2−24 − 3•2−25 = 1-5•2−25 = 0.FFFFFD816, which is not representable in binary32. (In binary, its leading 1 bit is at 2−1, and its trailing 1 bit is at 2−25, which span more than the 24 bits in the significand of the binary32 format.) It is halfway between the adjacent representable values 1−3•2−24 = 0.FFFFFD16 and 1−2•2−24 = 0.FFFFFE16. Since these are equidistant, rounding using round-to-nearest, ties-to-even produces the one with the even low digit, 0.FFFFFE16 = 1−2•2−24. So x = b-a sets x to this value.

    Now consider y = x+a. x+@a is 1−2•2−24 + 3•2−25 = 1−2•2−25 = 0.FFFFFF816. This is not representable. It is midway between the adjacent representable values 1−2−24 = 0.FFFFFF16 and 1 = 1.0000016. Again, the one with the even low digit is used, so y = x+a produces 1. Since y is 1 and b is 1−2−24, yb is false.