It is a specific niche question, hence stated fully in the title:
Given two non-negative numbers a
and b
, where a
is less or equal to b
, I care in whether y
as per the following algorithm is less or equal to b
.
Algorithm:
x = b-a;
y = x+a;
Is y<=b
in general so long as 0<=a<=b
?
float
and double
. I care in any IEEE-compliant number.a and b are the values of a
and b
, respectively, and similarly for x and y. b− is the next value representable in the floating-point format less than b, and b+ is the next value representable that is greater than b.
Mathematics in plain font represents real-number mathematics. b−a is the unrounded real-number result of subtraction. Mathematics in code font
represents floating-point operations. b-a
is the result of performing floating-point subtraction, equal to b−a rounded to a representable value according to a selected rounding method.
y
≤ b
is clearly false if round upward (round toward +∞) is used:
Let b be 1 and a be ½(b−b−). (I assume the exponent range of the format suffices to represent a.) Then b−a is not representable and is between b− and b. With rounding upward, x = b-a
produces b
. Then x+a is between b and b+, so, with rounding upward, y = x+a
produces b+. This violates y
≤ b
.
With round downward (toward −∞) or toward zero, we consider two cases: a
is zero, and a
is not zero (0 < a
≤ b
).
With a
= 0, x = b-a
produces b
, and y = x+a
produces b
, and y
≤ b
is satisfied.
With a
not zero, x = b-a
produces some value less than or equal to b−a, due to the rounding direction. Then y = x+a
produces some value less than or equal to x+a ≤ b−a+a = b, so y
≤ b
holds.
Using IEEE-754 binary32 for float
with round-to-nearest, ties-to-even, this code produces a y
greater than b
:
float b = 0x.FFFFFFp0;
float a = 0x.0000018p0;
float x = b-a;
float y = x+a;
Explanation: Let b be 1−2−24. (This is the representable value immediately preceding 1, equal to 0.FFFFFF16.) Let a be 1½•2−24 (equivalently 1.5•2−24, 3•2−25, or 0.000001816).
b−a = 1−2−24 − 3•2−25 = 1-5•2−25 = 0.FFFFFD816, which is not representable in binary32. (In binary, its leading 1 bit is at 2−1, and its trailing 1 bit is at 2−25, which span more than the 24 bits in the significand of the binary32 format.) It is halfway between the adjacent representable values 1−3•2−24 = 0.FFFFFD16 and 1−2•2−24 = 0.FFFFFE16. Since these are equidistant, rounding using round-to-nearest, ties-to-even produces the one with the even low digit, 0.FFFFFE16 = 1−2•2−24. So x = b-a
sets x
to this value.
Now consider y = x+a
. x+@a is 1−2•2−24 + 3•2−25 = 1−2•2−25 = 0.FFFFFF816. This is not representable. It is midway between the adjacent representable values 1−2−24 = 0.FFFFFF16 and 1 = 1.0000016. Again, the one with the even low digit is used, so y = x+a
produces 1. Since y
is 1 and b
is 1−2−24, y
≤ b
is false.