floating-point precision rounding-error epsilon

How is machine epsilon related to floating point rounding error?


I have read that floating-point numbers may encounter rounding due to the finite number of bits in the mantissa. Also, I have read that machine epsilon represents a floating-point value such that 1 + epsilon is equal to the next representable number.

a) However, I am not able to see the connection between epsilon and rounding error. Can anyone explain in layman's terms how they are related?

b) Does epsilon depend only on the bits used for the mantissa, or do the bits used for the exponent also play a part in determining epsilon?

c) How is 1/3 represented in floating-point? I understand that it is 0.3 (recurring) in decimal, but what comes next? How will it be represented in binary form?


Solution

  • I have read that floating-point numbers may encounter rounding due to the finite number of bits in the mantissa.

    Every fixed-size numerical format may encounter errors due to a finite number of bits, including integer formats, floating-point formats, fixed-point formats, and rational formats. Real-number arithmetic has infinitely many results that can be obtained by arithmetic operations, so any format with a finite number of bits cannot represent all possible results.

    The preferred term for the fraction portion of a floating-point number is “significand.” “Mantissa” is an old term for the fraction portion of a logarithm. Mantissas are logarithmic (adding something to a mantissa multiplies the represented number). Significands are linear (adding something to the significand adds to the represented number, scaled by the exponent but not affected by the original significand).

    Also, I have read that machine epsilon represents a floating-point value such that 1 + epsilon is equal to the next representable number.

    That is a common but incorrect, or at least poorly stated, definition of the machine epsilon. More on that below.

    One form of floating-point representation is M•b^e, where:

    • M, the significand, is an integer whose magnitude is less than 2^24 (for the IEEE-754 binary32 format),
    • b, the base, is 2 for binary formats, and
    • e, the exponent, is an integer in [−149, 104] for binary32.

    Another form of floating-point representation is ±M•b^e, where:

    • M, the significand, has one binary digit before the point and 23 after, so 0 ≤ M < 2 (and 1 ≤ M < 2 for normal numbers),
    • b is 2,
    • e is an integer in [−126, 127] for binary32, and
    • ± is a sign.

    These two forms are equivalent: Every number representable in either format is representable in the other. To change between forms for the binary32 format, divide the M of the first format by 2^23 (thus producing a binary numeral with one digit before its point and 23 after) and add 23 to e (thus compensating for the division and producing a new e in the required interval), or do the reverse operations for the reverse direction.

    We can freely choose which form we use, and the two forms are useful for different purposes. The latter is most commonly seen in specifications, but the former is useful for some numerical analysis, which is why I introduce it. Also, sometimes the sign is written as a power of −1: (−1)^s•M•b^e, where the sign bit, s, is 0 or 1.
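
    To make the two forms concrete, here is a minimal C sketch, assuming an IEEE-754 binary32 float and the standard frexpf/ldexpf functions, that prints 1.0f in both forms (expected output shown in the comments):

        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
            float x = 1.0f;

            /* Second form: x = M * 2^e with 1 <= M < 2.  frexpf returns a
               significand in [0.5, 1), so shift by one to match that convention. */
            int e;
            float m = frexpf(x, &e);                  /* x == m * 2^e, 0.5 <= m < 1 */
            printf("second form: %.17g * 2^%d\n", m * 2, e - 1);   /* 1 * 2^0 */

            /* First form: x = M * 2^e with M an integer of magnitude below 2^24. */
            printf("first form:  %.0f * 2^%d\n",
                   ldexpf(m * 2, 23), e - 1 - 23);    /* 8388608 * 2^-23 */
            return 0;
        }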

    Now we can consider rounding errors that occur in arithmetic.

    Suppose we do some operation, whether addition, subtraction, multiplication, division, or something else, and it has a real-number-arithmetic result x, and we want to fit it into the floating-point format. Let’s write x = M•b^e, where M = x and e = 0. Then we can adjust this by multiplying or dividing M by b until it is as large as possible without going over the bound for M. We also adjust e to compensate for however many times we multiplied or divided by b. Then we have a new M and a new e, still with x = M•b^e.

    If this new M is an integer and e is in bounds, we are done; we have x in the required form. x is representable in the floating-point format, and the operation can produce this result with no rounding error.

    Since the question is about rounding error, I will assume e is within bounds and remains so. If it goes out of bounds, that results in overflow or underflow, but I will not discuss those exceptional results.

    If M is not an integer, x cannot be represented in the floating-point format. Since M is as large as we can make it by adjusting e without going out of the bounds for the significand, there is no way to eliminate the fraction part of M and make M an integer, as the form requires. The best result we can produce is to round M to the nearest integer.
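
    As a rough illustration of this scale-and-round process, here is a C sketch. It uses a double as a stand-in for the exact real-number result (so it is not a substitute for correctly rounded arithmetic in every case) and ignores overflow and underflow, as noted above; the helper name to_binary32 is just for this example:

        #include <math.h>
        #include <stdio.h>

        /* Round a real value (approximated here by a double) to binary32 by hand:
           scale the significand so its integer part uses 24 bits, round it to an
           integer, then scale back.  Overflow and underflow are ignored. */
        static float to_binary32(double x)
        {
            int e;
            double m = frexp(x, &e);                  /* x == m * 2^e, 0.5 <= |m| < 1 */
            double rounded = nearbyint(ldexp(m, 24)); /* scale up, round to an integer */
            return (float)ldexp(rounded, e - 24);     /* scale back down */
        }

        int main(void)
        {
            printf("by hand: %.25f\n", to_binary32(1.0 / 3.0));
            printf("by cast: %.25f\n", (float)(1.0 / 3.0));   /* same value */
            return 0;
        }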

    There are several rules available for rounding, including:

    • round to nearest, with ties going to the candidate whose low digit is even (the IEEE-754 default),
    • round toward zero (truncation),
    • round toward +∞ (round up), and
    • round toward −∞ (round down).

    Now we are ready to answer this question:

    a) However, I am not able to see the connection between epsilon and rounding error. Can anyone explain in layman's terms how they are related?

    How big can the rounding error be? With the first rounding rule, to-nearest, the farthest we will ever move M to make it an integer is ½. That is because, for each number between two successive integers n and n+1, either the distance from the number to n is ½ or less, or the distance from the number to n+1 is ½ or less (or both). (You can only go halfway into a forest, because, after that, you are going out.)

    Therefore, with round-to-nearest, the maximum rounding error is ½ of the unit of M. Recall that M is scaled by b^e, so the absolute error is at most ½•b^e. b^e is called the Unit of Least Precision (ULP), so the maximum absolute error is ½ ULP. Note that the ULP varies depending on the number represented: it is scaled by b^e, so the ULP of 1 is different from the ULP of 128 or the ULP of 1/256.
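
    A small C sketch, assuming a binary32 float and the standard nextafterf function, shows the ULP scaling with magnitude; the helper ulp_of here simply measures the gap from x to the next representable float above it (expected values in the comments):

        #include <math.h>
        #include <stdio.h>

        /* Gap from x up to the next representable float, i.e. the ULP of x
           in the sense used above. */
        static float ulp_of(float x)
        {
            return nextafterf(x, INFINITY) - x;
        }

        int main(void)
        {
            printf("ULP(1)     = %g\n", ulp_of(1.0f));        /* 2^-23, about 1.19e-07 */
            printf("ULP(128)   = %g\n", ulp_of(128.0f));      /* 2^-16, about 1.53e-05 */
            printf("ULP(1/256) = %g\n", ulp_of(1.0f / 256));  /* 2^-31, about 4.66e-10 */
            return 0;
        }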

    With the other rounding rules, we might adjust M by an amount up to 1. For example, if the true result has a fraction portion of .99, and we are rounding toward zero, we would adjust M by .99 to get to the next integer in the direction of zero.

    The machine epsilon is the ULP of 1.

    In the first floating-point form for binary32, 1 is represented as 8,388,608•2^−23. So the ULP of this is 1•2^−23. In the second form, 1 is represented as 1.00000000000000000000000₂•2^0. Adjusting it upward by 1 ULP produces 1.00000000000000000000001₂•2^0, which is a difference of 0.00000000000000000000001₂•2^0, which equals 2^−23.

    Now we can see why the definition in the question is wrong or poorly stated. It is often presented as: the epsilon is the smallest e such that the computed result of 1+e is not 1, meaning it is the first number such that adding it to 1 yields the next number after 1. In that presentation, it is wrong. To see why, consider adding 1 (8,388,608•2^−23) and 3•2^−25. The real-number result is 8,388,608.75•2^−23. To get a representable result, we have to round that to 8,388,609•2^−23, thus getting the next representable value greater than 1. But the ULP of 1 should be 2^−23, which is 4•2^−25, which is greater than 3•2^−25. So, by that definition, the epsilon would not be the smallest number such that adding it to 1 produces the next representable number. It is not the definition we want.

    It is true that the machine epsilon is the smallest number e such that the real-number-arithmetic result of 1+e is the next representable number. But, if we are stating the definition this way, we have to be clear that it is the real-number result of 1+e that we want, not the computed floating-point result, due to the rounding above. We can say the machine epsilon is the difference between 1 and the next representable value greater than 1.
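
    Here is a short C sketch of both points, assuming a binary32 float, FLT_EPSILON from <float.h>, and C99 hex floating-point literals and output: the epsilon matches the gap between 1 and the next float, yet adding the smaller value 3•2^−25 to 1 still rounds up to that next float (expected output in the comments):

        #include <float.h>
        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
            float next  = nextafterf(1.0f, 2.0f);  /* next representable float above 1 */
            float small = 3 * 0x1p-25f;            /* 3 * 2^-25, smaller than epsilon  */
            float sum   = 1.0f + small;            /* real result is 1 + 0.75 * 2^-23  */

            printf("FLT_EPSILON = %a\n", FLT_EPSILON);  /* 0x1p-23 */
            printf("next - 1    = %a\n", next - 1.0f);  /* also 0x1p-23 */
            printf("small       = %a\n", small);        /* 0x1.8p-24 */
            printf("1 + small   = %a\n", sum);          /* 0x1.000002p+0, the next float */
            return 0;
        }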

    b) Does epsilon depend only on the bits used for the mantissa, or do the bits used for the exponent also play a part in determining epsilon?

    As you can see above, when we use the first form, where M is an integer, the ULP is just the scaling part, b^e. But to know what the exponent is, we have to calculate M according to the available bounds. As you saw above, to represent 1 in that form, we had to scale it, to get 8,388,608•2^−23, from which we saw the ULP is 2^−23. In the second form, we could use 1 directly as the significand without adjusting it, representing 1 as 1.00000000000000000000000₂•2^0. But then, to calculate the ULP, we have to figure out the position value of the low bit of M. So, either way, both the width of the significand and the value of the exponent come into play in determining the ULP.

    c) How is 1/3 represented in floating-point? I understand that it is 0.3 (recurring) in decimal, but what comes next? How will it be represented in binary form?

    ⅓ cannot be represented in a binary floating-point format. If we scale it by the largest power of two that keeps it under 2^24, we get ⅓ = 11,184,810⅔•2^−25. The closest we can get in the binary32 format is to round that to 11,184,811•2^−25, which is 0.3333333432674407958984375.
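
    For a concrete check, here is a small C sketch, assuming a binary32 float and C99 hex floating-point output; the hex form makes the rounded, repeating significand pattern visible (expected output in the comments):

        #include <stdio.h>

        int main(void)
        {
            float third = 1.0f / 3.0f;

            /* The hex form shows the stored significand: 5 is 0101 in binary, so the
               repeating ...010101 pattern is visible, with the last digit rounded up. */
            printf("hex form:      %a\n", third);      /* 0x1.555556p-2 */
            printf("decimal value: %.25f\n", third);   /* 0.3333333432674407958984375 */
            return 0;
        }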