precision number-systems

Floating point representation in number system



I don't know how to tackle this question. I know about the explicit, implicit, and IEEE-754 normalized representations of floating-point numbers, but I don't see how to break this into smaller problems. Please help me visualize it.


Solution

  • Let's assume IEEE-754 single-precision floats. Such a number has a 24-bit significand (23 stored bits plus the implicit leading 1), which works out to about 7 decimal digits of precision. After that you're into the floating-point wilderness.

    What do I mean? Well, let's say I've got the number 7654321. I can convert this to a 32-bit floating-point value and get back that exact number. Once a number needs more significant digits than a float can hold, I start to lose precision, i.e. the low-order digits fall off the end of my floating-point number and get lost.

    Consider the following:

    #include <stdio.h>
    
    int main(void)
      {
      /* f1 has 7 significant decimal digits, f2 has 9 */
      float f1 = 7654321, f2 = 987654321;
    
      printf("f1 = %f   f2 = %f\n", f1, f2);
      return 0;
      }
    

    When I run this I get

    f1 = 7654321.000000   f2 = 987654336.000000
    

    Hopefully you saw that and said, "Say WHAT?!?!". What happened to f2?

    As I said, 32-bit floats only have about 7 (decimal) digits of precision. If you try to put a number with more than seven significant digits into a 32-bit floating-point variable you lose precision: the low-order digits get rounded away.

    So let's consider the values in your problem:

    A =  2.0 * 10^30
    B = -2.0 * 10^30
    C = 1.0
    

    and you're supposed to figure out what you get when you perform the calculations

    X = A + B
    X = X + C
    

    and

    Y = A + C
    Y = Y + B
    

    All right, let's start with the first. Substituting in values we get

    X = A + B = (2.0 * 10^30) + (-2.0 * 10^30)
    

    A and B each get rounded to the nearest representable float, but B is exactly the negative of A, so X will now be exactly zero. Then we have

    X = X + C
    

    So, substituting values we get

    X = 0.0 + 1.0
    

    so X should end up with 1.0.

    OK, that was kind of fun. Now let's look at the Y calculations, which are really the same as the X calculations, just rearranged a bit:

    Y = 2.0 * 10^30 + 1.0
    

    which should give us the result 2.0 * 10^30. Huh? WHY?!? Well, 2*10^30 is a 31-digit number, far more than the roughly 7 digits a float can preserve. Near 2*10^30 the neighbouring float values are about 1.5 * 10^23 apart, so adding 1.0 rounds straight back to the same value and does not change it. So at this point Y = 2.0 * 10^30. We then add B = -2.0 * 10^30 to it, and we get - yep, zero.

    So you end up with X = 1.0 and Y = 0.0, even though if you performed these calculations in your head, without considering the precision limitations of floating-point numbers in a computer, you'd get 1.0 for both of them.

    The intended lesson here is that when you're dealing with floating point values the order of operations matters a great deal, and you have to consider the magnitude of the values you're working with carefully to plan your calculations so you don't end up with numeric mush.

    And BTW, here's a little program to implement your assignment:

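    #include <stdio.h>
    
    int main(void)
      {
      /* float literals (f suffix), so no pow() or <math.h> is needed */
      float A = 2.0e30f, B = -2.0e30f, C = 1.0f;
      float X, Y;
      
      /* X: the two huge values cancel first, then C is added to zero */
      X = A + B;
      X = X + C;
      
      /* Y: C is added to a huge value first and gets rounded away */
      Y = A + C;
      Y = Y + B;
      
      printf("X = %f   Y = %f\n", X, Y);
      return 0;
      }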
    

    Run it and it prints

      X = 1.000000   Y = 0.000000
    
