floating-point precision processor operations

What's the difference between a single precision and double precision floating point operation?


What is the difference between a single precision floating point operation and a double precision floating point operation?

I'm especially interested in the practical implications for video game consoles. For example, does the Nintendo 64 have a 64-bit processor, and if so, does that mean it was capable of double precision floating point operations? Can the PS3 and Xbox 360 perform double precision floating point operations, or only single precision? And in general use, are the double precision capabilities (if they exist) actually used?


Solution

  • Note: the Nintendo 64 does have a 64-bit processor, however:

    Many games took advantage of the chip's 32-bit processing mode as the greater data precision available with 64-bit data types is not typically required by 3D games, as well as the fact that processing 64-bit data uses twice as much RAM, cache, and bandwidth, thereby reducing the overall system performance.

    From Webopedia:

    The term double precision is something of a misnomer because the precision is not really double.
    The word double derives from the fact that a double-precision number uses twice as many bits as a regular floating-point number.
    For example, if a single-precision number requires 32 bits, its double-precision counterpart will be 64 bits long.

    The extra bits increase not only the precision but also the range of magnitudes that can be represented.
    The exact amount by which the precision and range of magnitudes are increased depends on what format the program is using to represent floating-point values.
    Most computers use a standard format known as the IEEE floating-point format.

    The IEEE double-precision format actually has more than twice as many bits of precision as the single-precision format, as well as a much greater range.
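As a rough illustration of the difference in precision and range, the sketch below rounds Python floats (which are IEEE doubles on common platforms) to single precision by packing them into 4 bytes. The helper name `to_single` is hypothetical and only used for this example.

```python
import struct
import sys


def to_single(x: float) -> float:
    """Round a double to the nearest single-precision value and return it as a double."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Storage cost: a single takes 4 bytes, a double takes 8.
SINGLE_BYTES = struct.calcsize("<f")
DOUBLE_BYTES = struct.calcsize("<d")

# 2**24 + 1 needs 25 significant bits. A double (53 bits of precision) holds it
# exactly, but a single (24 bits of precision) has to round it.
big = 16777217.0
print(big, "->", to_single(big))

# 0.1 is not exactly representable in either format, and the two
# formats round it to different nearby values.
print(0.1, "->", to_single(0.1))

# Range: largest finite single (bit pattern 0x7F7FFFFF) versus largest finite double.
MAX_SINGLE = struct.unpack(">f", bytes.fromhex("7f7fffff"))[0]
MAX_DOUBLE = sys.float_info.max
print(MAX_SINGLE, MAX_DOUBLE)
```

The round trip through `struct` is only a convenient way to get single-precision rounding in pure Python; on real hardware the rounding happens in the floating point unit itself.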

    From the IEEE standard for floating point arithmetic:

    Single Precision

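    The IEEE single precision floating point standard representation requires a 32 bit word, whose bits may be numbered from 0 to 31, left to right. The first bit is the sign bit, S, the next eight bits are the exponent bits, E, and the final 23 bits are the fraction, F.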

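    The value V represented by the word may be determined as follows:

    • If E=255 and F is nonzero, then V=NaN ("Not a number")
    • If E=255 and F is zero and S is 1, then V=-Infinity
    • If E=255 and F is zero and S is 0, then V=Infinity
    • If 0<E<255 then V=(-1)**S * 2**(E-127) * (1.F), where "1.F" is the binary number formed by prefixing F with an implicit leading 1 and a binary point
    • If E=0 and F is nonzero, then V=(-1)**S * 2**(-126) * (0.F); these are "unnormalized" values
    • If E=0 and F is zero and S is 1, then V=-0
    • If E=0 and F is zero and S is 0, then V=0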

    In particular,

    0 00000000 00000000000000000000000 = 0
    1 00000000 00000000000000000000000 = -0
    
    0 11111111 00000000000000000000000 = Infinity
    1 11111111 00000000000000000000000 = -Infinity
    
    0 11111111 00000100000000000000000 = NaN
    1 11111111 00100010001001010101010 = NaN
    
    0 10000000 00000000000000000000000 = +1 * 2**(128-127) * 1.0 = 2
    0 10000001 10100000000000000000000 = +1 * 2**(129-127) * 1.101 = 6.5
    1 10000001 10100000000000000000000 = -1 * 2**(129-127) * 1.101 = -6.5
    
    0 00000001 00000000000000000000000 = +1 * 2**(1-127) * 1.0 = 2**(-126)
    0 00000000 10000000000000000000000 = +1 * 2**(-126) * 0.1 = 2**(-127) 
    0 00000000 00000000000000000000001 = +1 * 2**(-126) * 
                                         0.00000000000000000000001 = 
                                         2**(-149)  (Smallest positive value)
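The bit patterns listed above follow the usual IEEE 754 single-precision decoding: an exponent field of 255 means infinity or NaN, an exponent field of 0 means zero or a denormalized value, and anything else is a normalized value with an implicit leading 1. A minimal sketch of that decoding, assuming the input is a string of 32 binary digits in S, E, F order (the function name `decode_single` is hypothetical):

```python
import math


def decode_single(bits: str) -> float:
    """Decode a 32-digit binary string laid out as S (1 bit), E (8 bits), F (23 bits).

    Spaces between the fields are ignored, so the patterns can be pasted
    in the same form as the examples in the text.
    """
    bits = bits.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")

    s = int(bits[0], 2)      # sign bit
    e = int(bits[1:9], 2)    # biased exponent field
    f = int(bits[9:], 2)     # fraction field as an integer, 0 .. 2**23 - 1
    sign = -1.0 if s else 1.0

    if e == 255:
        # All-ones exponent: infinity when F is zero, NaN otherwise.
        return sign * math.inf if f == 0 else math.nan
    if e == 0:
        # Zero exponent: (0.F) * 2**(-126); F == 0 gives +0 or -0.
        return sign * math.ldexp(f, -126 - 23)
    # Normalized: (1.F) * 2**(E - 127), with the hidden leading 1 restored.
    return sign * math.ldexp((1 << 23) + f, e - 127 - 23)


print(decode_single("0 10000001 10100000000000000000000"))
print(decode_single("0 00000000 00000000000000000000001") == 2.0 ** -149)
```

Working with the fraction as an integer and scaling by a power of two with `math.ldexp` keeps every step exact, which is why the results can be compared with `==`.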
    

    Double Precision

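    The IEEE double precision floating point standard representation requires a 64 bit word, whose bits may be numbered from 0 to 63, left to right. The first bit is the sign bit, S, the next eleven bits are the exponent bits, E, and the final 52 bits are the fraction, F.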

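    The value V represented by the word may be determined as follows:

    • If E=2047 and F is nonzero, then V=NaN ("Not a number")
    • If E=2047 and F is zero and S is 1, then V=-Infinity
    • If E=2047 and F is zero and S is 0, then V=Infinity
    • If 0<E<2047 then V=(-1)**S * 2**(E-1023) * (1.F), where "1.F" is the binary number formed by prefixing F with an implicit leading 1 and a binary point
    • If E=0 and F is nonzero, then V=(-1)**S * 2**(-1022) * (0.F); these are "unnormalized" values
    • If E=0 and F is zero and S is 1, then V=-0
    • If E=0 and F is zero and S is 0, then V=0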
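For comparison with the single-precision case, here is a small sketch that splits a double into its three fields (1 sign bit, 11 exponent bits with a bias of 1023, 52 fraction bits) and rebuilds a normalized value from them. The helper names `double_fields` and `normalized_value` are hypothetical, and the second one deliberately handles only the normalized case 0 < E < 2047.

```python
import math
import struct


def double_fields(x: float):
    """Return (S, E, F) taken from the 64-bit IEEE representation of x."""
    (word,) = struct.unpack(">Q", struct.pack(">d", x))
    s = word >> 63                 # 1 sign bit
    e = (word >> 52) & 0x7FF       # 11 exponent bits, biased by 1023
    f = word & ((1 << 52) - 1)     # 52 fraction bits
    return s, e, f


def normalized_value(s: int, e: int, f: int) -> float:
    """Rebuild V = (-1)**S * 2**(E-1023) * (1.F) for a normalized double."""
    if not 0 < e < 2047:
        raise ValueError("only normalized values are handled in this sketch")
    sign = -1.0 if s else 1.0
    return sign * math.ldexp((1 << 52) + f, e - 1023 - 52)


s, e, f = double_fields(6.5)
print(s, e, format(f, "052b"))
print(normalized_value(s, e, f))
```

The value 6.5 has the same leading fraction bits (101) in both formats; only the exponent bias and the field widths change.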

    Reference:
    ANSI/IEEE Standard 754-1985,
    Standard for Binary Floating Point Arithmetic.


    From the cs.uaf.edu notes on the IEEE Floating Point Standard (the "fraction" is generally referred to as the mantissa):

    The single precision IEEE FPS format is composed of 32 bits, divided into a 23 bit mantissa, M, an 8 bit exponent, E, and a sign bit, S:

    S (bit 31) | E (bits 23-30) | M (bits 0-22)

    • The normalized mantissa, m, is stored in bits 0-22 with the hidden bit, b0, omitted.
      Thus M = m-1.

    • The exponent, e, is represented as a bias-127 integer in bits 23-30.
      Thus, E = e+127.

    • The sign bit, S, indicates the sign of the mantissa, with S=0 for positive values and S=1 for negative values.

    Zero is represented by E = M = 0.
    Since S may be 0 or 1, there are different representations for +0 and -0.
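The relations M = m-1 and E = e+127 can be turned into a small encoder. This is only a sketch under stated assumptions: the input must be zero or a normalized value that is exactly representable in single precision, so there is no rounding and no handling of denormalized values, infinities, or NaN. The name `encode_single` is hypothetical.

```python
import math


def encode_single(x: float) -> int:
    """Return the 32-bit word S | E | M for x.

    Assumes x is zero or a normalized, exactly representable single.
    """
    s = 1 if math.copysign(1.0, x) < 0 else 0   # copysign also detects -0.0
    if x == 0:
        return s << 31                          # zero is E = M = 0; S picks +0 or -0

    frac, exp = math.frexp(abs(x))  # abs(x) == frac * 2**exp with 0.5 <= frac < 1
    m = frac * 2                    # normalized mantissa, 1 <= m < 2
    e = exp - 1                     # true exponent, so abs(x) == m * 2**e
    if not -126 <= e <= 127:
        raise ValueError("outside the normalized single-precision range")

    E = e + 127                     # bias-127 exponent field
    M = int((m - 1) * (1 << 23))    # hidden bit dropped, 23 stored bits remain
    return (s << 31) | (E << 23) | M


print(format(encode_single(6.5), "032b"))
print(hex(encode_single(0.0)), hex(encode_single(-0.0)))
print(0.0 == -0.0)
```

The last two lines show the point about signed zero: +0 and -0 have different bit patterns, yet they compare as equal in ordinary arithmetic.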