Tags: floating-point, precision, fixed-point, single-precision

Fixed-point instead of floating point


How many bits does a fixed-point number need to be at least as precise as a floating-point number? If I wanted to carry out calculations in fixed-point arithmetic instead of floating-point, how many bits would I need for the calculations to be no less precise?

A single-precision (32-bit) float can represent numbers as small as 2^-126 and as large as 2^127. Does that mean the fixed-point number has to be at least in 128.128 format (128 bits for the integer part, 128 bits for the fractional part)?

I understand that single-precision floats can only represent about 7 significant decimal digits at a time; I'm asking about covering all possible values.

And what about double precision (64-bit floats): does it really take a 1024.1024 format to be equally precise?


Solution

  • For single precision, you would need to store bits with place values in the range [2^-149, 2^128), which would require a signed 128.149 fixed-point type, totaling a width of 278 bits. (The smallest positive value is not 2^-126 but the subnormal 2^-149, and the largest finite value is just below 2^128.)

    For double precision, you would need to store bits with place values in the range [2^-1074, 2^1024), which would require a signed 1024.1074 fixed-point type, totaling a width of 2099 bits. (A short sketch of how these widths follow from the format parameters appears after this answer.)

    (Disclaimer: This all assumes I've made an even number of off-by-one errors.)
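
    As a rough check of the arithmetic, here is a minimal Python sketch, assuming only the standard IEEE-754 field sizes (8/23 exponent/fraction bits for single, 11/52 for double); the function name fixed_point_width and its parameters are mine, not from any library:

        def fixed_point_width(exponent_bits, fraction_bits):
            # Derive the fixed-point widths needed to represent every finite
            # value of an IEEE-754 binary format exactly.
            bias = 2 ** (exponent_bits - 1) - 1          # 127 (single), 1023 (double)
            # Largest finite value is just below 2^(bias + 1), so the integer
            # part needs bias + 1 bits: 128 (single), 1024 (double).
            integer_bits = bias + 1
            # Smallest positive subnormal is 2^(1 - bias - fraction_bits),
            # i.e. 2^-149 (single) or 2^-1074 (double), so the fractional
            # part needs bias - 1 + fraction_bits bits.
            fractional_bits = bias - 1 + fraction_bits
            total_bits = 1 + integer_bits + fractional_bits   # +1 for the sign bit
            return integer_bits, fractional_bits, total_bits

        print(fixed_point_width(8, 23))    # single precision: (128, 149, 278)
        print(fixed_point_width(11, 52))   # double precision: (1024, 1074, 2099)

    Running it prints (128, 149, 278) and (1024, 1074, 2099), matching the widths quoted above.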