I want to find the number of mantissa digits and the unit round-off on a particular computer. I understand what these are, but I have no idea how to find them, though I understand they can vary from computer to computer.
I need these numbers in order to perform certain aspects of numerical analysis, like analyzing errors.
What I'm currently thinking is that I could write a small C++ program to slowly increment a number until overflow occurs, but I'm not sure what type of number to use.
Am I on the right track? How exactly does one go about calculating this?
I would think that whatever language you were using would specify how floats were stored. I know Java does this by use of a specific IEEE standard (754, I think).
If it's not specified, I would think you could just do your own check by adding 0.5 to 1 to see if the actual number changes. If it does, then add 0.25 to 1, then 0.125 to 1, and so on until the number doesn't change, something like:
float a = 1;
float b = 0.5;
int bits = 0;
while (a + b != a) {
    bits = bits + 1;
    b = b / 2;
}
If you only had 3 mantissa bits, then 1 + 1/16 would be equal to 1.
Then you've exhausted your mantissa bits.
You might actually need the base number to be 2 rather than 1, since IEEE754 uses an implied '1+' at the start.
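To make the 3-bit example concrete, here is the arithmetic written out. It assumes that "3 mantissa bits" means three stored fraction bits after the implied leading 1, and that sums are rounded to nearest with ties going to the even neighbour (the IEEE 754 default). Both are assumptions on my part rather than something the example states.

```latex
\begin{align*}
1 + \tfrac{1}{2}  &= 1.100_2  && \text{representable, so } a + b \neq a \\
1 + \tfrac{1}{4}  &= 1.010_2  && \text{representable} \\
1 + \tfrac{1}{8}  &= 1.001_2  && \text{representable (uses the last fraction bit)} \\
1 + \tfrac{1}{16} &= 1.0001_2 && \text{halfway between } 1.000_2 \text{ and } 1.001_2,\ \text{rounds to } 1.000_2
\end{align*}
```

Under those assumptions the loop stops at b = 1/16 after counting three bits, and that final b = 2^{-4} is the unit round-off of this toy format.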
EDIT:
It appears the method described above may have some issues as it gives 63 bits for a system that clearly has 4-byte floats.
Whether that's to do with intermediate results (I doubt it, since the same code with explicit casts [while ((float)(a + b) != (float)(a))] has similar problems) or (more likely, I believe) the possibility that the unit value a can be represented with bits closer to the fractional b by adjusting the exponent, I don't yet know.
For now, it's best to rely on the language information I mentioned above such as use of IEEE754 (if that information is available).
I'll leave the problematic code in as a trap for unwary players. Maybe someone with more floating point knowledge than I can leave a note explaining why it acts strangely (no conjecture, please :-).
EDIT 2:
This piece of code fixes it by ensuring intermediates are stored in floats. Turns out Jonathan Leffler was right - it was intermediate results.
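#include <stdio.h>
#include <float.h>

int main(void) {
    float a = 1;
    float b = 0.5;
    float c = a + b;   /* storing the sum in a float rounds it to float precision */
    int bits = 1;      /* start at 1 so the implied leading bit is counted */
    while (c != a) {
        bits = bits + 1;
        b = b / 2;
        c = a + b;
    }
    printf("%d\n", FLT_MANT_DIG);  /* value from the header */
    printf("%d\n", bits);          /* value measured by the loop */
    return 0;
}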
This code outputs (24,24) to show that the calculated value matches the one in the header file.
Whilst written in C, it should be applicable to any language (specifically one where the information isn't available in a header or specified in the language documentation). I only tested in C because Eclipse takes so long to start on my Ubuntu box :-).