c++cmathapproximationfixed-point

Fast inverse square root using fixed point instead of floating point


I am trying to implement Fast Inverse Square Root for a fixed point number, but I'm not getting anywhere.

I am trying to follow exactly the same principle as the article, except instead of writing the number in the floating point format x = (-1) ^ s * (1 + M) * 2 ^ (E-127), I am using the format x = M * 2 ^ -16, which is a 32-bit fixed point number with 16 decimal bits and 16 fractional bits.

The problem is that I cannot find the value of the "magic constant". According to my calculations, it doesn’t exist, but I’m not a mathematician and I think I’m doing everything wrong.

To solve Y = 1 / sqrt (x), I used the following reasoning (I don't know if it is correct).

In the original code we have that Y0 for approximation of newton is given by:

i = 0x5f3759df - (i >> 1);

Which means that we will have as a result a floating point number given by:

y0 = (1 + R2 - M / 2) * 2 ^ (R1 - E / 2);

This is because the operation >> divides exponent and mantissa by 2, and then we perform a subtraction of the numbers as integers.

Following the steps shown in the article, I set the format of x to:

x = M * 2 ^ -16

In an attempt to perform the same logic, I try to define Y0 for:

Y0 = (R2 - M / 2) * 2 ^ (R1 - (-16/2));

I'm trying to find a number, which can minimize the error given by:

error = (Y - Y0) / Y

Regardless of the value of R1, I can do shift operations to correct the exponent value of my final result, having the correct result at a fixed point.

Where am I wrong?


Solution

  • It can't be done.

    The fast inverse sqrt is due to the floating point representation, that has already split the number into powers of two (exponent) and the significant.

    It can be done.

    With the same tricks as done for floating points, it's possible to convert your fixed point into 2^exp * x. Given uint32_t a, uint8_t exp = bias- builtin_count_leading_zeros(a); uint32_t b = a << exp, with the constants (and domain of a) so carefully chosen, that there will be no underflows or overflows.

    Thus, you will actually have a custom floating point representation, which is tailored for this specific purpose, omitting the sign bit at least and having the best possible number of bits for the exponent, which might as well be 8.