[SOLVED] Fast & accurate atan/arctan approximation algorithm

OP reported a max error of 1.6 radians (92 degrees) which is inconsistent with testing OP's code below, max error input x: 0 to 1 range of about 0.0015 radians. I suspect a mis-code or testing outside the range of 0...1. Did OP means 1.6 milli-radians?

Perhaps a more accurate a*x^5 + b*x^3 + c*x will still be fast enough for OP. It is about 4x more accurate on my machine on average and 2x better worst case. It uses an optimal 3-term polynomial as suggested by @Foon and @Matthew Pope

#include <math.h>
#include <stdio.h>

#ifndef M_PI_4
#define M_PI_4 (3.1415926535897932384626433832795/4.0)
#endif

double FastArcTan(double x) {
  return M_PI_4*x - x*(fabs(x) - 1)*(0.2447 + 0.0663*fabs(x));
}

#define A 0.0776509570923569
#define B -0.287434475393028
#define C (M_PI_4 - A - B)
#define FMT "% 16.8f"

double Fast2ArcTan(double x) {
  double xx = x * x;
  return ((A*xx + B)*xx + C)*x;
}

int main() {
  double mxe1 = 0, mxe2 = 0;
  double err1 = 0, err2 = 0;
  int n = 100;
  for (int i=-n;i<=n; i++) {
    double x = 1.0*i/n;
    double y = atan(x);
    double y_fast1 = FastArcTan(x);
    double y_fast2 = Fast2ArcTan(x);
    printf("%3d x:% .3f y:" FMT "y1:" FMT "y2:" FMT "\n", i, x, y, y_fast1, y_fast2);
    if (fabs(y_fast1 - y) > mxe1 ) mxe1  = fabs(y_fast1 - y);
    if (fabs(y_fast2 - y) > mxe2 ) mxe2  = fabs(y_fast2 - y);
    err1 += (y_fast1 - y)*(y_fast1 - y);
    err2 += (y_fast2 - y)*(y_fast2 - y);
  }
  printf("max error1: " FMT "sum sq1:" FMT "\n", mxe1, err1);
  printf("max error2: " FMT "sum sq2:" FMT "\n", mxe2, err2);
}

Output

 ...
 96 x: 0.960 y:      0.76499283y1:      0.76582280y2:      0.76438526
 97 x: 0.970 y:      0.77017091y1:      0.77082844y2:      0.76967407
 98 x: 0.980 y:      0.77529750y1:      0.77575981y2:      0.77493733
 99 x: 0.990 y:      0.78037308y1:      0.78061652y2:      0.78017777
100 x: 1.000 y:      0.78539816y1:      0.78539816y2:      0.78539816
max error1:       0.00150847sum sq1:      0.00023062
max error2:       0.00084283sum sq2:      0.00004826

Unclear why OP's code uses fabs() given "Input: x = (0, 1]".

[Update 2024]

For low precision functions such as double FastArcTan(double x), we can optimize the error at least 3 ways: error, relative error and unit in the last place error. With float x values in the range [0 ... 1.0], we can test every float and report the worst case.

Error is simply fast_f(x) - reference(x).

Relative error is (fast_f(x) - reference(x))/reference(x) with special consideration when reference(x) is 0 or a sub-normal number - using (fast_f(x) - reference(x))/FLT_MIN in those cases.

Unit in the last place (ULP) error is the the number of float away from the best answer. Example. It is approximately (fast_f(x) - reference(x))/round_down_to_a_power_of_2(reference(x)).

OP, with "gave a max error of 1.6 radians" wants to optimize the error.

In my experience, usually it is the relative error or ULP error that is desired to minimize.

This update reviews code to see if we can do better with some restrictions:

Only interested in float values in the primary range [0.0 ... 1.0].
No wider than float math coded.
Although fmaf() is useful to extend precision, it is not used here given that simple/fast trig functions are often needed on systems without a hardware fuse-multiply-add and good software ones are slow.
The end points f(0.0) and f(1.0) report the best result (e.g. 0.0, M_PI/4). If we tolerate more error in the ends, we can somewhat reduce overall errors. IMO, it is often desirable that the end-points have minimal error.

Test results:

                       :     radians,           %,         ULP
atanf (C library)      : 0.000000051, 0.000008442,       0.852
FastArcTan -OP         : 0.001508884, 5.882352941,  490209.000
Fast2ArcTan-This answer: 0.000843262, 0.961538462,   80838.000
atan_abs3-optimize err : 0.000704068, 1.041666667,   87799.000
atan_rel3-optimize rel : 0.001353447, 0.370370370,   31179.000
atan_ulp3-optimize UDP : 0.001581718, 0.315457413,   26536.822
atan_ulp9-optimize UDP : 0.000000095, 0.000012706,       1.592
atan_jw    -jw         : 0.000157066, 0.019998672,    2635.133
atan_jw_alt-optimize   : 0.000006579, 0.001461326,     111.129

atan_abs3() achieves a better result than Fast2ArcTan() for OP's goals.

I recommend atan_ulp3() for those wanting a few term low relative error.

float atan_abs3(float x) {
  const float a[3] = { //
      0.994766756708199f, -2.8543851807526100E-01f, 0.0760699247645105f};
  float xx = x * x;
  return ((a[2] * xx + a[1]) * xx + a[0]) * x;
}

float atan_rel3(float x) {
  const float a[3] = { //
      0.998141572179073f, -2.9774125603673200E-01f, 0.0849978472551071f};
  float xx = x * x;
  return ((a[2] * xx + a[1]) * xx + a[0]) * x;
}
float atan_ulp3(float x) {
  const float a[3] = { // 
      0.998418889819911f, -2.9993501171084700E-01f, 0.0869142852883849f};
  float xx = x * x;
  return ((a[2] * xx + a[1]) * xx + a[0]) * x;
}

float atan_ulp9(float x) {
  static const float a[9] = { {1.0f, -0.333331728467737f, 0.199940412794435f,
      -0.142123340834229f, 0.10668127080775f, -0.0755120841589429f,
      0.0431408641542157f, -0.0162911733512761f, 0.00289394245323327f}};
  float xx = x * x;
  float sum = 0.0;
  for (unsigned i = 9; i-- > 0;) {
    sum = sum * xx + a[i];
  }
  return sum * x;
}

float atan_jw(float x) {
  return 8 * x / (3 + sqrtf(25 + 80.0f / 3.0f * x * x));
}

float atan_jw_alt(float x) {
  return 8.430893743524f * x / (3.2105332277903100f + sqrtf(27.2515970979709f + 29.3591908371266f * x * x));
}