python · numpy · performance

Why is NumPy's fabs much slower than abs?


This Python 3.12.7 script with Numpy 2.2.4:

import numpy as np, timeit as ti

a = np.random.rand(1000).astype(np.float32)
print(f'Minimum, median and maximum execution time in us:')

# warm up via setup=fun, then time 999 single executions of each expression (in us)
for fun in ('np.fabs(a)', 'np.abs(a)'):
  t = 10**6 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=999))
  print(f'{fun:20}  {np.amin(t):8,.3f}  {np.median(t):8,.3f}  {np.amax(t):8,.3f}')

produces these results on AMD Ryzen 7 3800X:

Minimum, median and maximum execution time in us:
np.fabs(a)               1.813     1.843     4.929
np.abs(a)                0.781     0.811     1.463

indicating that np.fabs() is more than 2x slower than np.abs(), despite the latter having more functionality. What is the reason?


Solution

  • fabs always calls the C math library function of the same name (or, in this case, the fabsf variant for float32). Therefore the operation cannot be inlined or vectorized. I have verified this by injecting a custom version using LD_PRELOAD.
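
    If you want to probe this yourself, here is a small, hypothetical scaling check (not from the original post) built on the question's timing approach: if np.fabs pays a per-element library-call cost while np.abs runs an inlined, vectorized loop, the absolute gap should grow roughly with the array length.

      import numpy as np, timeit as ti

      # Hypothetical scaling check: time both functions on growing arrays.
      # A per-element call overhead should make the absolute gap grow with n.
      for n in (1_000, 10_000, 100_000):
          a = np.random.rand(n).astype(np.float32)
          for fun in ('np.fabs(a)', 'np.abs(a)'):
              t = min(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=10, repeat=100))
              print(f'n={n:7,d}  {fun:12}  {10**5 * t:10.3f} us per call')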

    I've checked the source code of glibc (which just calls __builtin_fabsf(x)) and looked at the generated code with godbolt. I see no complexity (e.g. NaN handling or math exceptions) that differentiates fabs from the fastest, simple abs implementation. I assume NumPy always calls the library for the f… functions as a matter of principle. Similar effects can be expected from fmin and fmax, for example (though there the implementation is actually more complex than plain min() and max()).
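
    The fmin/fmax claim is easy to check with the same harness, extended to the binary variants (np.minimum/np.maximum are the plain NumPy counterparts); this is only a sketch and the actual numbers will depend on the platform:

      import numpy as np, timeit as ti

      a = np.random.rand(1000).astype(np.float32)
      b = np.random.rand(1000).astype(np.float32)
      print('Minimum, median and maximum execution time in us:')

      # np.fmin/np.fmax presumably go through the C library and carry extra
      # NaN-handling semantics; np.minimum/np.maximum use NumPy's own loops.
      for fun in ('np.fmin(a, b)', 'np.minimum(a, b)', 'np.fmax(a, b)', 'np.maximum(a, b)'):
          t = 10**6 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=999))
          print(f'{fun:20}  {np.amin(t):8,.3f}  {np.median(t):8,.3f}  {np.amax(t):8,.3f}')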

    From looking at old bug reports, it appears that the performance difference (and the signaling math behavior) is actually platform-dependent. On MIPS, abs() used to be (or still is?) slower because the compiler could not turn the generic C code into a simple bit mask, due to the possibility of floating-point exceptions (which fabs must never raise).

    This also highlights that implementing fabs without compiler support is not as simple as writing x < 0 ? -x : x, because -0 has to be distinguished from +0, which that comparison cannot do. np.abs seems to do the right thing on modern NumPy, even though a simple C version would not, but I have not investigated where and how this behavior is implemented for floating-point types.
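
    To see the -0.0 corner case concretely, here is a minimal NumPy sketch contrasting the naive comparison-based version with clearing the sign bit (which is what a branch-free abs effectively does):

      import numpy as np

      x = np.float32(-0.0)

      # Naive "x < 0 ? -x : x": -0.0 < 0 is False, so -0.0 comes back unchanged.
      naive = -x if x < 0 else x
      print(naive, np.signbit(naive))                        # -0.0  True

      # Clearing the sign bit maps -0.0 to +0.0.
      bits = np.array([x]).view(np.uint32) & np.uint32(0x7FFF_FFFF)
      masked = bits.view(np.float32)[0]
      print(masked, np.signbit(masked))                      # 0.0  False

      # Both np.abs and np.fabs clear the sign bit here.
      print(np.signbit(np.abs(x)), np.signbit(np.fabs(x)))   # False  False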