This Python 3.12.7 script with NumPy 2.2.4:
import numpy as np, timeit as ti
a = np.random.rand(1000).astype(np.float32)
print(f'Minimum, median and maximum execution time in us:')
for fun in ('np.fabs(a)', 'np.abs(a)'):
    t = 10**6 * np.array(ti.repeat(stmt=fun, setup=fun, globals=globals(), number=1, repeat=999))
    print(f'{fun:20} {np.amin(t):8,.3f} {np.median(t):8,.3f} {np.amax(t):8,.3f}')
produces these results on an AMD Ryzen 7 3800X:
Minimum, median and maximum execution time in us:
np.fabs(a) 1.813 1.843 4.929
np.abs(a) 0.781 0.811 1.463
indicating that np.fabs() is more than 2x slower than np.abs(), despite the latter having more functionality. What is the reason?
fabs always calls the C math library function of the same name (or, in this case, its float variant fabsf). Therefore the operation cannot be inlined or vectorized. I have verified this by injecting a custom version using LD_PRELOAD.
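A rough way to cross-check this from Python (a sketch, not a rigorous benchmark) is to look at the per-element cost as the array grows: a scalar libm call per element keeps that cost roughly flat, whereas a vectorized loop gets cheaper per element until memory bandwidth dominates.

# Sketch: per-element cost of np.fabs vs np.abs for growing array sizes.
# A scalar libm call per element keeps the cost roughly constant, while a
# SIMD loop should drive it down for larger arrays.
import numpy as np, timeit as ti

for n in (10**3, 10**4, 10**5, 10**6):
    a = np.random.rand(n).astype(np.float32)
    for fun in (np.fabs, np.abs):
        t = min(ti.repeat(lambda: fun(a), number=10, repeat=20)) / 10
        print(f'{fun.__name__:8} n={n:>8}  {1e9 * t / n:6.3f} ns/element')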
I've checked the source code of glibc (which just calls __builtin_fabsf(x)) and looked at the generated code on godbolt. I see no complexity (e.g. NaN handling or math exceptions) that differentiates fabs from the fastest, simple abs implementation. I assume numpy always calls the library for the f… functions as a matter of principle. Similar effects can be expected from fmin and fmax, for example (though there the implementation is actually more complex than plain min() and max()).
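That extra complexity is visible from Python: per C99 semantics, np.fmin/np.fmax return the non-NaN operand when only one side is NaN, while np.minimum/np.maximum propagate the NaN (a quick illustration, not an exhaustive comparison):

# np.fmin ignores a single NaN operand (C99 fmin semantics);
# np.minimum propagates it, so the f-variants need an extra NaN check.
import numpy as np

x = np.array([np.nan, 1.0], dtype=np.float32)
y = np.array([2.0, np.nan], dtype=np.float32)
print(np.fmin(x, y))     # [2. 1.]   -> NaN operands skipped
print(np.minimum(x, y))  # [nan nan] -> NaN propagates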
From looking at old bug reports, it appears that the performance difference (and the signaling math behavior) is actually platform-dependent. On MIPS, abs() used to be (or still is?) slower because the compiler could not turn the generic C code into a simple bit mask, due to the potential of floating point exceptions (which fabs must never raise).
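For comparison, the bit-mask version the compiler would like to emit can be mimicked in NumPy by clearing the sign bit of the float32 representation (a sketch assuming IEEE 754 float32 with the sign in bit 31, not a claim about how numpy implements np.abs internally):

# fabs as a pure bit operation: clear bit 31 of the IEEE 754 float32 layout.
# This cannot raise floating point exceptions and handles -0.0, inf and NaN.
import numpy as np

a = np.array([-1.5, -0.0, 0.0, -np.inf, np.nan], dtype=np.float32)
masked = (a.view(np.uint32) & 0x7FFFFFFF).view(np.float32)
print(masked)                                             # [1.5 0. 0. inf nan]
print(np.array_equal(masked, np.abs(a), equal_nan=True))  # True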
This also highlights that implementing fabs without compiler support is not as simple as writing x < 0 ? -x : x, because -0 has to be mapped to +0, which the comparison does not achieve: -0 compares equal to 0, so it is returned unchanged with its sign bit set. np.abs seems to do the right thing on modern numpy, even though a simple C version would not, but I have not investigated where and how that behavior is implemented for the floating point types.
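For example, the naive comparison-based version quietly keeps the sign of negative zero, while np.abs clears it (a quick check of the observable behavior, not of numpy's internals):

# x < 0 is False for -0.0, so the ternary returns -0.0 unchanged,
# with the sign bit still set; a correct fabs must clear it.
import numpy as np

x = np.float32(-0.0)
naive = -x if x < 0 else x
print(np.signbit(naive))      # True  -> still negative zero
print(np.signbit(np.abs(x)))  # False -> sign bit cleared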