I have been benchmarking some fast numerical code on various compilers recently and was struck by a systematic variation in speed with certain compilers at maximum optimisation (-O2) with AVX/AVX2 code generation. I have narrowed some of it down to a curious behaviour that sets the fastest code generators apart from the rest.
Namely, with AVX code generation enabled and -O2, Clang and ICX will inline calls to `fminf`/`fmin` as `minss`, whereas the rest of the pack (GCC, ICC and MSVC) stubbornly continue to call `fminf`. They all quite happily inline `fabsf`/`fabs`, though.
The code for an MRE is below and if I have done it right this is a link to it on Godbolt where you can try it on various compilers. Clang and Intel's ICX seem to inline it going as far back as I could find compilers to test.
```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main()
{
    float y, x = 2.0;
    y = 10 * rand() - 5;
    y = fabs(y);
    if (x < y) y = x;
    printf("%g", y);
    y = 10 * rand();
    y = fminf(x, y);
    printf("%g", y);
}
```
A summary table is as follows:
| Compiler | Inlines `fminf`? |
|---|---|
| GCC 13.2 | no |
| ICC latest | no |
| MSVC 2022 x64 | no |
| Clang 17.0.1 | yes |
| ICX | yes |
`fmin` and `fmax` appear in quite a lot of numerical optimisation code, so the slowdown can be fairly significant. It can be worked around by defining macros `FMIN` and `FMAX` once you know. It would be nice if the other compilers inlined it though.
I can't think of any reason why this particular optimisation is missing in some compilers... does anyone have an explanation? `fabs` is a more complex case, and that one does inline in all of them.
The basic issue is that the `minss` instruction does not do the same thing as the `fminf` function when the second operand is NaN: `minss` returns its second source operand whenever the comparison is unordered, whereas `fminf` is required to return the non-NaN operand. A call to `fminf` therefore cannot be safely replaced by (just) that instruction.
You can use `-ffast-math` to enable optimizations that may not be strictly IEEE-correct (particularly in the presence of NaNs, as here).