If I take this code
#include <cmath>
void compute_sqrt(const double* x, double* y, int n) {
    int i;
    #pragma omp simd linear(i)
    for (i=0; i<n; ++i) {
        y[i] = std::sqrt(x[i]);
    }
}
and compile with g++ -S -c -O3 -fopenmp-simd -march=cascadelake, then I get instructions like this in the loop (compiler-explorer)
...
vsqrtsd %xmm0, %xmm0, %xmm0
...
XMMs are 128-bit registers, but cascadelake supports AVX-512. Is there a way to get GCC to use 256-bit (YMM) or 512-bit (ZMM) registers?
By comparison, ICC defaults to using 256-bit registers for cascadelake: compiling with icc -c -S -O3 -march=cascadelake -qopenmp-simd produces (compiler-explorer)
...
vsqrtpd 32(%rdi,%r9,8), %ymm1 #7.12
...
and you can add the option -qopt-zmm-usage=high to use 512-bit registers (compiler-explorer)
...
vrsqrt14pd %zmm4, %zmm1 #7.12
...
"XMMs are 128 bit registers"
It's worse than that: vsqrtsd is not even a vector operation, as indicated by the sd suffix (scalar, double precision). XMM registers are also used by scalar floating-point operations like that, but only the low 64 or 32 bits of the register hold the operand and the result. The upper bits carry no useful data: a scalar load zeroes them, and vsqrtsd simply copies them from its first source operand.
The missing options are -fno-math-errno (this flag is also implied by -ffast-math, which has additional effects) and (optionally) -mprefer-vector-width=512.
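-fno-math-errno turns off setting errno for math operations. For square roots in particular, this means a negative input results in NaN without setting errno to EDOM. Without the flag, GCC has to keep a fallback call to the library sqrt for negative inputs so that errno gets set, and that call keeps the loop from being vectorized. ICC apparently does not care about that by default.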
-mprefer-vector-width=512 makes autovectorization prefer 512-bit operations when they make sense. By default, 256-bit operations are preferred, at least for cascadelake, skylake-avx512, and other current processors; it probably won't stay that way for all future processors.