Consider the following float loop (a, b and c are std::vector<float>), compiled with -O3 -mavx2 -mfma:
for (auto i = 0; i < a.size(); ++i) {
a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}
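(For reproducing this, a minimal complete version of the loop; the function name is mine and the index is std::size_t to avoid a signed/unsigned compare, otherwise it matches the snippet above.)

#include <vector>
#include <cstddef>

void mul_if_greater(std::vector<float> &a, const std::vector<float> &b,
                    const std::vector<float> &c)
{
    // conditional multiply, else 0 -- same body as the loop above
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}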
Clang does a perfect job of vectorizing it: it uses 256-bit ymm registers and knows when vandps can replace vblendps for the best performance possible.
.LBB0_7:
vcmpltps ymm2, ymm1, ymm0
vmulps ymm0, ymm0, ymm1
vandps ymm0, ymm2, ymm0
GCC, however, is much worse. For some reason it doesn't get better than SSE 128-bit vectors (-mprefer-vector-width=256 won't change anything).
.L6:
vcomiss xmm0, xmm1
vmulss xmm0, xmm0, xmm1
vmovss DWORD PTR [rcx+rax*4], xmm0
If I replace it with plain arrays (as in the guideline), GCC does vectorize it to AVX ymm registers.
int a[256], b[256], c[256];
auto foo (int *a, int *b, int *c) {
int i;
for (i=0; i<256; i++){
a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}
}
However, I couldn't find out how to do this with a variable-length std::vector. What sort of hint does GCC need to vectorize the std::vector version to AVX?
It's not std::vector that's the problem, it's float and GCC's usually-bad default of -ftrapping-math, which is supposed to treat FP exceptions as a visible side-effect, but doesn't always correctly do that, and misses some optimizations that would be safe.

In this case, there is a conditional FP multiply in the source, so strict exception behaviour avoids possibly raising an overflow, underflow, inexact, or other exception in case the compare was false.
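(A hedged illustration, not from the question or answer, of the kind of code strict exception semantics is meant to protect: a caller that inspects the FP exception flags after the loop. If the compiler multiplied unconditionally, this could report flags for multiplies the abstract machine never performed. The function name is mine.)

#include <cfenv>
#include <cstdio>
#include <cstddef>
#include <vector>
// (Strictly, fenv access also wants #pragma STDC FENV_ACCESS ON, which GCC doesn't implement.)

void mul_if_greater_checked(std::vector<float> &a, const std::vector<float> &b,
                            const std::vector<float> &c)
{
    std::feclearexcept(FE_ALL_EXCEPT);
    for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
    // Under strict exception semantics these flags should only reflect
    // multiplies whose condition was actually true.
    if (std::fetestexcept(FE_OVERFLOW | FE_UNDERFLOW | FE_INEXACT))
        std::puts("an FP exception flag was raised by a selected multiply");
}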
GCC does that correctly in this case using scalar code: ...ss is Scalar Single, using the bottom element of 128-bit XMM registers, not vectorized at all. Your asm isn't GCC's actual output: it loads both elements with vmovss, then branches on a vcomiss result before vmulss, so the multiply doesn't happen if b[i] > c[i] isn't true. So unlike your "GCC" asm, GCC's actual asm does, I think, correctly implement -ftrapping-math.
Notice that your example which does auto-vectorize uses int* args, not float*. If you change it to float* and use the same compiler options, it doesn't auto-vectorize either, even with float *__restrict a (https://godbolt.org/z/nPzsf377b).
@273K's answer shows that AVX-512 lets float auto-vectorize even with -ftrapping-math, since AVX-512 masking (ymm2{k1}{z}) suppresses FP exceptions for masked elements, not raising FP exceptions from any FP multiplies that don't happen in the C++ abstract machine.
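(A hedged AVX-512VL intrinsics sketch of what that masking buys you, not code from that answer; the function name and the assumption that the length is a multiple of 8 are mine. Compile with something like -march=x86-64-v4. The zero-masked multiply only operates on, and can only fault on, lanes whose mask bit is set.)

#include <immintrin.h>

void mul_if_greater_avx512vl(float *__restrict a, const float *b,
                             const float *c, int n)
{
    for (int i = 0; i < n; i += 8) {          // n assumed to be a multiple of 8
        __m256 bv = _mm256_loadu_ps(b + i);
        __m256 cv = _mm256_loadu_ps(c + i);
        __mmask8 m = _mm256_cmp_ps_mask(bv, cv, _CMP_GT_OQ);   // b[i] > c[i]
        // Zero-masking {z}: masked-off lanes become 0.0f and their multiplies
        // neither execute nor raise FP exceptions.
        _mm256_storeu_ps(a + i, _mm256_maskz_mul_ps(m, bv, cv));
    }
}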
gcc -O3 -mavx2 -mfma -fno-trapping-math auto-vectorizes all 3 functions (Godbolt):

void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
for (int i=0; i<256; i++){
a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}
}
foo(float*, float*, float*):
xor eax, eax
.L143:
vmovups ymm2, YMMWORD PTR [rsi+rax]
vmovups ymm3, YMMWORD PTR [rdx+rax]
vmulps ymm1, ymm2, YMMWORD PTR [rdx+rax]
vcmpltps ymm0, ymm3, ymm2
vandps ymm0, ymm0, ymm1
vmovups YMMWORD PTR [rdi+rax], ymm0
add rax, 32
cmp rax, 1024
jne .L143
vzeroupper
ret
BTW, I'd recommend -march=x86-64-v3 for an AVX2+FMA feature-level. That also includes BMI1+BMI2 and stuff. It still just uses -mtune=generic I think, but could hopefully in future ignore tuning things that only matter for CPUs that don't have AVX2+FMA+BMI2.
The std::vector functions are bulkier since we didn't use float *__restrict a = avec.data(); or similar to promise non-overlap of the data pointed to by the std::vector control blocks (and the size isn't known to be a multiple of the vector width), but the non-cleanup loops for the no-overlap case are vectorized with the same vmulps / vcmpltps / vandps.
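(A hedged sketch of that approach, not code from the answer: pulling __restrict-qualified pointers out of the vectors so GCC can assume the three buffers don't alias; the names are mine, and you still need -fno-trapping-math or AVX-512 for the float multiply itself to vectorize.)

#include <vector>
#include <cstddef>

void mul_if_greater_vec(std::vector<float> &avec, const std::vector<float> &bvec,
                        const std::vector<float> &cvec)
{
    float *__restrict a = avec.data();        // promise: these three buffers
    const float *__restrict b = bvec.data();  // don't overlap each other
    const float *__restrict c = cvec.data();
    const std::size_t n = avec.size();
    for (std::size_t i = 0; i < n; ++i)
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}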
See also:

- -ftrapping-math is broken and "never worked" according to GCC dev Marc Glisse, yet https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54192 from 2012, proposing to make it not the default, is still open.
- Other parts of -ffast-math can be useful on their own, such as -fno-math-errno, which allows many math functions to inline and is not a problem for normal code that doesn't check errno after calling sqrt or whatever.
- Vectorizing an FP reduction (e.g. a sum) also needs -ffast-math or #pragma omp simd reduction(+:my_sum_var), but @phuclv's answer has some good links.

If the multiply in the C source happens regardless of the condition, then GCC would be allowed to vectorize it the efficient way without AVX-512 masking.
// still scalar asm with GCC -ftrapping-math which is a bug
void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
for (int i=0; i<256; i++){
float prod = b[i] * c[i];
a[i] = (b[i] > c[i]) ? prod : 0;
}
}
But unfortunately GCC -O3 -march=x86-64-v3 (Godbolt with and without the default -ftrapping-math) still makes scalar asm that only conditionally multiplies!

This is a bug in -ftrapping-math. Not only is it too conservative, missing the chance to auto-vectorize: it's actually buggy, not raising FP exceptions for some multiplies the abstract machine (or a debug build) actually performs. Crap behaviour like this is why -ftrapping-math is unreliable and probably shouldn't be on by default.
@Ovinus Real's answer points out GCC -ftrapping-math could still have auto-vectorized the original source by masking both inputs instead of the output: 0.0 * 0.0 never raises any FP exceptions, so it's basically emulating AVX-512 zero-masking.
This would be more expensive and have more latency for out-of-order exec to hide, but it's still much better than scalar, especially when AVX1 is available, and especially for small to medium arrays that are hot in some level of cache.
(If writing with intrinsics, just mask the output to zero unless you actually want to check the FP environment for exception flags after the loop.)
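(A hedged AVX2 intrinsics sketch of that input-masking idea, not code from any of the answers; the function name is mine and the length is assumed to be a multiple of 8. Masked-off lanes compute 0.0f * 0.0f, which matches the : 0 result and raises no exceptions; the commented-out line is the cheaper output-masking variant from the parenthetical above.)

#include <immintrin.h>

void mul_if_greater_avx2(float *__restrict a, const float *b,
                         const float *c, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m256 bv = _mm256_loadu_ps(b + i);
        __m256 cv = _mm256_loadu_ps(c + i);
        __m256 mask = _mm256_cmp_ps(bv, cv, _CMP_GT_OQ);  // all-ones where b > c
        // Zero both inputs in lanes where the condition is false, so those
        // lanes multiply 0.0f * 0.0f and can't raise any FP exception.
        __m256 bm = _mm256_and_ps(bv, mask);
        __m256 cm = _mm256_and_ps(cv, mask);
        _mm256_storeu_ps(a + i, _mm256_mul_ps(bm, cm));
        // Simpler if you don't check the FP environment afterwards:
        // _mm256_storeu_ps(a + i, _mm256_and_ps(_mm256_mul_ps(bv, cv), mask));
    }
}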
Doing this input-masking in scalar source doesn't lead GCC into making asm like that: it compiles the following to the same branchy scalar asm unless you use -fno-trapping-math. At least that's not a bug this time, just a missed optimization: this source doesn't do b[i]*c[i] when the compare is false.
// doesn't help, still scalar asm with GCC -ftrapping-math
void bar (float *__restrict a, float *__restrict b, float *__restrict c) {
for (int i=0; i<256; i++){
float bi = b[i];
float ci = c[i];
if (! (bi > ci)) {
bi = ci = 0;
}
a[i] = bi * ci;
}
}