Tags: c++, gcc, vectorization, compiler-optimization, avx

Why is gcc so much worse than clang at vectorizing a conditional multiply over std::vector<float>?


Consider the following float loop, compiled with -O3 -mavx2 -mfma:

for (auto i = 0; i < a.size(); ++i) {
    a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0;
}

Clang does a perfect job of vectorizing it. It uses 256-bit ymm registers and understands the difference between vblendps and vandps, picking the one that gives the best possible performance.

.LBB0_7:
        vcmpltps        ymm2, ymm1, ymm0
        vmulps  ymm0, ymm0, ymm1
        vandps  ymm0, ymm2, ymm0

GCC, however, is much worse. For some reason it doesn't get beyond 128-bit xmm registers (-mprefer-vector-width=256 doesn't change anything).

.L6:
        vcomiss xmm0, xmm1
        vmulss  xmm0, xmm0, xmm1
        vmovss  DWORD PTR [rcx+rax*4], xmm0

If I replace it with a plain array (as in the guideline), gcc does vectorize it with AVX ymm registers.

int a[256], b[256], c[256];
auto foo (int *a, int *b, int *c) {
  int i;
  for (i=0; i<256; i++){
    a[i] =  (b[i] > c[i]) ? (b[i] * c[i]) : 0;
  }
}

However, I couldn't find a way to get the same result with a variable-length std::vector. What sort of hint does gcc need to vectorize a loop over std::vector with AVX?

Source on Godbolt with gcc 13.1 and clang 14.0.0


Solution

  • It's not std::vector that's the problem; it's float, plus GCC's usually-bad default of -ftrapping-math. That option is supposed to treat FP exceptions as a visible side effect, but it doesn't always do that correctly, and it misses some optimizations that would be safe.

    In this case, there is a conditional FP multiply in the source, so strict exception behaviour avoids possibly raising an overflow, underflow, inexact, or other exception in case the compare was false.

    GCC does that correctly in this case using scalar code: ...ss is Scalar Single, using the bottom element of 128-bit XMM registers, not vectorized at all. The asm in your question isn't GCC's actual output: the real output loads both elements with vmovss, then branches on a vcomiss result before vmulss, so the multiply doesn't happen if b[i] > c[i] isn't true. So unlike your "GCC" asm, GCC's actual asm does, I think, correctly implement -ftrapping-math.
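
The exception behaviour described above can be observed from C++ through <cfenv>. The following is only a sketch: cond_mul and raises_overflow are hypothetical helper names, not anything from the question, and the volatile variables are there just to stop the compiler from folding the arithmetic at compile time. It assumes a target where FE_OVERFLOW is defined and the FP environment is accessible.

```cpp
#include <cfenv>

// Hypothetical helper: the conditional multiply from the question, for one element.
float cond_mul(float b, float c) {
    return (b > c) ? (b * c) : 0.0f;
}

// Hypothetical helper: reports whether evaluating cond_mul(b, c)
// left the FE_OVERFLOW flag set in the FP environment.
bool raises_overflow(float b, float c) {
    volatile float vb = b;   // volatile: force the loads and the multiply to happen at run time
    volatile float vc = c;
    std::feclearexcept(FE_ALL_EXCEPT);
    volatile float r = cond_mul(vb, vc);
    (void)r;
    return std::fetestexcept(FE_OVERFLOW) != 0;
}
```

Calling raises_overflow(2e30f, 1e30f) takes the multiply branch, and the product overflows float, so the flag should be set. Calling raises_overflow(1e30f, 2e30f) has a false compare, so the abstract machine performs no multiply; code that respects -ftrapping-math should leave the flag clear, while a compiler that speculates the multiply (as the vectorized vmulps / vandps sequence does) may set it.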

    Notice that your example which does auto-vectorize uses int * args, not float*. If you change it to float* and use the same compiler options, it doesn't auto-vectorize either, even with float *__restrict a (https://godbolt.org/z/nPzsf377b).

    @273K's answer shows that AVX-512 lets float auto-vectorize even with -ftrapping-math, since AVX-512 masking (ymm2{k1}{z}) suppresses FP exceptions for masked elements, not raising FP exceptions from any FP multiplies that don't happen in the C++ abstract machine.


    gcc -O3 -mavx2 -mfma -fno-trapping-math auto-vectorizes all 3 functions (Godbolt)

    void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
      for (int i=0; i<256; i++){
        a[i] =  (b[i] > c[i]) ? (b[i] * c[i]) : 0;
      }
    }
    
    foo(float*, float*, float*):
            xor     eax, eax
    .L143:
            vmovups ymm2, YMMWORD PTR [rsi+rax]
            vmovups ymm3, YMMWORD PTR [rdx+rax]
            vmulps  ymm1, ymm2, YMMWORD PTR [rdx+rax]
            vcmpltps        ymm0, ymm3, ymm2
            vandps  ymm0, ymm0, ymm1
            vmovups YMMWORD PTR [rdi+rax], ymm0
            add     rax, 32
            cmp     rax, 1024
            jne     .L143
            vzeroupper
            ret
    

    BTW, I'd recommend -march=x86-64-v3 for an AVX2+FMA feature-level. That also includes BMI1+BMI2 and stuff. It still just uses -mtune=generic I think, but could hopefully in future ignore tuning things that only matter for CPUs that don't have AVX2+FMA+BMI2.

    The std::vector functions are bulkier since we didn't use float *__restrict a = avec.data(); or similar to promise non-overlap of the data pointed-to by the std::vector control blocks (and the size isn't known to be a multiple of the vector width), but the non-cleanup loops for the no-overlap case are vectorized with the same vmulps / vcmpltps / vandps.
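
As a rough sketch of the __restrict idea mentioned above: cond_mul_vec is a hypothetical name, and __restrict is a compiler extension (GCC, Clang, MSVC), not standard C++. The caller is assumed to pass three distinct vectors; passing the same vector as both input and output would break the no-overlap promise. Whether a given GCC version actually drops the overlap checks for local __restrict pointers is something to confirm in the generated asm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical wrapper: same loop as in the question, but over raw pointers
// that promise non-overlap. Assumes avec, bvec and cvec are distinct objects.
void cond_mul_vec(std::vector<float>& avec,
                  const std::vector<float>& bvec,
                  const std::vector<float>& cvec) {
    assert(bvec.size() >= avec.size() && cvec.size() >= avec.size());
    const std::size_t n = avec.size();          // read the size once, outside the loop
    float *__restrict a = avec.data();
    const float *__restrict b = bvec.data();
    const float *__restrict c = cvec.data();
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0.0f;
    }
}
```

With float elements this still needs -fno-trapping-math (or AVX-512) for GCC to auto-vectorize, as discussed above; the raw pointers only address the overlap checks.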




    Tweaking the source to make the multiply unconditional? No

    If the multiply in the C source happens regardless of the condition, then GCC would be allowed to vectorize it the efficient way without AVX-512 masking.

    // still scalar asm with GCC -ftrapping-math which is a bug
    void foo (float *__restrict a, float *__restrict b, float *__restrict c) {
      for (int i=0; i<256; i++){
        float prod = b[i] * c[i];
        a[i] =  (b[i] > c[i]) ? prod : 0;
      }
    }
    

    But unfortunately GCC -O3 -march=x86-64-v3 (Godbolt with and without the default -ftrapping-math) still makes scalar asm that only conditionally multiplies!

    This is a bug in -ftrapping-math. Not only is it too conservative, missing the chance to auto-vectorize: It's actually buggy, not raising FP exceptions for some multiplies the abstract machine (or a debug build) actually performs. Crap behaviour like this is why -ftrapping-math is unreliable and probably shouldn't be on by default.


    @Ovinus Real's answer points out GCC -ftrapping-math could still have auto-vectorized the original source by masking both inputs instead of the output. 0.0 * 0.0 never raises any FP exceptions, so it's basically emulating AVX-512 zero-masking.

    This would be more expensive and have more latency for out-of-order exec to hide, but it is still much better than scalar when AVX1 is available, especially for small to medium arrays that are hot in some level of cache.

    (If writing with intrinsics, just mask the output to zero unless you actually want to check the FP environment for exception flags after the loop.)
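
A minimal sketch of that intrinsics approach, masking the output the same way the vmulps / vcmpltps / vandps asm does. cond_mul_avx is a hypothetical name; the vector path is only compiled when the compiler defines __AVX__ (for example with -mavx2 or -march=x86-64-v3), and a scalar loop handles the tail or the whole array otherwise. It multiplies unconditionally, so it can raise FP exceptions for elements where the compare is false.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical function: a[i] = (b[i] > c[i]) ? b[i] * c[i] : 0 for i in [0, n).
// Assumes a does not overlap b or c.
void cond_mul_avx(float* a, const float* b, const float* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    for (; i + 8 <= n; i += 8) {
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vc = _mm256_loadu_ps(c + i);
        // All-ones in lanes where c < b, i.e. b > c (same compare as vcmpltps).
        __m256 mask = _mm256_cmp_ps(vc, vb, _CMP_LT_OS);
        __m256 prod = _mm256_mul_ps(vb, vc);            // unconditional multiply
        // AND with the mask zeroes the lanes where the compare was false.
        _mm256_storeu_ps(a + i, _mm256_and_ps(mask, prod));
    }
#endif
    // Scalar cleanup for the remaining elements (or everything without AVX).
    for (; i < n; ++i) {
        a[i] = (b[i] > c[i]) ? (b[i] * c[i]) : 0.0f;
    }
}
```

The AND works because the compare result is all-ones or all-zeros per lane, so no blend instruction is needed when the alternative value is 0.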

    Doing this in scalar source doesn't lead GCC into making asm like that: GCC compiles this to the same branchy scalar asm unless you use -fno-trapping-math. At least that's not a bug this time, just a missed optimization: this doesn't do b[i]*c[i] when the compare is false.

    // doesn't help, still scalar asm with GCC -ftrapping-math
    void bar (float *__restrict a, float *__restrict b, float *__restrict c) {
      for (int i=0; i<256; i++){
        float bi = b[i];
        float ci = c[i];
        if (! (bi > ci)) {
            bi = ci = 0;
        }
        a[i] = bi * ci;
      }
    }