Why does GCC generate code that conditionally executes a SIMD implementation?

The following code produces assembly that conditionally executes SIMD in GCC 12.3 when compiled with -O3. For completeness, the code always executes SIMD in GCC 13.2 and never executes SIMD in clang 17.0.1.

#include <array>
#include <cstddef>

__attribute__((noinline)) void fn(std::array<int, 4>& lhs, const std::array<int, 4>& rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}

Here is the link in godbolt.

Here is the actual assembly from GCC 12.3 (with -O3):

fn(std::array<int, 4ul>&, std::array<int, 4ul> const&):
        lea     rdx, [rsi+4]
        mov     rax, rdi
        sub     rax, rdx
        cmp     rax, 8
        jbe     .L2
        movdqu  xmm0, XMMWORD PTR [rsi]
        movdqu  xmm1, XMMWORD PTR [rdi]
        paddd   xmm0, xmm1
        movups  XMMWORD PTR [rdi], xmm0
        ret
.L2:
        mov     eax, DWORD PTR [rsi]
        add     DWORD PTR [rdi], eax
        mov     eax, DWORD PTR [rsi+4]
        add     DWORD PTR [rdi+4], eax
        mov     eax, DWORD PTR [rsi+8]
        add     DWORD PTR [rdi+8], eax
        mov     eax, DWORD PTR [rsi+12]
        add     DWORD PTR [rdi+12], eax
        ret

I am very interested to know a) the purpose of the first 5 assembly instructions and b) if there is anything that can be done to cause GCC 12.3 to emit the code of GCC 13.2 (ideally, without manually writing SSE).

Solution

It seems GCC12 is treating the class reference like it would a simple int *, in terms of whether lhs and rhs could partially overlap.
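For part (a) of the question, the five instructions before the first movdqu look like exactly that overlap check. Here is a rough C++ rendering of them. takes_scalar_path is a hypothetical helper name, and the register mapping assumes the x86-64 System V calling convention (lhs in rdi, rhs in rsi):

```cpp
#include <cstdint>

// Hypothetical helper mirroring the five-instruction prologue in the
// GCC 12.3 output. Returns true when the scalar fallback (.L2) is taken.
bool takes_scalar_path(const void* lhs, const void* rhs)
{
    auto l = reinterpret_cast<std::uintptr_t>(lhs);  // rdi
    auto r = reinterpret_cast<std::uintptr_t>(rhs);  // rsi
    // lea rdx, [rsi+4] ; mov rax, rdi ; sub rax, rdx
    std::uintptr_t diff = l - (r + 4);
    // cmp rax, 8 ; jbe .L2   (jbe is an unsigned comparison)
    return diff <= 8;
}
```

Because the comparison is unsigned, a negative difference wraps to a huge value and falls through to the SIMD path. The scalar path is only taken when lhs starts 4 to 12 bytes past rhs, i.e. 1, 2, or 3 int elements later for int-aligned addresses. Those are the offsets where an earlier store to lhs would change an rhs element that a later iteration still has to read.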

Exact overlap would be fine: if lhs[idx] is the same int as rhs[idx], we read it twice before writing it. But with partial overlap, rhs[3] for example could have been updated by one of the lhs[0..2] additions, which wouldn't happen with SIMD if we did all the loads first before any of the stores.
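To make the difference concrete, here is a small sketch using plain int pointers, which are allowed to overlap partially. add_scalar and add_loads_first are hypothetical names; the second function only models what a single 4-wide SIMD add does (all loads, then all stores) and is not what GCC emits:

```cpp
#include <cstddef>

// Element by element, like the source loop: each store can feed a later load.
void add_scalar(int* lhs, const int* rhs)
{
    for (std::size_t i = 0; i != 4; ++i)
        lhs[i] = lhs[i] + rhs[i];
}

// All loads first, then all stores: a model of one 4-wide vector add.
void add_loads_first(int* lhs, const int* rhs)
{
    int tmp[4];
    for (std::size_t i = 0; i != 4; ++i)
        tmp[i] = lhs[i] + rhs[i];   // every input is read here
    for (std::size_t i = 0; i != 4; ++i)
        lhs[i] = tmp[i];            // nothing is written until now
}
```

With int buf[5] = {1, 1, 1, 1, 1}, calling add_scalar(buf + 1, buf) leaves {1, 2, 3, 4, 5} because each sum feeds the next one, while add_loads_first(buf + 1, buf) leaves {1, 2, 2, 2, 2}. The two strategies disagree exactly in the partial-overlap case, which is why the compiler needs either a runtime check or a proof that overlap cannot happen.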

GCC13 knows that class objects aren't allowed to partially overlap (except for common initial sequence stuff for different struct/class types, which I think doesn't apply here). That would be UB so it can assume it doesn't happen. GCC12's code-gen is a missed optimization.

So how do we help GCC12? The usual go-to is __restrict for removing overlap checks or enabling auto-vectorization at all when the compiler doesn't want to invent checks + a fallback. In C, restrict is part of the language, but in C++ it's only an extension. (Supported by the major mainstream compilers, and you can use the preprocessor to #define it to the empty string on others.) You can use __restrict with references as well as pointers. (At least GCC and Clang accept it with no warnings at -Wall; I didn't double-check the docs to be sure this is officially supported.)

// downside: fn_restrict(same, same) would be UB
void fn_restrict(std::array<int, 4>&__restrict lhs, const std::array<int, 4>& rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}
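The preprocessor fallback mentioned above could be sketched like this. FN_RESTRICT and fn_restrict_portable are hypothetical names, and the compiler test is deliberately conservative: it only enables the keyword for GCC and Clang (Clang also defines __GNUC__). Add other compilers after checking their documentation.

```cpp
#include <array>
#include <cstddef>

// Hypothetical portability macro: expands to the extension keyword where it
// is known to be accepted, and to nothing elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#  define FN_RESTRICT __restrict
#else
#  define FN_RESTRICT
#endif

// Same UB caveat as fn_restrict: the caller promises lhs and rhs don't alias.
void fn_restrict_portable(std::array<int, 4>& FN_RESTRICT lhs,
                          const std::array<int, 4>& rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}
```

On compilers where the macro expands to nothing, the function is still correct. It just loses the no-alias promise, so the compiler may fall back to an overlap check or scalar code.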

Or manually read all of `lhs` before writing any of it

Since your array is small enough to fit in one SIMD register, there's no inefficiency in copying. This would be bad for array<int, 1000> or something!

// downside: only efficient for small arrays that fit in a few vector regs at most
void fn_temporary(std::array<int, 4>& lhs, const std::array<int, 4>& rhs)
{
    auto sum = lhs;    // read the possibly-aliasing data into a temporary
    for (std::size_t idx = 0; idx != 4; ++idx) {
        sum[idx] += rhs[idx];  // update the temporary
    }
    lhs = sum;   // store back, after all loads
}

Both of these compile to the same auto-vectorized asm as GCC13, with no wasted instructions (Godbolt)

# GCC12 -O3
fn_temporary(std::array<int, 4ul>&, std::array<int, 4ul> const&):
        movdqu  xmm0, XMMWORD PTR [rsi]
        movdqu  xmm1, XMMWORD PTR [rdi]
        paddd   xmm0, xmm1
        movups  XMMWORD PTR [rdi], xmm0
        ret

Promising alignment (like alignas(16) on one of the types?) could let it use paddd xmm1, [rdi], a memory source operand, without AVX.
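One way to express such an alignment promise is an over-aligned wrapper type. This is only a sketch: AlignedInt4 and fn_aligned are hypothetical names, and whether a given GCC version then folds one load into paddd has not been checked here, so inspect the asm on Godbolt before relying on it.

```cpp
#include <array>
#include <cstddef>

// Hypothetical wrapper that over-aligns the 4-int payload to 16 bytes,
// so every object of this type starts on an SSE-friendly boundary.
struct alignas(16) AlignedInt4 {
    std::array<int, 4> v;
};

// Same read-everything-first pattern as fn_temporary, on the aligned type.
void fn_aligned(AlignedInt4& lhs, const AlignedInt4& rhs)
{
    auto sum = lhs.v;                 // load all of lhs into a temporary
    for (std::size_t idx = 0; idx != 4; ++idx) {
        sum[idx] += rhs.v[idx];       // update the temporary
    }
    lhs.v = sum;                      // store back, after all loads
}
```

The cost is that callers have to use the wrapper type (or otherwise guarantee 16-byte alignment), which changes the function's interface.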
