Tags: c, optimization, sse, avx, auto-vectorization

Why won't simple code get auto-vectorized with SSE and AVX in modern compilers?


I've recently been diving deeper into x86-64 architecture and exploring the capabilities of SSE and AVX. I attempted to write a simple vector addition function like this:

void compute(const float *a, const float *b, float *c) {
    c[0] = a[0] + b[0];
    c[1] = a[1] + b[1];
    c[2] = a[2] + b[2];
    c[3] = a[3] + b[3];
}

Using both gcc and clang, I compiled with the following options:

cc -std=c23 -march=native -O3 -ftree-vectorize main.c

However, when I checked the disassembly, the output wasn’t quite what I expected in terms of vectorization:

compute:
  vmovss xmm0, dword ptr [rdi]
  vaddss xmm0, xmm0, dword ptr [rsi]
  vmovss dword ptr [rdx], xmm0
  vmovss xmm0, dword ptr [rdi + 4]
  vaddss xmm0, xmm0, dword ptr [rsi + 4]
  vmovss dword ptr [rdx + 4], xmm0
  vmovss xmm0, dword ptr [rdi + 8]
  vaddss xmm0, xmm0, dword ptr [rsi + 8]
  vmovss dword ptr [rdx + 8], xmm0
  vmovss xmm0, dword ptr [rdi + 12]
  vaddss xmm0, xmm0, dword ptr [rsi + 12]
  vmovss dword ptr [rdx + 12], xmm0
  ret

This seems like scalar code, processing one element at a time. But when I manually use intrinsics, I get the expected vectorized implementation:

#include <xmmintrin.h>

void compute(const float *a, const float *b, float *c) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    __m128 vc = _mm_add_ps(va, vb);
    _mm_storeu_ps(c, vc);
}

As I understand it, modern processors are incredibly powerful, and SSE (introduced in 1999) and AVX (since 2011) are now standard. Yet it seems compilers don't always take full advantage of these instructions automatically, even when I explicitly enable optimizations.

It feels a bit like we've invented teleportation, but people still prefer to cross the Atlantic by boat. Is there a rational reason why modern compilers might be hesitant to generate vectorized code for something as straightforward as this?


As Barmar suggested, 4 elements might not be enough to benefit from vectorization. I tried the following and got the same disappointing results:

float a[512];
float b[512];
float c[512];

void compute() {  
    for (size_t i = 0; i < 512; i++) 
        c[i] = a[i] + b[i];
}

(On Godbolt, GCC -O3 -march=x86-64-v3 does auto-vectorize this with 256-bit AVX instructions.)
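For reference, a hand-written 256-bit version of that loop with AVX intrinsics might look like this (a sketch; the name `compute_avx` and the `target("avx")` attribute are mine — the attribute merely lets the snippet compile without `-mavx`, and an AVX-capable CPU is still required at run time):

```c
#include <immintrin.h>

float a[512];
float b[512];
float c[512];

// Process 8 floats per iteration in 256-bit YMM registers, roughly
// what GCC's auto-vectorizer emits for the loop above.
__attribute__((target("avx")))
void compute_avx(void) {
    for (int i = 0; i < 512; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);
        __m256 vb = _mm256_loadu_ps(&b[i]);
        _mm256_storeu_ps(&c[i], _mm256_add_ps(va, vb));
    }
}
```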


Solution

  • Aliasing is the problem here. The compiler cannot know whether the memory regions pointed to by a, b and c overlap. Compilers will sometimes generate code that checks for overlap at run-time and chooses between a vectorized and a scalar loop. But here, with tiny arrays, the overhead of the check isn't worth it: it would make the function slower.1

    The restrict keyword is meant to address this issue. Here is an example working on all mainstream compilers:

    void compute(const float * a, const float * b, float * restrict c)
    {
        c[0] = a[0] + b[0];
        c[1] = a[1] + b[1];
        c[2] = a[2] + b[2];
        c[3] = a[3] + b[3];
    }
    

    Note restrict can also be applied to a and b, but this is not needed here, as pointed out by chtz in comments. For more information about that, please read this related post.


    The provided code with global arrays does generate SIMD assembly code on Godbolt. Note the GCC code is not unrolled but this is another problem (which can be addressed with directives like #pragma GCC unroll(4)).
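    For example, a sketch of the loop version with such an unroll hint (the pragma is GCC-specific; other compilers may warn about or ignore it):

```c
float a[512];
float b[512];
float c[512];

void compute(void) {
    // Ask GCC to unroll the loop 4 times; after vectorization this
    // can yield several vector adds per iteration instead of one.
    #pragma GCC unroll 4
    for (int i = 0; i < 512; i++)
        c[i] = a[i] + b[i];
}
```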


    Footnote 1: Actually, when written as a loop instead of manually unrolled, GCC 14 is very eager to vectorize with an overlap check, even for only 4 floats (1 vector) and even with just -O2 -ftree-vectorize. Godbolt. Clang 19's threshold is 25 floats at -O2, but unfortunately at -O3 it's willing to fully unroll scalar code up to 49 floats, only check-and-vectorizing at 50 with the default -mtune=generic. So GCC's cost/benefit heuristics must weigh loop vectorization much more heavily than combining loose statements into a vector operation, even for very small loop counts.

    Realistically it's probably only worth it to check for overlap and vectorize with maybe 12 or 16 elements (assuming that non-overlap is the typical case and predicts well), if the code can't or didn't use restrict to promise non-overlap. Unless later code will do vector loads from the result, which could lead to store-forwarding stalls if we did scalar stores, giving more benefits to vectorizing even 4 or 8 elements.