c, gcc, x86, intrinsics, avx

Is this a gcc bug? Function returns 0 when looping an int* over elements of a __m256i


When compiled with gcc the function sum_of_squares returns 0. Am I doing anything wrong or is this a gcc bug? I know that I am not handling cases when n is not divisible by 8.

#include <stdio.h>
#include <x86intrin.h>

int sum_of_squares(int x[], int n) {
    int sum = 0;
    __m256i sum8 = _mm256_set1_epi32(0);
    for (int i = 0; i < n; i += 8) {
        __m256i x8 = _mm256_load_si256((__m256i *)&x[i]);
        x8 = _mm256_mul_epi32(x8, x8);
        sum8 = _mm256_add_epi32(x8, sum8);
    }

    int *_sum = (int *)&sum8;
    for (int i = 0; i < 8; i++) sum += _sum[i];
    return sum;
}

int main() {
    _Alignas(32) int x[16];
    for (int i = 0; i < 15; i++) {
        x[i] = i;
    }
    printf("%d", sum_of_squares(x, 16));
}

Solution

  • Pointing an int* at a __m256i (a GNU C vector of long long elements) violates the strict-aliasing rule. GCC 14.2 at -O2 optimizes your function to return 0; because of this (xor eax,eax / ret).

    We can verify this was the problem by using -fno-strict-aliasing and seeing non-zero results. (See both on Godbolt).
    MSVC always allows all aliasing, and code like this doesn't always break on GCC or Clang either, so some people get the wrong idea that code like yours is safe. It isn't, as this example demonstrates.

    Unfortunately -fsanitize=undefined doesn't catch this error (or uninitialized last elem of x[]).
    But in general, when you check the asm and see your function optimized away to return 0;, that often means you either returned the wrong variable or there was compile-time-visible UB. (Sometimes compilers don't even set the return value, or even omit the ret instruction so execution falls into whatever comes next.)

    Use _mm256_storeu_si256 to store to an int tmparr[8]. (Or align the tmp array and use _mm256_store_si256.)
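    For example, a minimal sketch of that fix, replacing the pointer cast in your reduction (tmparr is just a local name):

        int tmparr[8];
        _mm256_storeu_si256((__m256i *)tmparr, sum8);   // safe: __m256i* is declared may_alias
        for (int i = 0; i < 8; i++) sum += tmparr[i];   // read back as plain int, no strict-aliasing violation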

    Or better, see Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2 (and in general Fastest way to do horizontal SSE vector sum (or other reduction)) - shuffle the high half down to line up with the low half and do vertical SIMD adds, halving the number of elements each step until you're down to one. (GCC actually optimizes your reduction loop into a series of shuffles and adds, if you avoid the UB.)
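    As a sketch, the usual AVX2 shuffle-reduction idiom looks something like this (the helper name is mine; it assumes the same x86intrin.h include as your code):

        // Reduce 8 x int32 in a __m256i down to a single int with shuffles + vertical adds
        static inline int hsum_epi32_avx2(__m256i v) {
            __m128i lo = _mm256_castsi256_si128(v);           // low 128 bits (no instruction needed)
            __m128i hi = _mm256_extracti128_si256(v, 1);      // high 128 bits
            __m128i s  = _mm_add_epi32(lo, hi);               // 8 -> 4 elements
            s = _mm_add_epi32(s, _mm_unpackhi_epi64(s, s));   // 4 -> 2
            s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));  // 2 -> 1
            return _mm_cvtsi128_si32(s);                      // extract the low element
        }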


    Other unrelated bugs in your test case

    As Soonts mentioned, you also want vpmulld (_mm256_mullo_epi32), a non-widening 32-bit multiply, rather than vpmuldq (_mm256_mul_epi32), a widening signed multiply that only reads the even elements and produces 64-bit results. Your strict-aliasing bug is why you get zero; fixing this is also necessary to get the non-zero result you want, instead of just a sum of the even elements (0, 2, etc.).

    You should also init the 16th element (x[15]), which, as Soonts points out, your current code doesn't do (the init loop in main stops at i < 15). A corrected version combining these fixes is sketched below.
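    Putting the fixes together, one possible corrected sketch of the whole test case (still not handling n that isn't a multiple of 8, and built with something like gcc -O2 -mavx2):

        #include <stdio.h>
        #include <x86intrin.h>

        int sum_of_squares(const int x[], int n) {
            __m256i sum8 = _mm256_setzero_si256();
            for (int i = 0; i < n; i += 8) {
                __m256i x8 = _mm256_load_si256((const __m256i *)&x[i]);
                x8 = _mm256_mullo_epi32(x8, x8);            // non-widening 32-bit multiply (vpmulld)
                sum8 = _mm256_add_epi32(sum8, x8);
            }

            int tmparr[8];
            _mm256_storeu_si256((__m256i *)tmparr, sum8);   // safe: __m256i* may alias anything
            int sum = 0;
            for (int i = 0; i < 8; i++) sum += tmparr[i];
            return sum;
        }

        int main() {
            _Alignas(32) int x[16];
            for (int i = 0; i < 16; i++) x[i] = i;          // init all 16 elements this time
            printf("%d\n", sum_of_squares(x, 16));          // squares of 0..15 sum to 1240
        }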


    __attribute__((may_alias)) lets you typedef a version of int you can point at anything

    If you really wanted to, you could use typedef int32_t aliasing_i32 __attribute__((may_alias));, using that instead of int for your pointer type; you can safely point it at anything that's aligned by 4 or more. (Or at unaligned data if you use __attribute__((may_alias, aligned(1))).)
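    For instance, a minimal sketch of that approach in your reduction loop (the typedef name comes from the text above; it assumes <stdint.h> for int32_t):

        typedef int32_t aliasing_i32 __attribute__((may_alias));

        // ... after the SIMD loop in sum_of_squares:
        const aliasing_i32 *_sum = (const aliasing_i32 *)&sum8;  // OK: may_alias opts this type out of strict aliasing
        for (int i = 0; i < 8; i++) sum += _sum[i];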

    Fun fact: GCC and Clang's definitions of __m256i use may_alias, allowing you to point __m256i* at anything, but not vice versa.

    But there's no advantage to this over the tmp-array approach: the array doesn't cost any "extra" storage, since the __m256i vector optimizes into just a YMM register. And the reduction loop optimizes into shuffles anyway; but even if it didn't, either way of writing this would use 32 bytes of stack space that gets written with vmovdqa and read back by a scalar loop.