Tags: c, x86, simd, intrinsics, avx2

How to vectorise multiplication of an int8 array by an int16 constant, widening to int32 result array, in C (AVX2)


How do I vectorize this C function with AVX2?

#include <stdint.h>

static void propogate_neuron(const short a, const int8_t *b, int *c) {
    for (int i = 0; i < 32; ++i) {
        c[i] += a * b[i];
    }
}

(Related Q&A for int8 x int8 non-widening, producing a vector of int8.)


Solution

  • GCC already auto-vectorizes that with a check for overlap. Promising that there's no overlap by using int *restrict c lets GCC remove that check, and gets clang to decide to auto-vectorize.
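
    For example, the same loop with restrict-qualified pointers (restrict on c alone is enough to rule out overlap with b; the intrinsic version below marks both):

    #include <stdint.h>

    static void propogate_neuron(const short a, const int8_t *restrict b,
                                 int *restrict c) {
        for (int i = 0; i < 32; ++i) {
            c[i] += a * b[i];   // no runtime overlap check needed now
        }
    }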

    However, clang widens to 32-bit and uses vpmulld, which is 2 uops on Haswell and later Intel CPUs (although it's fully efficient on Zen). GCC uses vpmullw and vpmulhw to get the low and high halves of 16-bit full multiplies, and shuffles those together. (Godbolt) This is a pretty clunky strategy, especially with -march=znver2 where vpmulld is a single uop.
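
    To make the shuffle cost concrete, here's a rough hand-written intrinsics rendering of that vpmullw / vpmulhw + interleave idea (a sketch only, not GCC's literal codegen; the function name is made up, and the vperm2i128 fix-ups are needed because the 16-bit unpacks interleave within 128-bit lanes):

    #include <immintrin.h>
    #include <stdint.h>

    // Sketch: widen to 16-bit, do a full 16x16 multiply as low/high halves,
    // then shuffle the halves back together into 32-bit products.
    void propogate_neuron_mullw(const short a, const int8_t *restrict b, int *restrict c) {
        __m256i va16 = _mm256_set1_epi16(a);                  // a in every 16-bit element
        for (int i = 0; i < 32; i += 16) {
            __m256i vb  = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i*)&b[i]));  // vpmovsxbw
            __m256i lo  = _mm256_mullo_epi16(va16, vb);       // vpmullw: low 16 bits of each product
            __m256i hi  = _mm256_mulhi_epi16(va16, vb);       // vpmulhw: high 16 bits of each product
            __m256i p02 = _mm256_unpacklo_epi16(lo, hi);      // 32-bit products of elements 0-3 | 8-11
            __m256i p13 = _mm256_unpackhi_epi16(lo, hi);      // 32-bit products of elements 4-7 | 12-15
            __m256i p0  = _mm256_permute2x128_si256(p02, p13, 0x20);   // elements 0-7, in order
            __m256i p1  = _mm256_permute2x128_si256(p02, p13, 0x31);   // elements 8-15, in order
            _mm256_storeu_si256((__m256i*)&c[i],
                                _mm256_add_epi32(p0, _mm256_loadu_si256((const __m256i*)&c[i])));
            _mm256_storeu_si256((__m256i*)&c[i + 8],
                                _mm256_add_epi32(p1, _mm256_loadu_si256((const __m256i*)&c[i + 8])));
        }
    }

    Per 16 results, that's two unpacks plus two cross-lane fix-ups just to reassemble the products.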

    So GCC's strategy does get away with only four single-uop multiply instructions for the whole 32 elements, but it spends a lot of shuffles reassembling the 32-bit products. We can do better:


    Since we only need 8-bit x 16-bit => 32-bit multiplies, we can instead use vpmaddwd, which is a single uop on Haswell/Skylake as well as Zen (https://uops.info/table.html).

    Unfortunately we can't take advantage of the horizontal-add part, since we need to add each product to a full 32-bit value from c[]. To use vpmaddwd as just a 16x16 => 32-bit multiply within each 32-bit element, we need zeros in the high 16-bit half of each 32-bit element of one input; that's what the (uint16_t) zero-extension of the broadcast constant provides below.

    #include <immintrin.h>
    #include <stdint.h>

    void propogate_neuron_avx2(const short a, const int8_t *restrict b, int *restrict c) {
        __m256i va = _mm256_set1_epi32((uint16_t)a);   // [..., 0, a, 0, a] 16-bit elements

        for (int i = 0; i < 32; i += 8) {
            __m256i vb = _mm256_cvtepi8_epi32(_mm_loadl_epi64((const __m128i*)&b[i]));  // sign-extend 8 x int8 to int32
            __m256i prod = _mm256_madd_epi16(va, vb);  // 16x16 => 32-bit multiply; the zeroed high halves of va make the add part a no-op
            __m256i sum = _mm256_add_epi32(prod, _mm256_loadu_si256((const __m256i*)&c[i]));
            _mm256_storeu_si256((__m256i*)&c[i], sum);
        }
    }
    

    Godbolt:

    # clang13.0 -O3 -march=haswell
            movzx   eax, di
            vmovd   xmm0, eax                     # 0:a  16-bit halves
            vpbroadcastd    ymm0, xmm0            # repeated to every element
    
            vpmovsxbd       ymm1, qword ptr [rsi]  # xx:b 16-bit halves
            vpmaddwd        ymm1, ymm0, ymm1       # 0 + a*b in each 32-bit element
            vpaddd  ymm1, ymm1, ymmword ptr [rdx]
            vmovdqu ymmword ptr [rdx], ymm1
    
    ... repeated 3 more times, 8 elements per vector
    
            vpmovsxbd       ymm1, qword ptr [rsi + 8]
            vpmaddwd        ymm1, ymm0, ymm1
            vpaddd  ymm1, ymm1, ymmword ptr [rdx + 32]
            vmovdqu ymmword ptr [rdx + 32], ymm1
    

    If saving a uop per vector multiply makes a measurable performance difference, it might be worth the trouble of manually vectorizing in the source.
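
    For a quick sanity check, something like the following (a hypothetical test harness, not part of the original answer; it assumes the scalar propogate_neuron and propogate_neuron_avx2 are defined in the same file) can confirm the two versions agree. Build with e.g. gcc -O2 -mavx2.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        int8_t b[32];
        int c_scalar[32], c_simd[32];
        for (int i = 0; i < 32; ++i) {
            b[i] = (int8_t)(i * 7 - 100);        // arbitrary values, including negatives
            c_scalar[i] = c_simd[i] = i * 1000;  // both versions accumulate into the same starting data
        }
        const short a = -1234;
        propogate_neuron(a, b, c_scalar);        // scalar reference from the question
        propogate_neuron_avx2(a, b, c_simd);     // manually vectorized version
        puts(memcmp(c_scalar, c_simd, sizeof(c_scalar)) == 0 ? "match" : "MISMATCH");
        return 0;
    }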

    It's a missed optimization that GCC / clang don't do this in the first place when auto-vectorizing your pure C code.

    If anyone wants to report this, leave a comment here; otherwise I might get around to it. IDK if patterns like this are frequent enough for GCC / LLVM's optimizers to want to look for them. Especially since clang already makes a reasonable choice that's only sub-optimal because of CPU quirks: 32x32 => 32-bit SIMD multiplication costs more on recent Intel microarchitectures than 2x 16x16 => 32-bit multiplies with a horizontal add (i.e. vpmaddwd).