Tags: c, x86, simd, intrinsics, avx2

How to vectorise multiplication of an int8 array by an int16 constant, widening to int32 result array, in C (AVX2)


How do I vectorize this C function with AVX2?

#include <stdint.h>

static void propogate_neuron(const short a, const int8_t *b, int *c) {
    for (int i = 0; i < 32; ++i) {
        c[i] += a * b[i];
    }
}

(Related Q&A for int8 x int8 non-widening, producing a vector of int8.)


Solution

  • GCC already auto-vectorizes that with a check for overlap. Promising that there's no overlap by using int *restrict c lets GCC remove that check, and gets clang to decide to auto-vectorize.
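
    For example, the same loop with restrict-qualified pointers (restrict on c alone is enough to rule out overlap with b; the intrinsic version below marks both):

    #include <stdint.h>

    static void propogate_neuron(const short a, const int8_t *restrict b,
                                 int *restrict c) {
        for (int i = 0; i < 32; ++i) {
            c[i] += a * b[i];   // no runtime overlap check needed now
        }
    }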

    However, clang widens to 32-bit and uses vpmulld, which is 2 uops on Haswell and later Intel CPUs (although it's fully efficient on Zen). GCC uses vpmullw and vpmulhw to get the low and high halves of 16-bit full multiplies, and shuffles those together. (Godbolt) This is a pretty clunky strategy, especially with -march=znver2 where vpmulld is a single uop.
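
    To make the shuffle cost concrete, here's a rough hand-written intrinsics rendering of that vpmullw / vpmulhw + interleave idea (a sketch only, not GCC's literal codegen; the function name is made up, and the vperm2i128 fix-ups are needed because the 16-bit unpacks interleave within 128-bit lanes):

    #include <immintrin.h>
    #include <stdint.h>

    // Sketch: widen to 16-bit, do a full 16x16 multiply as low/high halves,
    // then shuffle the halves back together into 32-bit products.
    void propogate_neuron_mullw(const short a, const int8_t *restrict b, int *restrict c) {
        __m256i va16 = _mm256_set1_epi16(a);                  // a in every 16-bit element
        for (int i = 0; i < 32; i += 16) {
            __m256i vb  = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i*)&b[i]));  // vpmovsxbw
            __m256i lo  = _mm256_mullo_epi16(va16, vb);       // vpmullw: low 16 bits of each product
            __m256i hi  = _mm256_mulhi_epi16(va16, vb);       // vpmulhw: high 16 bits of each product
            __m256i p02 = _mm256_unpacklo_epi16(lo, hi);      // 32-bit products of elements 0-3 | 8-11
            __m256i p13 = _mm256_unpackhi_epi16(lo, hi);      // 32-bit products of elements 4-7 | 12-15
            __m256i p0  = _mm256_permute2x128_si256(p02, p13, 0x20);   // elements 0-7, in order
            __m256i p1  = _mm256_permute2x128_si256(p02, p13, 0x31);   // elements 8-15, in order
            _mm256_storeu_si256((__m256i*)&c[i],
                                _mm256_add_epi32(p0, _mm256_loadu_si256((const __m256i*)&c[i])));
            _mm256_storeu_si256((__m256i*)&c[i + 8],
                                _mm256_add_epi32(p1, _mm256_loadu_si256((const __m256i*)&c[i + 8])));
        }
    }

    Per 16 results, that's two unpacks plus two cross-lane fix-ups just to reassemble the products.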

    So GCC's strategy does get away with only four single-uop multiply instructions for the whole 32 elements, but it spends a lot of shuffles reassembling the 32-bit products. We can do better:


    Since we only need 8-bit x 16-bit => 32-bit multiplies, we can instead use vpmaddwd, which is a single uop on Haswell/Skylake as well as Zen (https://uops.info/table.html).

    Unfortunately we can't take advantage of the horizontal-add part, since we need to add each product to a full 32-bit value from c[]. To use vpmaddwd as just a 16x16 => 32-bit multiply within each 32-bit element, we need zeros in the high 16-bit half of each 32-bit element of one input; that's what the (uint16_t) zero-extension of the broadcast constant provides below.

    #include <immintrin.h>
    #include <stdint.h>

    void propogate_neuron_avx2(const short a, const int8_t *restrict b, int *restrict c) {
        __m256i va = _mm256_set1_epi32((uint16_t)a);   // [..., 0, a, 0, a] 16-bit elements

        for (int i = 0; i < 32; i += 8) {
            __m256i vb = _mm256_cvtepi8_epi32(_mm_loadl_epi64((const __m128i*)&b[i]));  // sign-extend 8 x int8 to int32
            __m256i prod = _mm256_madd_epi16(va, vb);  // 16x16 => 32-bit multiply; the zeroed high halves of va make the add part a no-op
            __m256i sum = _mm256_add_epi32(prod, _mm256_loadu_si256((const __m256i*)&c[i]));
            _mm256_storeu_si256((__m256i*)&c[i], sum);
        }
    }
    

    Godbolt:

    # clang13.0 -O3 -march=haswell
            movzx   eax, di
            vmovd   xmm0, eax                     # 0:a  16-bit halves
            vpbroadcastd    ymm0, xmm0            # repeated to every element
    
            vpmovsxbd       ymm1, qword ptr [rsi]  # xx:b 16-bit halves
            vpmaddwd        ymm1, ymm0, ymm1       # 0 + a*b in each 32-bit element
            vpaddd  ymm1, ymm1, ymmword ptr [rdx]
            vmovdqu ymmword ptr [rdx], ymm1
    
    ... repeated 3 more times, 8 elements per vector
    
            vpmovsxbd       ymm1, qword ptr [rsi + 8]
            vpmaddwd        ymm1, ymm0, ymm1
            vpaddd  ymm1, ymm1, ymmword ptr [rdx + 32]
            vmovdqu ymmword ptr [rdx + 32], ymm1
    

    If saving a uop per vector multiply makes a measurable performance difference, it might be worth the trouble of manually vectorizing in the source.
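
    For a quick sanity check, something like the following (a hypothetical test harness, not part of the original answer; it assumes the scalar propogate_neuron and propogate_neuron_avx2 are defined in the same file) can confirm the two versions agree. Build with e.g. gcc -O2 -mavx2.

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        int8_t b[32];
        int c_scalar[32], c_simd[32];
        for (int i = 0; i < 32; ++i) {
            b[i] = (int8_t)(i * 7 - 100);        // arbitrary values, including negatives
            c_scalar[i] = c_simd[i] = i * 1000;  // both versions accumulate into the same starting data
        }
        const short a = -1234;
        propogate_neuron(a, b, c_scalar);        // scalar reference from the question
        propogate_neuron_avx2(a, b, c_simd);     // manually vectorized version
        puts(memcmp(c_scalar, c_simd, sizeof(c_scalar)) == 0 ? "match" : "MISMATCH");
        return 0;
    }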

    It's a missed optimization that GCC / clang don't do this in the first place when auto-vectorizing your pure C code.

    If anyone wants to report this, leave a comment here; otherwise I might get around to it. IDK if patterns like this are frequent enough for GCC / LLVM's optimizers to want to look for them. Especially since clang already makes a reasonable choice that's only sub-optimal because of CPU quirks: 32x32 => 32-bit SIMD multiplication costs more on recent Intel microarchitectures than 2x 16x16 => 32-bit multiplies with a horizontal add (i.e. vpmaddwd).