How do I vectorize this C function with AVX2?
#include <stdint.h>

static void propogate_neuron(const short a, const int8_t *b, int *c) {
    for (int i = 0; i < 32; ++i) {
        c[i] += a * b[i];
    }
}
(Related Q&A: int8 x int8 non-widening multiply, producing a vector of int8 results.)
GCC already auto-vectorizes that, with a check for overlap. Promising that there's no overlap by using int *restrict c lets GCC remove that check, and gets clang to decide to auto-vectorize as well.
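For reference, the auto-vectorizable version is just the original function with that one qualifier added (same loop, only the aliasing promise changes):

#include <stdint.h>

// restrict on c promises it doesn't overlap b, so GCC can drop its
// runtime overlap check and clang decides the loop is worth vectorizing.
static void propogate_neuron(const short a, const int8_t *b, int *restrict c) {
    for (int i = 0; i < 32; ++i) {
        c[i] += a * b[i];
    }
}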
However, clang widens to 32-bit and uses vpmulld, which is 2 uops on Haswell and later. (Although it's fully efficient on Zen.) GCC uses vpmullw and vpmulhw to get the low and high halves of 16-bit full multiplies, and shuffles those together. (Godbolt) That's a pretty clunky strategy, especially with -march=znver2 where vpmulld is single-uop. GCC's version does only use four single-uop multiply instructions, but it spends a lot of shuffles to achieve that. We can do better:
Since we only need 8x16 => 32-bit multiplies, we can instead use vpmaddwd, which is single-uop on Haswell/Skylake as well as Zen (https://uops.info/table.html).
Unfortunately we can't take advantage of the add part since we need to add to a full 32-bit value. We need zeros in the high half of every pair of 16-bit elements to use it as just a 16x16 => 32-bit multiply within each 32-bit element.
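To make the trick concrete, here's a scalar model of what vpmaddwd computes in each 32-bit lane. This is purely illustrative; madd_epi16_lane is a made-up helper for this answer, not an intrinsic:

#include <stdint.h>

// Per 32-bit lane, vpmaddwd computes (a_lo * b_lo) + (a_hi * b_hi),
// each product being a 16x16 => 32-bit signed multiply.
// Zeroing one operand's high half in every lane kills the second term,
// leaving a plain 16x16 => 32-bit multiply per 32-bit element.
static int32_t madd_epi16_lane(int16_t a_lo, int16_t a_hi,
                               int16_t b_lo, int16_t b_hi) {
    return (int32_t)a_lo * b_lo + (int32_t)a_hi * b_hi;
}

With zeros broadcast into the high halves, the manually vectorized version looks like this: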
#include <immintrin.h>
#include <stdint.h>

void propogate_neuron_avx2(const short a, const int8_t *restrict b, int *restrict c) {
    __m256i va = _mm256_set1_epi32( (uint16_t)a );   // [..., 0, a, 0, a] 16-bit elements
    for (int i = 0; i < 32; i += 8) {
        // sign-extend 8 int8 elements to 32-bit; each lane holds [sign, b[i+k]] as 16-bit halves
        __m256i vb = _mm256_cvtepi8_epi32( _mm_loadl_epi64((const __m128i*)&b[i]) );
        __m256i prod = _mm256_madd_epi16(va, vb);    // a*b_lo + 0*b_hi = 16x16 => 32-bit product
        __m256i sum = _mm256_add_epi32(prod, _mm256_loadu_si256((const __m256i*)&c[i]));
        _mm256_storeu_si256((__m256i*)&c[i], sum);
    }
}
# clang13.0 -O3 -march=haswell
movzx eax, di
vmovd xmm0, eax # 0:a 16-bit halves
vpbroadcastd ymm0, xmm0 # repeated to every element
vpmovsxbd ymm1, qword ptr [rsi] # xx:b 16-bit halves
vpmaddwd ymm1, ymm0, ymm1 # 0 + a*b in each 32-bit element
vpaddd ymm1, ymm1, ymmword ptr [rdx]
vmovdqu ymmword ptr [rdx], ymm1
... repeated 3 more times, 8 elements per vector
vpmovsxbd ymm1, qword ptr [rsi + 8]
vpmaddwd ymm1, ymm0, ymm1
vpaddd ymm1, ymm1, ymmword ptr [rdx + 32]
vmovdqu ymmword ptr [rdx + 32], ymm1
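Before benchmarking, a quick sanity check against the scalar loop is cheap insurance. A minimal driver might look like the sketch below (hypothetical test harness of my own, assuming it's compiled in the same file as propogate_neuron_avx2 above; the test values are arbitrary):

#include <stdio.h>
#include <string.h>
#include <stdint.h>

int main(void) {
    int8_t b[32];
    int c_ref[32], c_vec[32];
    for (int i = 0; i < 32; ++i) {
        b[i] = (int8_t)(i * 7 - 100);      // arbitrary test values, all within int8 range
        c_ref[i] = c_vec[i] = i * 1000;
    }
    const short a = -1234;
    for (int i = 0; i < 32; ++i)           // scalar reference, same as the original loop
        c_ref[i] += a * b[i];
    propogate_neuron_avx2(a, b, c_vec);
    puts(memcmp(c_ref, c_vec, sizeof c_ref) == 0 ? "match" : "MISMATCH");
    return 0;
}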
If saving a uop per vector multiply makes a measurable performance difference, it might be worth the trouble of manually vectorizing in the source.
It's a missed optimization that GCC / clang don't do this in the first place when auto-vectorizing your pure C code.
If anyone wants to report this missed optimization, leave a comment here; otherwise I might get around to it. IDK if patterns like this are frequent enough for GCC / LLVM's optimizers to want to look for. Especially since clang already makes a reasonable choice that's only sub-optimal because of CPU quirks: 32x32 => 32-bit SIMD multiplication costs more on recent Intel microarchitectures than 2x 16x16 => 32-bit with horizontal add.