I'm learning how to use SIMD intrinsics and autovectorization. Luckily, I have a useful project I'm working on that seems extremely amenable to SIMD, but is still tricky for a newbie like me.
I'm writing a filter for images that computes the average of 2x2 pixels. I'm doing part of the computation by accumulating the sum of two pixels into a single pixel.
template <typename T, typename U>
inline void accumulate_2x2_x_pass(
    T* channel, U* accum,
    const size_t sx, const size_t sy,
    const size_t osx, const size_t osy,
    const size_t yoff, const size_t oyoff
) {
    const bool odd_x = (sx & 0x01);
    size_t i_idx, o_idx;
    // Should be vectorizable somehow...
    for (size_t x = 0, ox = 0; x < sx - (size_t)odd_x; x += 2, ox++) {
        i_idx = x + yoff;
        o_idx = ox + oyoff;
        accum[o_idx] += channel[i_idx];
        accum[o_idx] += channel[i_idx + 1];
    }
    if (odd_x) {
        // << 1 bc we need to multiply by two on the edge
        // to avoid darkening during render
        accum[(osx - 1) + oyoff] += (U)(channel[(sx - 1) + yoff]) * 2;
    }
}
However, godbolt shows that my loop is not autovectorizable. (https://godbolt.org/z/qZxvof) How would I construct SIMD intrinsics to solve this issue? I have control of the alignment for accum, but not channel.
(I know there's an average intrinsic, but it's not appropriate here because I need to generate multiple mip levels and that command would cause loss of precision for the next level.)
Thanks everyone. :)
The widening case with the narrow type T = uint8_t or uint16_t is probably best implemented with SSSE3 pmaddubsw or SSE2 pmaddwd with a multiplier of 1. (Intrinsics guide) Those instructions are single-uop and do exactly the horizontal widening add you need, more efficiently than shuffling.
If you're horizontally summing across 8 or more u8 elements, use psadbw (sum of absolute differences) against a zeroed vector: _mm_sad_epu8 gives you sums at the bottom of each 64-bit element. (A good first step if hsumming a whole vector, or summing a whole u8 array without overflow.)
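For example, a minimal untested sketch of hsumming one whole vector of u8 that way (the helper name is mine):

#include <immintrin.h>
#include <stdint.h>

// UNTESTED: horizontal sum of 16 u8 elements using psadbw against a zeroed vector
uint32_t hsum_u8_16(__m128i v) {
    __m128i sums = _mm_sad_epu8(v, _mm_setzero_si128()); // byte-sums at the bottom of each 64-bit half
    __m128i hi   = _mm_unpackhi_epi64(sums, sums);       // bring the high half's sum down
    return (uint32_t)_mm_cvtsi128_si32(_mm_add_epi32(sums, hi));
}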
If you can do so without losing precision, do the vertical add between rows first, before the widening horizontal add. (e.g. 10-, 12-, or 14-bit pixel components in [u]int16_t can't overflow.) Load and vertical-add have 2-per-clock throughput (or more) on most CPUs, vs. 1 per clock for pmadd*, which only gains 2-per-clock throughput on Skylake and later. And it means you only need 1x add + 1x pmadd vs. 2x pmadd + 1x add, so it's a significant win even on Skylake. (For the 2x-pmadd way, both loads can fold into memory operands for pmadd, if you have AVX. For the add-before-pmadd way, you'll need a pure load first and then fold the 2nd load into the add, so you might not save front-end uops, unless you use indexed addressing modes and they un-laminate.)
And ideally you don't need to += into an accumulator array; instead you can just read 2 rows in parallel and keep the accumulator write-only, so your loop has only 2 input streams and 1 output stream.
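For example, here's an untested sketch of that loop shape for T = U = uint16_t with pixel components of 14 bits or fewer (so the vertical sum can't overflow int16_t for pmaddwd). The function and parameter names (row0, row1, out) are mine, and it assumes sx is a multiple of 8:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// UNTESTED: vertical add between two rows first, then one widening horizontal add,
// writing to an output row instead of += into an accumulator.
void accumulate_2x2_u16(const uint16_t* row0, const uint16_t* row1,
                        uint32_t* out, size_t sx)   // assumes sx % 8 == 0
{
    const __m128i ones = _mm_set1_epi16(1);
    for (size_t x = 0, ox = 0; x < sx; x += 8, ox += 4) {
        __m128i a = _mm_loadu_si128((const __m128i*)(row0 + x));
        __m128i b = _mm_loadu_si128((const __m128i*)(row1 + x));
        __m128i v = _mm_add_epi16(a, b);             // vertical add: 2 rows
        __m128i h = _mm_madd_epi16(v, ones);         // horizontal pair-sum, widened to 32-bit
        _mm_storeu_si128((__m128i*)(out + ox), h);   // write-only output stream
    }
}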
// SSSE3
__m128i hadd_widen8_to_16(__m128i a) {
    // uint8_t, int8_t (doesn't matter when the multiplier is +1)
    return _mm_maddubs_epi16(a, _mm_set1_epi8(1));
}
// SSE2
__m128i hadd_widen16_to_32(__m128i a) {
    // int16_t, int16_t
    return _mm_madd_epi16(a, _mm_set1_epi16(1));
}
These port to 256-bit AVX2 directly, because the input and output width is the same. No shuffle needed to fix up in-lane packing.
Yes really, they're both _epi16. Intel can be wildly inconsistent with intrinsic names. The asm mnemonics are more consistent and easier to remember what's what. (ubsw = unsigned byte to signed word, except that one of the inputs is signed bytes. pmaddwd is packed multiply add word to dword, same naming scheme as punpcklwd etc.)
The T=U case with uint16_t or uint32_t is a use-case for SSSE3 _mm_hadd_epi16 or _mm_hadd_epi32. It costs the same as 2 shuffles + a vertical add, but you need that anyway to pack 2 inputs to 1.
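For example (untested), the uint32_t version is just one intrinsic per vector of output:

#include <immintrin.h>

// UNTESTED: pairwise sums of two u32 vectors: { a0+a1, a2+a3, b0+b1, b2+b3 }
__m128i hadd_pairs_u32(__m128i a, __m128i b) {
    return _mm_hadd_epi32(a, b);   // SSSE3 phaddd: 2 shuffle uops + 1 add uop on Intel
}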
If you want to work around a shuffle-port bottleneck on Haswell and later, you could consider using qword shifts on the inputs and then shuffling together the result with shufps (_mm_shuffle_ps + some casting). This could possibly be a win on Skylake (with 2-per-clock shift throughput), even though it costs more: 5 total uops instead of 3. It can run at best 5/3 cycles per vector of output instead of 2 cycles per vector, if there's no front-end bottleneck.
// UNTESTED
// Only any good with AVX, otherwise the extra movdqa instructions cost a lot of
// front-end throughput and ROB capacity.
// Only worth considering for Skylake, not Haswell (1/c shifts) or Sandybridge (2/c shuffle)
__m128i hadd32_emulated(__m128i a, __m128i b) {
    __m128i a_shift = _mm_srli_epi64(a, 32);
    __m128i b_shift = _mm_srli_epi64(b, 32);
    a = _mm_add_epi32(a, a_shift);   // sums in the low dword of each qword
    b = _mm_add_epi32(b, b_shift);
    __m128 combined = _mm_shuffle_ps(_mm_castsi128_ps(a), _mm_castsi128_ps(b),
                                     _MM_SHUFFLE(2,0,2,0));   // pick the even (sum) elements
    return _mm_castps_si128(combined);
}
Ice Lake and later Intel, and AMD, have better than 1/clock throughput for some shuffles but not all. For example, vphaddd is 2 uops for p1 or p5 on Ice Lake or later, plus 1 uop for any of p015. But vhaddps's 2 shuffle uops can only run on port 5, even though you can precisely emulate vhaddps with 2x vshufps (with _MM_SHUFFLE(2,0,2,0) for evens and _MM_SHUFFLE(3,1,3,1) for odds, as shown in another answer), and vshufps runs on p1/p5 (unlike vunpcklps, for some reason). And AMD CPUs run vhaddps less efficiently than you'd expect as well. So if you have a lot of pairwise summing of FP data, you might consider manually using shuffles that clang won't "optimize" into vhaddps.
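For reference, an untested sketch of that vshufps emulation (check the generated asm, since clang may pattern-match this back into vhaddps):

#include <immintrin.h>

// UNTESTED: same per-lane result as _mm256_hadd_ps(a, b), but the two shuffle uops
// can run on p1 or p5 instead of only p5.
__m256 hadd_ps_emulated(__m256 a, __m256 b) {
    __m256 evens = _mm256_shuffle_ps(a, b, _MM_SHUFFLE(2,0,2,0));  // a0,a2,b0,b2 in each lane
    __m256 odds  = _mm256_shuffle_ps(a, b, _MM_SHUFFLE(3,1,3,1));  // a1,a3,b1,b3 in each lane
    return _mm256_add_ps(evens, odds);
}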
But for integer vphaddd on Ice Lake / Alder Lake and Zen 3 / Zen 4, things seem ok. At least with YMM vectors; Zen 3/4 at least take an extra uop for vphaddd xmm, xmm, xmm according to https://uops.info/ measurements.
For an AVX2 version you'd need a lane-crossing shuffle to fixup a vphadd result. So emulating hadd with shifts might be a bigger win.
// 3x shuffle uops + 1x add uop
__m256i hadd32_avx2(__m256i a, __m256i b) {
    __m256i hadd = _mm256_hadd_epi32(a, b);   // 2x in-lane hadd
    return _mm256_permute4x64_epi64( hadd, _MM_SHUFFLE(3,1,2,0) );   // fix up the in-lane packing
}
// UNTESTED
// 2x shift, 2x add, 1x blend-immediate (any ALU port), 1x shuffle
__m256i hadd32_emulated_avx2(__m256i a, __m256i b)
{
    __m256i a_shift = _mm256_srli_epi64(a, 32);   // useful result in the low half of each qword
    __m256i b_shift = _mm256_slli_epi64(b, 32);   // ... high half of each qword
    a = _mm256_add_epi32(a, a_shift);
    b = _mm256_add_epi32(b, b_shift);
    __m256i blended = _mm256_blend_epi32(a, b, 0b10101010);   // alternating low/high results
    return _mm256_permutevar8x32_epi32(blended, _mm256_set_epi32(7,5,3,1, 6,4,2,0));
}
On Haswell and Skylake, hadd32_emulated_avx2 can run at 1 per 2 clocks (saturating all vector ALU ports). The extra add_epi32 to sum into accum[] will slow it down to at best 7/3 cycles per 256-bit vector of results, and you'll need to unroll (or use a compiler that unrolls) to not just bottleneck on the front-end.
hadd32_avx2 can run at 1 per 3 clocks (bottlenecked on port 5 for shuffles). The load + store + extra add_epi32 uops to implement your loop can run in the shadow of that easily.
(https://agner.org/optimize/, https://uops.info/, and see https://stackoverflow.com/tags/x86/info)
AVX-512 vpermt2d could let us skip the blend step, or we could do it with merge-masking. And/or we could vprolq to swap the 32-bit halves within 64-bit elements to get the same sum in two positions, but we'd still need a merge-masking add to replace the blend. vpermt2d is probably best; I think we can't avoid a shuffle-control vector, and that lets us avoid setting up any merge-masking constants.
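Something like this untested sketch, widened to 512-bit vectors (the helper name and the index constant are mine):

#include <immintrin.h>

// UNTESTED: 2x shift, 2x add, 1x vpermt2d; no blend needed.
__m512i hadd32_avx512(__m512i a, __m512i b) {
    __m512i a_shift = _mm512_srli_epi64(a, 32);
    __m512i b_shift = _mm512_srli_epi64(b, 32);
    a = _mm512_add_epi32(a, a_shift);   // sums in the low dword of each qword
    b = _mm512_add_epi32(b, b_shift);
    // indices >= 16 select from b: grab the even (sum-holding) dwords of a, then of b
    const __m512i idx = _mm512_set_epi32(30,28,26,24, 22,20,18,16,
                                         14,12,10,8,  6,4,2,0);
    return _mm512_permutex2var_epi32(a, idx, b);   // vpermt2d
}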