I have an AVX2 vector (`__m256i`) of 8 int32 values. It looks like this:

`[0,a,0,b,0,c,0,d]`
`a`, `b`, `c` and `d` are non-zero positive int32 values. The other 4 elements in the vector are zero.
Now I want to have this 128-bit vector (`__m128i`):

`[a,b,c,d]`
So far I do this:

```c
__m128i y = _mm256_extracti128_si256(
    _mm256_permutevar8x32_epi32(x, _mm256_setr_epi32(0, 2, 4, 6, 1, 3, 5, 7)), 0);
```
This results in a `vpermd` instruction, which is quite an expensive one (source: https://www.agner.org/optimize/instruction_tables.pdf).
Isn't there a better (= faster) way to do this? I was hoping there was some `int64`-to-`int32` cast intrinsic, but I could not find one.
You need a lane-crossing shuffle for the data movement you want, and `vpermd` isn't terrible (3c latency, 1/clock throughput on Intel P-cores; better on Zen 4, worse throughput on Zen 3 and earlier). https://uops.info/ has similar numbers to Agner Fog's data, but from fully automated testing with more detail, like separate latencies from different inputs to the output(s).
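For reference, here's the question's single-vector approach as a function, with the lane-0 extract written as `_mm256_castsi256_si128` (a free cast that compiles to no instruction):

```c
#include <immintrin.h>

// One lane-crossing shuffle (vpermd) per input vector.
static inline __m128i pack1_vpermd(__m256i x)
{
    __m256i packed = _mm256_permutevar8x32_epi32(
        x, _mm256_setr_epi32(0, 2, 4, 6, 1, 3, 5, 7));
    return _mm256_castsi256_si128(packed);  // low 128 bits, no instruction needed
}
```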
The only way you could save anything is if you could process multiple vectors at once, e.g. `vpsllq` / `vpor` / `vpermd` to get a 256-bit vector of `abcd | ABCD` from two `__m256i` inputs. If the high halves of your input elements weren't known zero, `vpblendd` would be just as efficient as `vpor`, just slightly longer machine code including an immediate. A bit-shift (`_mm256_slli_epi64`) instead of a shuffle like `_mm256_bslli_epi128` can run on different ports that don't compete with shuffles.
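A sketch of that idea; the function name is mine, and it assumes both inputs have the question's layout with known-zero high halves in each 64-bit lane:

```c
#include <immintrin.h>

// Pack the values from two input vectors into one __m256i (abcd | ABCD):
// one vpermd amortized over two inputs instead of one vpermd per input.
static inline __m256i pack2_vpermd(__m256i x, __m256i y)
{
    // Shift y's values into the (known-zero) high 32 bits of each 64-bit lane.
    // vpsllq runs on ports that don't compete with shuffles.
    __m256i y_hi   = _mm256_slli_epi64(y, 32);
    // vpor merges; vpblendd would work even without known-zero high halves.
    __m256i merged = _mm256_or_si256(x, y_hi);
    // One lane-crossing shuffle sorts x's values into the low half,
    // y's values into the high half.
    return _mm256_permutevar8x32_epi32(
        merged, _mm256_setr_epi32(0, 2, 4, 6, 1, 3, 5, 7));
}
```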
(The 2-vectors-to-1 pack instructions like `vpackusdw` only go as wide as 32-bit source elements, and are in-lane anyway. If you had 32-bit inputs to pack to 16, you could use `vpackusdw` + `vpermq` for every 2 input vectors.)
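For that hypothetical 32-to-16-bit case, the standard idiom looks something like this (function name is mine; unsigned saturation assumed):

```c
#include <immintrin.h>

// Pack two vectors of eight int32 into one vector of sixteen uint16.
static inline __m256i pack_32_to_16(__m256i a, __m256i b)
{
    // vpackusdw interleaves per 128-bit lane: a0 b0 | a1 b1.
    __m256i packed = _mm256_packus_epi32(a, b);
    // vpermq fixes the lane order back to a0 a1 | b0 b1.
    return _mm256_permute4x64_epi64(packed, _MM_SHUFFLE(3, 1, 2, 0));
}
```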
With AVX-512, there's `vpmovqd` (with truncation, or alternate versions with signed or unsigned saturation): https://www.felixcloutier.com/x86/vpmovqd:vpmovsqd:vpmovusqd. That's still a lane-crossing shuffle; the only advantage is that it doesn't need a vector constant for the shuffle control.
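With AVX-512VL available, the whole operation becomes a single intrinsic. Truncation is enough here because the high halves are known zero:

```c
#include <immintrin.h>

// Requires AVX-512F + AVX-512VL. vpmovqd truncates each 64-bit lane to 32 bits.
static inline __m128i pack1_avx512(__m256i x)
{
    return _mm256_cvtepi64_epi32(x);
    // _mm256_cvtsepi64_epi32 / _mm256_cvtusepi64_epi32 for the saturating forms.
}
```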
With AVX-512, you can also interleave two source vectors into one of the same width with `vpermt2d` and a shuffle-control vector.
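A sketch of that for the question's layout (function name is mine):

```c
#include <immintrin.h>

// Requires AVX-512F + AVX-512VL. One vpermt2d packs the values from two
// input vectors into a single __m256i: abcd | ABCD.
static inline __m256i pack2_avx512(__m256i x, __m256i y)
{
    // In the control vector, indices 0-7 select from x, 8-15 from y.
    const __m256i idx = _mm256_setr_epi32(0, 2, 4, 6, 8, 10, 12, 14);
    return _mm256_permutex2var_epi32(x, idx, y);
}
```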
`vpermd ymm` isn't great on Intel E-cores (Gracemont), being 2 uops with 2c throughput (and 6c latency), but spending more instructions is probably not better even on Gracemont and/or Zen <= 3, unless the surrounding code also bottlenecks on shuffle throughput.
`vextracti128` + `vshufps` could get the job done more cheaply on Zen 1 and Bulldozer-family, where extracting the high 128 is very cheap because YMM registers are handled as 128-bit halves. But Gracemont, which does the same, still has 3c latency and 1/clock throughput for `vextracti128`, unfortunately. (`vshufps` on integer data is generally fine; some CPUs don't even have any extra bypass-forwarding latency when it's part of a dep chain between integer instructions like `vpaddd`.) But this would be worse on Intel P-cores: two shuffles, one of them (the extract) still being port-5 only. It might be break-even on Zen 2 and Zen 3, where `vpermd` is 2 uops.
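That alternative would look something like this (function name is mine; the casts are free and only keep the types straight):

```c
#include <immintrin.h>

// Two shuffles with immediate controls instead of one vpermd.
static inline __m128i pack1_shufps(__m256i x)
{
    __m128 lo = _mm_castsi128_ps(_mm256_castsi256_si128(x));      // free
    __m128 hi = _mm_castsi128_ps(_mm256_extracti128_si256(x, 1)); // vextracti128
    // Grab the even 32-bit elements (the values) from each half.
    return _mm_castps_si128(_mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0)));
}
```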
One upside would be needing only immediate shuffle controls, vs. having to load a control vector for `vpermd`. But that's not much of a problem if you do this in a loop where you can load the shuffle control once.