Intel's vector extensions SSE, AVX, etc. provide two unpack operations for each element size, e.g. the SSE intrinsics are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, the effect is:
inputs: (A0 A1 A2 A3) (B0 B1 B2 B3)
unpacklo/hi: (A0 B0 A1 B1) (A2 B2 A3 B3)
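With 32-bit elements, for example, that corresponds to (assuming <immintrin.h> and two __m128i inputs a and b):
__m128i lo = _mm_unpacklo_epi32(a, b);  // (A0 B0 A1 B1)
__m128i hi = _mm_unpackhi_epi32(a, b);  // (A2 B2 A3 B3)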
The equivalent of unpack in ARM's NEON instruction set is vzip. However, NEON also provides the operation vuzp, which is the inverse of vzip. For 4 elements in a vector, it does this:
inputs: (A0 A1 A2 A3) (B0 B1 B2 B3)
vuzp: (A0 A2 B0 B2) (A1 A3 B1 B3)
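With 32-bit elements this is, e.g., vuzpq_u32 from <arm_neon.h>, which returns both results at once:
uint32x4x2_t r = vuzpq_u32(a, b);
// r.val[0] = (A0 A2 B0 B2), r.val[1] = (A1 A3 B1 B3)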
How can vuzp be implemented efficiently using SSE or AVX intrinsics? There doesn't seem to be an instruction for it. For 4 elements, I assume it can be done using a shuffle and a subsequent unpack moving 2 elements:
inputs: (A0 A1 A2 A3) (B0 B1 B2 B3)
shuffle: (A0 A2 A1 A3) (B0 B2 B1 B3)
unpacklo/hi 2: (A0 A2 B0 B2) (A1 A3 B1 B3)
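As a rough sketch with SSE2 intrinsics (variable names just for illustration), that would be:
__m128i a_shuf = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0));  // (A0 A2 A1 A3)
__m128i b_shuf = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0));  // (B0 B2 B1 B3)
__m128i lo = _mm_unpacklo_epi64(a_shuf, b_shuf);              // (A0 A2 B0 B2)
__m128i hi = _mm_unpackhi_epi64(a_shuf, b_shuf);              // (A1 A3 B1 B3)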
Is there a more efficient solution using a single instruction? (Maybe for SSE first - I'm aware that for AVX we may have the additional problem that shuffle and unpack don't cross lanes.)
Knowing this may be useful for writing code for data swizzling and deswizzling (it should be possible to derive deswizzling code just by inverting the operations of swizzling code based on unpack operations).
Edit: Here is the 8-element version. This is the effect of NEON's vuzp:
input: (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
vuzp: (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)
This is my version, with one shuffle and one unpack for each output vector (it seems to generalize to larger element counts):
input: (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
shuffle: (A0 A2 A4 A6 A1 A3 A5 A7) (B0 B2 B4 B6 B1 B3 B5 B7)
unpacklo/hi 4: (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)
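For 16-bit elements, a sketch of this using SSSE3's pshufb for the in-vector shuffle (the byte mask and names are just illustrative):
const __m128i deint = _mm_setr_epi8(0,1, 4,5, 8,9, 12,13,  2,3, 6,7, 10,11, 14,15);
__m128i a_sorted = _mm_shuffle_epi8(a, deint);          // (A0 A2 A4 A6 A1 A3 A5 A7)
__m128i b_sorted = _mm_shuffle_epi8(b, deint);          // (B0 B2 B4 B6 B1 B3 B5 B7)
__m128i lo = _mm_unpacklo_epi64(a_sorted, b_sorted);    // (A0 A2 A4 A6 B0 B2 B4 B6)
__m128i hi = _mm_unpackhi_epi64(a_sorted, b_sorted);    // (A1 A3 A5 A7 B1 B3 B5 B7)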
The method suggested by EOF is correct but would require log2(8) = 3 unpack operations for each output:
input: (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
unpacklo/hi 1: (A0 B0 A1 B1 A2 B2 A3 B3) (A4 B4 A5 B5 A6 B6 A7 B7)
unpacklo/hi 1: (A0 A4 B0 B4 A1 A5 B1 B5) (A2 A6 B2 B6 A3 A7 B3 B7)
unpacklo/hi 1: (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)
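For illustration, with 16-bit elements that sequence would be (each pair of code lines corresponds to one step above):
__m128i t0_lo = _mm_unpacklo_epi16(a, b);          // (A0 B0 A1 B1 A2 B2 A3 B3)
__m128i t0_hi = _mm_unpackhi_epi16(a, b);          // (A4 B4 A5 B5 A6 B6 A7 B7)
__m128i t1_lo = _mm_unpacklo_epi16(t0_lo, t0_hi);  // (A0 A4 B0 B4 A1 A5 B1 B5)
__m128i t1_hi = _mm_unpackhi_epi16(t0_lo, t0_hi);  // (A2 A6 B2 B6 A3 A7 B3 B7)
__m128i lo    = _mm_unpacklo_epi16(t1_lo, t1_hi);  // (A0 A2 A4 A6 B0 B2 B4 B6)
__m128i hi    = _mm_unpackhi_epi16(t1_lo, t1_hi);  // (A1 A3 A5 A7 B1 B3 B5 B7)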
"it should be possible to derive deswizzling code just by inverting the operations"
Get used to being disappointed and frustrated by the non-orthogonality of Intel's vector shuffles. There is no direct inverse for punpck. The SSE/AVX pack instructions are for narrowing the element size: one packusdw is the inverse of punpck[lh]wd against zero, but not when used with two arbitrary vectors. Also, pack instructions are only available for 32->16 (dword to word) and 16->8 (word to byte) element sizes; there is no packusqd (64->32).
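To illustrate that inverse-against-zero relationship, a sketch (v holds eight 16-bit values, <immintrin.h> included):
__m128i zero = _mm_setzero_si128();
__m128i lo32 = _mm_unpacklo_epi16(v, zero);   // zero-extend v0..v3 to 32 bits
__m128i hi32 = _mm_unpackhi_epi16(v, zero);   // zero-extend v4..v7 to 32 bits
__m128i back = _mm_packus_epi32(lo32, hi32);  // packusdw (SSE4.1): == v, since every value fits in 0..65535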
PACK instructions are only available with saturation, not truncation (until AVX512 vpmovqd), so for this use-case we'd need to prepare 4 different input vectors for 2 PACK instructions. This turns out to be horrible, much worse than your shuffle + unpack solution (see unzip32_pack() below).
There is a 2-input shuffle that will do what you want for 32-bit elements, though: shufps. The low 2 elements of the result can be any 2 elements of the first vector, and the high 2 elements can be any 2 elements of the second vector. The shuffle we want fits those constraints, so we can use it.
We can solve the whole problem in 2 instructions (plus a movdqa for the non-AVX version, because shufps destroys the left input register):
inputs: a=(A0 A1 A2 A3) b=(B0 B1 B2 B3)
_mm_shuffle_ps(a,b,_MM_SHUFFLE(2,0,2,0)); // (A0 A2 B0 B2)
_mm_shuffle_ps(a,b,_MM_SHUFFLE(3,1,3,1)); // (A1 A3 B1 B3)
_MM_SHUFFLE() uses most-significant-element-first notation, like all of Intel's documentation. Your notation is the opposite.
The only intrinsic for shufps uses __m128 / __m256 vectors (float, not integer), so you have to cast to use it. _mm_castsi128_ps is a reinterpret_cast: it compiles to zero instructions.
#include <immintrin.h>
static inline
__m128i unziplo(__m128i a, __m128i b) {
__m128 aps = _mm_castsi128_ps(a);
__m128 bps = _mm_castsi128_ps(b);
__m128 lo = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(2,0,2,0));
return _mm_castps_si128(lo);
}
static inline
__m128i unziphi(__m128i a, __m128i b) {
__m128 aps = _mm_castsi128_ps(a);
__m128 bps = _mm_castsi128_ps(b);
__m128 hi = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(3,1,3,1));
return _mm_castps_si128(hi);
}
gcc will inline these to a single instruction each. With the static inline removed, we can see how they'd compile as non-inline functions; I put them on the Godbolt compiler explorer:
unziplo(long long __vector(2), long long __vector(2)):
shufps xmm0, xmm1, 136
ret
unziphi(long long __vector(2), long long __vector(2)):
shufps xmm0, xmm1, 221
ret
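As a quick sanity check of the swizzle/deswizzle point from the question, unpack followed by these unzip functions round-trips the data (illustrative values):
__m128i a = _mm_setr_epi32(0, 1, 2, 3);      // (A0 A1 A2 A3)
__m128i b = _mm_setr_epi32(10, 11, 12, 13);  // (B0 B1 B2 B3)
__m128i lo = _mm_unpacklo_epi32(a, b);       // (A0 B0 A1 B1)
__m128i hi = _mm_unpackhi_epi32(a, b);       // (A2 B2 A3 B3)
__m128i a2 = unziplo(lo, hi);                // == a
__m128i b2 = unziphi(lo, hi);                // == b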
Using FP shuffles on integer data is fine on recent Intel/AMD CPUs. There is no extra bypass-delay latency (see this answer, which summarizes what Agner Fog's microarch guide says about it). There is extra latency on Intel Nehalem, but it may still be the best choice there. FP loads/shuffles won't fault on or corrupt integer bit-patterns that happen to represent a NaN; only actual FP math instructions care about that.
Fun fact: on AMD Bulldozer-family CPUs (and Intel Core 2), FP shuffles like shufps still run in the ivec domain, so they actually have extra latency when used between FP instructions, but not between integer instructions!
Unlike ARM NEON / ARMv8 SIMD, x86 SSE doesn't have any 2-output-register instructions, and they're rare in x86 in general (they exist, e.g. mul r64, but always decode to multiple uops on current CPUs).
It's always going to take at least 2 instructions to create 2 vectors of results. It would be ideal if they didn't both need to run on the shuffle port, since recent Intel CPUs have a shuffle throughput of only 1 per clock. Instruction-level parallelism doesn't help much when all your instructions are shuffles.
For throughput, 1 shuffle + 2 non-shuffles could be more efficient than 2 shuffles, and have the same latency. Or even 2 shuffles and 2 blends could be more efficient than 3 shuffles, depending on what the bottleneck is in the surrounding code. But I don't think we can replace 2x shufps with that few instructions.
Alternatives to shufps: Your shuffle + unpacklo/hi is pretty good. It would be 4 shuffles total: 2 pshufd to prepare the inputs, then 2 punpckl/h. This is likely to be worse than any bypass latency, except on Nehalem in cases where latency matters but throughput doesn't.
Any other option would seem to require preparing 4 input vectors, for either a blend or packss. See @Mysticial's answer to "_mm_shuffle_ps() equivalent for integer vectors (__m128i)?" for the blend option. For two outputs, that would take a total of 4 shuffles to make the inputs, and then 2x pblendw (fast) or vpblendd (even faster).
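A sketch of how that could look for this problem (my variable names; _mm_blend_epi16 is SSE4.1 pblendw, and imm 0xF0 takes the high 64 bits from the second operand):
__m128i a_for_lo = _mm_shuffle_epi32(a, _MM_SHUFFLE(3,1,2,0));  // (A0 A2 A1 A3)
__m128i b_for_lo = _mm_shuffle_epi32(b, _MM_SHUFFLE(2,0,3,1));  // (B1 B3 B0 B2)
__m128i a_for_hi = _mm_shuffle_epi32(a, _MM_SHUFFLE(2,0,3,1));  // (A1 A3 A0 A2)
__m128i b_for_hi = _mm_shuffle_epi32(b, _MM_SHUFFLE(3,1,2,0));  // (B0 B2 B1 B3)
__m128i lo = _mm_blend_epi16(a_for_lo, b_for_lo, 0xF0);         // (A0 A2 B0 B2)
__m128i hi = _mm_blend_epi16(a_for_hi, b_for_hi, 0xF0);         // (A1 A3 B1 B3)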
Using packusdw or packuswb for 16 or 8-bit elements would also work. It would take 2x pand instructions to mask off the odd elements of a and b, and 2x psrld to shift the odd elements down to the even positions. That sets you up for 2x packusdw to create the two output vectors (the unsigned-saturating packs are needed because the masked/shifted inputs are zero-extended; the signed forms would clamp values with the top bit set). That's 6 total instructions, plus many movdqa because those all destroy their inputs (unlike pshufd, which is a copy+shuffle).
// don't use this, it's not optimal for any CPU; packusdw (SSE4.1) saturates,
// so this is only correct when every element value fits in 16 bits (0..65535)
void unzip32_pack(__m128i &a, __m128i &b) {
    __m128i a_even = _mm_and_si128(a, _mm_setr_epi32(-1, 0, -1, 0));  // (A0  0 A2  0)
    __m128i a_odd  = _mm_srli_epi64(a, 32);                           // (A1  0 A3  0)
    __m128i b_even = _mm_and_si128(b, _mm_setr_epi32(-1, 0, -1, 0));  // (B0  0 B2  0)
    __m128i b_odd  = _mm_srli_epi64(b, 32);                           // (B1  0 B3  0)
    __m128i lo = _mm_packus_epi32(a_even, b_even);  // (A0 A2 B0 B2)
    __m128i hi = _mm_packus_epi32(a_odd,  b_odd);   // (A1 A3 B1 B3)
    a = lo;
    b = hi;
}
Nehalem is the only CPU where it might be worth using something other than 2x shufps, because of its high (2c) bypass delay. It has 2-per-clock shuffle throughput, and pshufd is a copy+shuffle, so 2x pshufd to prepare copies of a and b would only need one extra movdqa after that to get the punpckldq and punpckhdq results into separate registers. (movdqa isn't free; it has 1c latency and needs a vector execution port on Nehalem. It's only cheaper than a shuffle if you're bottlenecked on shuffle throughput, rather than overall front-end bandwidth (uop throughput) or something.)
I very much recommend just using 2x shufps. It will be good on the average CPU, and not horrible anywhere.
AVX512 introduced a lane-crossing pack-with-truncation instruction that narrows a single vector (instead of being a 2-input shuffle). It's the inverse of pmovzx, and can narrow 64b->8b or any other combination, instead of only by a factor of 2.
For this case, __m256i _mm512_cvtepi64_epi32(__m512i a) (vpmovqd) will take the even 32-bit elements from a vector and pack them together (i.e. the low halves of each 64-bit element). It's still not a good building block for a full unzip, though, since you need something else to get the odd elements into place.
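For example, deinterleaving the 32-bit elements of a single __m512i could look like this (a sketch, AVX-512F only; the odd elements still need a separate shift):
__m256i even = _mm512_cvtepi64_epi32(v);                        // low dword of each qword: the even elements
__m256i odd  = _mm512_cvtepi64_epi32(_mm512_srli_epi64(v, 32)); // shift the odd elements down, then truncate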
It also comes in signed/unsigned saturation versions. The instructions even have a memory-destination form that the intrinsics expose to let you do a masked-store.
But for this problem, as Mysticial points out, AVX512 provides 2-input lane-crossing shuffles which you can use like shufps to solve the whole problem in just two shuffles: vpermi2d/vpermt2d.
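A sketch with the 128-bit forms (assumes AVX-512F + AVX-512VL; the index vectors are mine):
__m128i idx_lo = _mm_setr_epi32(0, 2, 4, 6);        // indices 0..3 select from a, 4..7 from b
__m128i idx_hi = _mm_setr_epi32(1, 3, 5, 7);
__m128i lo = _mm_permutex2var_epi32(a, idx_lo, b);  // (A0 A2 B0 B2)
__m128i hi = _mm_permutex2var_epi32(a, idx_hi, b);  // (A1 A3 B1 B3)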