Tags: sse, simd, neon, avx

sse/avx equivalent for neon vuzp


Intel's vector extensions (SSE, AVX, etc.) provide two unpack operations for each element size; the SSE intrinsics, for example, are _mm_unpacklo_* and _mm_unpackhi_*. For 4 elements in a vector, they do this:

inputs:      (A0 A1 A2 A3) (B0 B1 B2 B3)
unpacklo/hi: (A0 B0 A1 B1) (A2 B2 A3 B3)
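As a concrete illustration of the diagram above (the wrapper names are mine, not standard intrinsics), the 32-bit unpack pair looks like this:

```cpp
#include <immintrin.h>

// Demo of SSE unpack (interleave) for 32-bit elements.
// ziplo/ziphi are hypothetical helper names for this sketch.
static __m128i ziplo(__m128i a, __m128i b) {
    return _mm_unpacklo_epi32(a, b);   // (A0 B0 A1 B1)
}
static __m128i ziphi(__m128i a, __m128i b) {
    return _mm_unpackhi_epi32(a, b);   // (A2 B2 A3 B3)
}
```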

The equivalent of unpack is vzip in ARM's NEON instruction set. However, the NEON instruction set also provides the operation vuzp which is the inverse of vzip. For 4 elements in a vector, it does this:

inputs: (A0 A1 A2 A3) (B0 B1 B2 B3)
vuzp:   (A0 A2 B0 B2) (A1 A3 B1 B3)

How can vuzp be implemented efficiently using SSE or AVX intrinsics? There doesn't seem to be an instruction for it. For 4 elements, I assume it can be done using a shuffle and a subsequent unpack moving 2 elements:

inputs:        (A0 A1 A2 A3) (B0 B1 B2 B3)
shuffle:       (A0 A2 A1 A3) (B0 B2 B1 B3)
unpacklo/hi 2: (A0 A2 B0 B2) (A1 A3 B1 B3)
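A sketch of that shuffle + unpack idea for 32-bit elements (the helper names are mine): _mm_shuffle_epi32 with _MM_SHUFFLE(3,1,2,0) produces the (A0 A2 A1 A3) ordering, and the 64-bit unpacks then move 2 elements at a time:

```cpp
#include <immintrin.h>

// Sketch of the shuffle + unpack approach from the question, for 32-bit
// elements. unziplo2/unziphi2 are hypothetical helper names.
static __m128i unziplo2(__m128i a, __m128i b) {
    __m128i a_s = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 1, 2, 0)); // (A0 A2 A1 A3)
    __m128i b_s = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 1, 2, 0)); // (B0 B2 B1 B3)
    return _mm_unpacklo_epi64(a_s, b_s);                         // (A0 A2 B0 B2)
}
static __m128i unziphi2(__m128i a, __m128i b) {
    __m128i a_s = _mm_shuffle_epi32(a, _MM_SHUFFLE(3, 1, 2, 0));
    __m128i b_s = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 1, 2, 0));
    return _mm_unpackhi_epi64(a_s, b_s);                         // (A1 A3 B1 B3)
}
```

When both outputs are needed, a compiler can share the two pshufd results, so this is 2x pshufd + 2x punpck total.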

Is there a more efficient solution using a single instruction? (Maybe for SSE first - I'm aware that for AVX we may have the additional problem that shuffle and unpack don't cross lanes.)

Knowing this may be useful for writing code for data swizzling and deswizzling (it should be possible to derive deswizzling code just by inverting the operations of swizzling code based on unpack operations).

Edit: Here is the 8-element version. This is the effect of NEON's vuzp:

input:         (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
vuzp:          (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)

This is my version with one shuffle and one unpack for each output element (seems to generalize to larger element numbers):

input:         (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
shuffle:       (A0 A2 A4 A6 A1 A3 A5 A7) (B0 B2 B4 B6 B1 B3 B5 B7)
unpacklo/hi 4: (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)
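For 16-bit elements the shuffle step isn't a single pshufd, but with SSSE3 it can be one pshufb per input. A sketch under that assumption (helper names are mine):

```cpp
#include <immintrin.h>

// Sketch assuming SSSE3: for 16-bit elements the "shuffle" step can be a
// single pshufb per input. Helper names are mine. The target attribute lets
// this compile without -mssse3.
__attribute__((target("ssse3")))
static __m128i evens_then_odds(__m128i v) {
    // Gather words 0,2,4,6 into the low half, words 1,3,5,7 into the high half.
    const __m128i m = _mm_setr_epi8(0, 1, 4, 5, 8, 9, 12, 13,
                                    2, 3, 6, 7, 10, 11, 14, 15);
    return _mm_shuffle_epi8(v, m);
}
__attribute__((target("ssse3")))
static __m128i unziplo16(__m128i a, __m128i b) {
    // (A0 A2 A4 A6 B0 B2 B4 B6)
    return _mm_unpacklo_epi64(evens_then_odds(a), evens_then_odds(b));
}
__attribute__((target("ssse3")))
static __m128i unziphi16(__m128i a, __m128i b) {
    // (A1 A3 A5 A7 B1 B3 B5 B7)
    return _mm_unpackhi_epi64(evens_then_odds(a), evens_then_odds(b));
}
```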

The method suggested by EOF is correct but would require log2(8)=3 unpack operations for each output:

input:         (A0 A1 A2 A3 A4 A5 A6 A7) (B0 B1 B2 B3 B4 B5 B6 B7)
unpacklo/hi 1: (A0 B0 A1 B1 A2 B2 A3 B3) (A4 B4 A5 B5 A6 B6 A7 B7)
unpacklo/hi 1: (A0 A4 B0 B4 A1 A5 B1 B5) (A2 A6 B2 B6 A3 A7 B3 B7)
unpacklo/hi 1: (A0 A2 A4 A6 B0 B2 B4 B6) (A1 A3 A5 A7 B1 B3 B5 B7)
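For the 4-element (32-bit) case, EOF's approach needs log2(4) = 2 rounds of unpacking. A sketch (the function name is mine):

```cpp
#include <immintrin.h>

// Sketch of EOF's repeated-unpack idea for the 4-element (32-bit) case:
// log2(4) = 2 rounds of unpacklo/unpackhi, no pshufd needed.
static void unzip32_unpack(__m128i &lo, __m128i &hi, __m128i a, __m128i b) {
    __m128i t0 = _mm_unpacklo_epi32(a, b);   // (A0 B0 A1 B1)
    __m128i t1 = _mm_unpackhi_epi32(a, b);   // (A2 B2 A3 B3)
    lo = _mm_unpacklo_epi32(t0, t1);         // (A0 A2 B0 B2)
    hi = _mm_unpackhi_epi32(t0, t1);         // (A1 A3 B1 B3)
}
```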

Solution

  • it should be possible to derive deswizzling code just by inverting the operations

    Get used to being disappointed and frustrated by the non-orthogonality of Intel's vector shuffles. There is no direct inverse for punpck. The SSE/AVX pack instructions are for narrowing the element size. (So one packusdw is the inverse of punpck[lh]wd against zero, but not when used with two arbitrary vectors). Also, pack instructions are only available for 32->16 (dword to word) and 16->8 (word to byte) element size. There is no packusqd (64->32).

    PACK instructions are only available with saturation, not truncation (until AVX512 vpmovqd), so for this use-case we'd need to prepare 4 different input vectors for 2 PACK instructions. This turns out to be horrible, much worse than your 3-shuffle solution (see unzip32_pack() in the Godbolt link below).


    There is a 2-input shuffle that will do what you want for 32-bit elements, though: shufps. The low 2 elements of the result can be any 2 elements of the first vector, and the high 2 elements can be any 2 elements of the second vector. The shuffle we want fits those constraints, so we can use it.

    We can solve the whole problem in 2 instructions (plus a movdqa for the non-AVX version, because shufps destroys the left input register):

    inputs: a=(A0 A1 A2 A3) b=(B0 B1 B2 B3)
    _mm_shuffle_ps(a,b,_MM_SHUFFLE(2,0,2,0)); // (A0 A2 B0 B2)
    _mm_shuffle_ps(a,b,_MM_SHUFFLE(3,1,3,1)); // (A1 A3 B1 B3)
    

    _MM_SHUFFLE() uses most-significant-element first notation, like all of Intel's documentation. Your notation is opposite.

    The only intrinsic for shufps uses __m128 / __m256 vectors (float not integer), so you have to cast to use it. _mm_castsi128_ps is a reinterpret_cast: it compiles to zero instructions.

    #include <immintrin.h>
    static inline
    __m128i unziplo(__m128i a, __m128i b) {
        __m128 aps = _mm_castsi128_ps(a);
        __m128 bps = _mm_castsi128_ps(b);
        __m128 lo = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(2,0,2,0));
        return _mm_castps_si128(lo);
    }
    
    static inline    
    __m128i unziphi(__m128i a, __m128i b) {
        __m128 aps = _mm_castsi128_ps(a);
        __m128 bps = _mm_castsi128_ps(b);
        __m128 hi = _mm_shuffle_ps(aps, bps, _MM_SHUFFLE(3,1,3,1));
        return _mm_castps_si128(hi);
    }
    

    gcc will inline these to a single instruction each. With the static inline removed, we can see how they'd compile as non-inline functions; I put them on the Godbolt compiler explorer:

    unziplo(long long __vector(2), long long __vector(2)):
        shufps  xmm0, xmm1, 136
        ret
    unziphi(long long __vector(2), long long __vector(2)):
        shufps  xmm0, xmm1, 221
        ret
    

    Using FP shuffles on integer data is fine on recent Intel/AMD CPUs. There is no extra bypass-delay latency (see this answer, which summarizes what Agner Fog's microarch guide says about it). It has extra latency on Intel Nehalem, but may still be the best choice there. FP loads/shuffles won't fault on or corrupt integer bit-patterns that happen to represent a NaN; only actual FP math instructions care about that.

    Fun fact: on AMD Bulldozer-family CPUs (and Intel Core2), FP shuffles like shufps still run in the ivec domain, so they actually have extra latency when used between FP instructions, but not between integer instructions!


    Unlike ARM NEON / ARMv8 SIMD, x86 SSE doesn't have any 2-output-register instructions, and such instructions are rare in x86 generally. (They exist, e.g. mul r64, but always decode to multiple uops on current CPUs.)

    It's always going to take at least 2 instructions to create 2 vectors of results. It would be ideal if they didn't both need to run on the shuffle port, since recent Intel CPUs have a shuffle throughput of only 1 per clock. Instruction-level parallelism doesn't help much when all your instructions are shuffles.

    For throughput, 1 shuffle + 2 non-shuffles could be more efficient than 2 shuffles, and have the same latency. Or even 2 shuffles and 2 blends could be more efficient than 3 shuffles, depending on what the bottleneck is in the surrounding code. But I don't think we can replace 2x shufps with that few instructions.


    Without SHUFPS:

    Your shuffle + unpacklo/hi is pretty good. It would be 4 shuffles total: 2 pshufd to prepare the inputs, then 2 punpckl/h. This is likely to be worse than any bypass latency, except on Nehalem in cases where latency matters but throughput doesn't.

    Any other option would seem to require preparing 4 input vectors, for either a blend or packss. See @Mysticial's answer to _mm_shuffle_ps() equivalent for integer vectors (__m128i)? for the blend option. For two outputs, that would take a total of 4 shuffles to make the inputs, and then 2x pblendw (fast) or vpblendd (even faster).
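A sketch of that blend option for 32-bit elements (the helper name is mine, not Mysticial's exact code): 4x pshufd to position the wanted elements, then 2x pblendw (SSE4.1):

```cpp
#include <immintrin.h>

// Sketch of the blend option for 32-bit elements: 4x pshufd to position the
// wanted elements, then 2x pblendw (SSE4.1) taking the high 64 bits from the
// second operand. The target attribute lets this compile without -msse4.1.
__attribute__((target("sse4.1")))
static void unzip32_blend(__m128i &a, __m128i &b) {
    __m128i a_lo = _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 0, 2, 0)); // (A0 A2 . .)
    __m128i b_lo = _mm_shuffle_epi32(b, _MM_SHUFFLE(2, 0, 0, 0)); // (. . B0 B2)
    __m128i a_hi = _mm_shuffle_epi32(a, _MM_SHUFFLE(0, 0, 3, 1)); // (A1 A3 . .)
    __m128i b_hi = _mm_shuffle_epi32(b, _MM_SHUFFLE(3, 1, 0, 0)); // (. . B1 B3)
    a = _mm_blend_epi16(a_lo, b_lo, 0xF0);   // (A0 A2 B0 B2)
    b = _mm_blend_epi16(a_hi, b_hi, 0xF0);   // (A1 A3 B1 B3)
}
```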

    Using packssdw or packsswb for 16 or 8 bit elements would also work. It would take 2x pand instructions to mask off the odd elements of a and b, and 2x psrld to shift the odd elements down to the even positions. That sets you up for 2x pack instructions to create the two output vectors: 6 total instructions, plus many movdqa because those all destroy their inputs (unlike pshufd which is a copy+shuffle). Beware that pack saturates: the signed packssdw would clamp the masked (zero-extended) elements at 0x7FFF, so the unsigned SSE4.1 packusdw is the variant that gives exact results here.

    // don't use this, it's not optimal for any CPU
    // 16-bit element version; _mm_packus_epi32 (packusdw) is SSE4.1, and is
    // exact here because both inputs hold zero-extended 16-bit values.
    void unzip16_pack(__m128i &a, __m128i &b) {
        __m128i mask   = _mm_set1_epi32(0x0000FFFF);
        __m128i a_even = _mm_and_si128(a, mask);
        __m128i a_odd  = _mm_srli_epi32(a, 16);
        __m128i b_even = _mm_and_si128(b, mask);
        __m128i b_odd  = _mm_srli_epi32(b, 16);
        __m128i lo = _mm_packus_epi32(a_even, b_even);
        __m128i hi = _mm_packus_epi32(a_odd, b_odd);
        a = lo;
        b = hi;
    }
    

    Nehalem is the only CPU where it might be worth using something other than 2x shufps, because of its high (2c) bypass delay. It has 2 per clock shuffle throughput, and pshufd is a copy+shuffle, so 2x pshufd to prepare copies of a and b would only need one extra movdqa after that to get the punpckldq and punpckhdq results into separate registers. (movdqa isn't free; it has 1c latency and needs a vector execution port on Nehalem. It's only cheaper than a shuffle if you're bottlenecked on shuffle throughput, rather than overall front-end bandwidth (uop throughput) or something.)

    I very much recommend just using 2x shufps. It will be good on the average CPU, and not horrible anywhere.
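As a sanity check on the swizzle/deswizzle point from the question: zipping with punpckl/hdq and then unzipping with the two shufps shuffles recovers the original vectors. The unzip helpers are repeated here so the snippet is self-contained:

```cpp
#include <immintrin.h>

// Round-trip sketch: zip (interleave) with punpckl/hdq, then unzip with the
// two shufps shuffles, and the original vectors come back.
// unziplo/unziphi are the same shufps-based helpers as above.
static __m128i unziplo(__m128i a, __m128i b) {
    return _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(a),
                                           _mm_castsi128_ps(b),
                                           _MM_SHUFFLE(2, 0, 2, 0)));
}
static __m128i unziphi(__m128i a, __m128i b) {
    return _mm_castps_si128(_mm_shuffle_ps(_mm_castsi128_ps(a),
                                           _mm_castsi128_ps(b),
                                           _MM_SHUFFLE(3, 1, 3, 1)));
}
// zip(a,b) = (_mm_unpacklo_epi32, _mm_unpackhi_epi32);
// unziplo/unziphi of the zipped pair give back (a, b).
```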


    AVX512

    AVX512 introduced a lane-crossing pack-with-truncation instruction that narrows a single vector (instead of being a 2-input shuffle). It's the inverse of pmovzx, and can narrow 64b->8b or any other combination, instead of only by a factor of 2.

    For this case, __m256i _mm512_cvtepi64_epi32 (__m512i a) (vpmovqd) will take the even 32-bit elements from a vector and pack them together (i.e. the low halves of each 64-bit element). It's still not a good building block for an unzip, though, since you need something else to get the odd elements into place.

    It also comes in signed/unsigned saturation versions. The instructions even have a memory-destination form that the intrinsics expose to let you do a masked-store.

    But for this problem, as Mysticial points out, AVX512 provides 2-input lane-crossing shuffles which you can use like shufps to solve the whole problem in just two shuffles: vpermi2d/vpermt2d.