c++ simd intrinsics sse2 avx2

Emulating byte-shifts on 32 bytes with AVX (lane-crossing)


I am migrating vectorized code written using SSE2 intrinsics to AVX2 intrinsics.

Much to my disappointment, I discovered that the byte-shift intrinsics _mm256_slli_si256 and _mm256_srli_si256 operate on the two 128-bit halves (lanes) of the AVX register separately: zeroes are shifted into each half, and no bytes cross the middle. (By contrast, _mm_slli_si128 and _mm_srli_si128 shift a whole SSE register.)
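To make the difference concrete, here is a minimal scalar sketch (plain C++, no intrinsics) of the two behaviors for a left shift. The names Bytes32, model_lane_shift_left and model_full_shift_left are made up for illustration; the first function only models what the per-lane intrinsic does, it is not the intrinsic itself.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes32 = std::array<std::uint8_t, 32>;

// Models _mm256_slli_si256(a, n): each 16-byte lane is shifted on its own,
// so nothing moves from the low lane into the high lane.
inline Bytes32 model_lane_shift_left(const Bytes32& a, int n) {
    assert(n >= 0);
    Bytes32 r{};  // zero-initialized: vacated bytes stay zero
    for (int lane = 0; lane < 32; lane += 16)
        for (int i = n; i < 16; ++i)
            r[lane + i] = a[lane + i - n];
    return r;
}

// Models the shift asked for here: one shift across all 32 bytes.
inline Bytes32 model_full_shift_left(const Bytes32& a, int n) {
    assert(n >= 0);
    Bytes32 r{};
    for (int i = n; i < 32; ++i)
        r[i] = a[i - n];
    return r;
}
```

With bytes 1..32 as input and n = 4, the per-lane model leaves bytes 16..19 at zero, whereas the full-width model fills them with 13..16 carried over from the low lane.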

Can you recommend a short substitute?


UPDATE:

A full-width left shift by N bytes (the 32-byte analogue of _mm256_slli_si256) is efficiently achieved with, for shifts smaller than 16 bytes:

_mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), 16 - N)
// or, for shifts larger than 16 bytes:
_mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 3, 0)), N - 16)

(Or use vpermq, i.e. _mm256_permute4x64_epi64, which is better than vperm2i128 on some CPUs, but worse on Zen 2 and 3.)

But the question remains for _mm256_srli_si256.


Solution

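  • Drawing on several suggestions, I gathered the solutions below, where N is the total shift in bytes and must be a compile-time constant. The key to crossing the lane boundary is the align instruction, _mm256_alignr_epi8, combined with a lane permute.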
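Why the align instruction helps can be seen from a scalar model of its per-lane behavior. The sketch below is plain C++ with no intrinsics; model_alignr, model_low_to_high and Bytes32 are hypothetical names, and the model only covers immediates from 0 to 16.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Bytes32 = std::array<std::uint8_t, 32>;

// Scalar model of _mm256_alignr_epi8(hi, lo, n) for 0 <= n <= 16:
// per 16-byte lane, concatenate hi:lo (lo in the low bytes), shift the
// 32-byte value right by n bytes, and keep the low 16 bytes.
inline Bytes32 model_alignr(const Bytes32& hi, const Bytes32& lo, int n) {
    assert(n >= 0 && n <= 16);
    Bytes32 r{};
    for (int lane = 0; lane < 32; lane += 16)
        for (int i = 0; i < 16; ++i) {
            const int src = i + n;  // index into the 32-byte concatenation
            r[lane + i] = src < 16 ? lo[lane + src] : hi[lane + src - 16];
        }
    return r;
}

// Scalar model of the permute used for left shifts:
// high lane <- low lane of a, low lane <- zero.
inline Bytes32 model_low_to_high(const Bytes32& a) {
    Bytes32 r{};
    for (int i = 0; i < 16; ++i)
        r[16 + i] = a[i];
    return r;
}
```

In this model, model_alignr(a, model_low_to_high(a), 16 - n) gives the full 32-byte left shift by n for 0 < n < 16. The permuted vector supplies, in each lane, exactly the bytes that have to enter from the neighbouring lane (or zeroes for the low lane), which is why the immediate is 16 - N rather than N.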

    _mm256_slli_si256(A, N), extended to the full 32 bytes (left shift)

    0 < N < 16

    _mm256_alignr_epi8(A, _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), 16 - N)
    

    N = 16

    _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0))
    

    16 < N < 32

    _mm256_slli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(0, 0, 2, 0)), N - 16)
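The three left-shift cases can be folded into one helper that picks the case at compile time. This is only a sketch under stated assumptions: shift_left_256, shift_left, ref_shift_left and Bytes32 are hypothetical names, C++17 is assumed for if constexpr, and the intrinsic path is compiled only when __AVX2__ is defined (for example with -mavx2 on GCC or Clang). Without it, the wrapper falls back to the scalar reference so the snippet still builds.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

using Bytes32 = std::array<std::uint8_t, 32>;

// Portable reference: move every byte n positions up, zero-filling.
inline Bytes32 ref_shift_left(const Bytes32& a, int n) {
    assert(n >= 0 && n <= 32);
    Bytes32 r{};
    for (int i = n; i < 32; ++i)
        r[i] = a[i - n];
    return r;
}

#if defined(__AVX2__)
// The three cases listed above, selected at compile time.
template <int N>
inline __m256i shift_left_256(__m256i a) {
    static_assert(N > 0 && N < 32, "this sketch covers 0 < N < 32 only");
    // t = [high lane: low lane of a | low lane: zero]
    const __m256i t = _mm256_permute2x128_si256(a, a, _MM_SHUFFLE(0, 0, 2, 0));
    if constexpr (N < 16) {
        return _mm256_alignr_epi8(a, t, 16 - N);
    } else if constexpr (N == 16) {
        return t;
    } else {
        return _mm256_slli_si256(t, N - 16);
    }
}
#endif

// Array-based wrapper so callers need no vector types.
template <int N>
inline Bytes32 shift_left(const Bytes32& a) {
    static_assert(N > 0 && N < 32, "this sketch covers 0 < N < 32 only");
#if defined(__AVX2__)
    const __m256i v =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a.data()));
    Bytes32 r;
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(r.data()),
                        shift_left_256<N>(v));
    return r;
#else
    return ref_shift_left(a, N);  // fallback when AVX2 is not enabled
#endif
}
```

A call such as shift_left<5>(a) can then be compared byte for byte against ref_shift_left(a, 5). Making N a template parameter keeps the immediate operands compile-time constants, which the intrinsics require.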
    

    _mm256_srli_si256(A, N), extended to the full 32 bytes (right shift)

    0 < N < 16

    _mm256_alignr_epi8(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), A, N)
    

    N = 16

    _mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1))
    

    16 < N < 32

    _mm256_srli_si256(_mm256_permute2x128_si256(A, A, _MM_SHUFFLE(2, 0, 0, 1)), N - 16)
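The right-shift cases can be wrapped the same way. Again a sketch rather than a definitive implementation: shift_right_256, shift_right, ref_shift_right and Bytes32 are hypothetical names, C++17 is assumed for if constexpr, and the intrinsic path is compiled only when __AVX2__ is defined, with a scalar fallback otherwise.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

using Bytes32 = std::array<std::uint8_t, 32>;

// Portable reference: move every byte n positions down, zero-filling.
inline Bytes32 ref_shift_right(const Bytes32& a, int n) {
    assert(n >= 0 && n <= 32);
    Bytes32 r{};
    for (int i = 0; i + n < 32; ++i)
        r[i] = a[i + n];
    return r;
}

#if defined(__AVX2__)
// The three right-shift cases listed above, selected at compile time.
template <int N>
inline __m256i shift_right_256(__m256i a) {
    static_assert(N > 0 && N < 32, "this sketch covers 0 < N < 32 only");
    // t = [high lane: zero | low lane: high lane of a]
    const __m256i t = _mm256_permute2x128_si256(a, a, _MM_SHUFFLE(2, 0, 0, 1));
    if constexpr (N < 16) {
        return _mm256_alignr_epi8(t, a, N);
    } else if constexpr (N == 16) {
        return t;
    } else {
        return _mm256_srli_si256(t, N - 16);
    }
}
#endif

// Array-based wrapper so callers need no vector types.
template <int N>
inline Bytes32 shift_right(const Bytes32& a) {
    static_assert(N > 0 && N < 32, "this sketch covers 0 < N < 32 only");
#if defined(__AVX2__)
    const __m256i v =
        _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a.data()));
    Bytes32 r;
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(r.data()),
                        shift_right_256<N>(v));
    return r;
#else
    return ref_shift_right(a, N);  // fallback when AVX2 is not enabled
#endif
}
```

Note that for the right shift the permuted vector is the first (high) operand of _mm256_alignr_epi8 and the immediate is N itself, whereas the left shift passes the permuted vector second and uses 16 - N.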