I want to use intrinsics or assembly instructions to shift 64 8-bit elements by one position. For example, turn (1, 2, 3, ..., 63, 64) into (0, 1, 2, 3, ..., 62, 63), without dropping any element in the middle.
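For 32 8-bit elements with AVX/AVX2 you can do the following, where src = (1, 2, 3, ..., 31, 32) and mask = (0, 0, ..., 255, ..., 0, 0) has 255 only at byte 15 (the mask is needed because the per-lane shift loses that element):
a = _mm256_slli_si256(src, 1);                 // shift each 128-bit lane left by one byte
src = _mm256_and_si256(src, mask);             // keep only byte 15, the one the shift drops
src = _mm256_permute2f128_si256(src, src, 1);  // swap the two 128-bit lanes
src = _mm256_srli_si256(src, 15);              // move the saved byte to byte 16
src = _mm256_add_epi16(a, src);                // combine; the nonzero bytes never overlap
As a result, src = (0, 1, 2, 3, ..., 30, 31).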
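For anyone who wants to try the 32-element sequence directly, here is a minimal self-contained sketch. It assumes GCC or Clang on x86-64 (it relies on the `target` attribute and `__builtin_cpu_supports`), and the function names are made up for illustration. It uses `_mm256_or_si256` instead of the add for the final merge, which gives the same result here because the two operands have no overlapping nonzero bytes. A scalar fallback runs when the CPU has no AVX2.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Scalar reference (hypothetical helper): out[0] = 0, out[i] = in[i - 1].
// Iterates backwards so it also works when in == out.
static void shift32_scalar(const std::uint8_t* in, std::uint8_t* out) {
    for (std::size_t i = 31; i > 0; --i) out[i] = in[i - 1];
    out[0] = 0;
}

// AVX2 path following the steps in the question.
__attribute__((target("avx2")))
static void shift32_avx2(const std::uint8_t* in, std::uint8_t* out) {
    __m256i src = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(in));

    // Mask with 0xFF only at byte 15, the byte the per-lane shift drops.
    std::uint8_t mask_bytes[32] = {};
    mask_bytes[15] = 0xFF;
    __m256i mask = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(mask_bytes));

    __m256i a = _mm256_slli_si256(src, 1);               // per-lane shift left by one byte
    __m256i carry = _mm256_and_si256(src, mask);         // isolate byte 15
    carry = _mm256_permute2f128_si256(carry, carry, 1);  // swap lanes: byte 15 -> byte 31
    carry = _mm256_srli_si256(carry, 15);                // per-lane shift: byte 31 -> byte 16
    __m256i result = _mm256_or_si256(a, carry);          // merge the carried byte back in

    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), result);
}

// Dispatcher (hypothetical name): picks the AVX2 path when the CPU supports it.
void shift_up_one_32(const std::uint8_t* in, std::uint8_t* out) {
    if (__builtin_cpu_supports("avx2")) {
        shift32_avx2(in, out);
    } else {
        shift32_scalar(in, out);
    }
}
```

Calling `shift_up_one_32` on a buffer holding 1..32 should leave 0..31 in the output buffer.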
You can use _mm512_maskz_permutexvar_epi8 (it requires AVX512_VBMI):
__mmask64 k = 0xfffffffffffffffe; // bit 0 is clear, so the lowest 8-bit element is zeroed
auto idx = _mm512_set_epi8(62, 61, ..., 0, 0); // shift all elements by one
return _mm512_maskz_permutexvar_epi8(k, idx, src);
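A fuller sketch of the same idea, in case it helps to see it end to end. It assumes GCC or Clang on x86-64 (for the `target` attribute and `__builtin_cpu_supports`), and the function names are hypothetical. Rather than spelling out all 64 arguments to `_mm512_set_epi8`, it builds the index vector from a byte array, which should be equivalent: destination byte i reads source byte i - 1. A scalar fallback runs on CPUs without AVX512_VBMI.

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Scalar reference (hypothetical helper): out[0] = 0, out[i] = in[i - 1].
// Iterates backwards so it also works when in == out.
static void shift64_scalar(const std::uint8_t* in, std::uint8_t* out) {
    for (std::size_t i = 63; i > 0; --i) out[i] = in[i - 1];
    out[0] = 0;
}

// AVX-512 VBMI path: one masked byte permute does the whole shift.
__attribute__((target("avx512f,avx512bw,avx512vbmi")))
static void shift64_vbmi(const std::uint8_t* in, std::uint8_t* out) {
    // Index table: byte i selects source byte i - 1.
    // Entry 0 is a placeholder because the mask zeroes that byte anyway.
    std::uint8_t idx_bytes[64];
    idx_bytes[0] = 0;
    for (int i = 1; i < 64; ++i) idx_bytes[i] = static_cast<std::uint8_t>(i - 1);

    __m512i idx = _mm512_loadu_si512(idx_bytes);
    __m512i src = _mm512_loadu_si512(in);

    __mmask64 k = 0xfffffffffffffffeULL;  // bit 0 clear: zero the lowest byte
    __m512i result = _mm512_maskz_permutexvar_epi8(k, idx, src);

    _mm512_storeu_si512(out, result);
}

// Dispatcher (hypothetical name): uses VBMI only when the CPU reports it.
void shift_up_one_64(const std::uint8_t* in, std::uint8_t* out) {
    if (__builtin_cpu_supports("avx512vbmi")) {
        shift64_vbmi(in, out);
    } else {
        shift64_scalar(in, out);
    }
}
```

In real code the index vector would normally be a constant hoisted out of any loop; it is rebuilt on every call here only to keep the sketch short.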