c++ simd avx avx2 sign-extension

Unpacking nibbles to bytes: direct instruction or efficient way to keep the sign?


const __m128i mask = _mm_set1_epi8(0x0F);
const __m128i vec_unpack_one = _mm_and_si128(vec, mask);
const __m128i vec_unpack_two = _mm_and_si128(_mm_srli_epi16(vec, 4), mask);

I have a set of 32 nibbles stored in vec. I want to unpack them and store each nibble as a byte, which is what the second and third lines of the code snippet do. However, I want to retain the sign of each nibble and sign-extend it to a byte.

For example, take one 8-bit element of vec: 01111011.

In vec_unpack_one it is currently unpacked as 00001011, and in vec_unpack_two as 00000111. However, I want the unpacked value in vec_unpack_one to be 11111011; otherwise the values used in subsequent operations diverge from what was intended.

The solution I currently have in mind is to isolate the MSB of each nibble after the bitwise AND operations and then do some sort of masked OR based on that bit. But is there a way to achieve this with a direct instruction, or more efficiently? Suggestions are welcome. Thanks.


Solution

  • Sign-extending a nibble could be done with pshufb (_mm_shuffle_epi8) as a lookup table. You still need to mask away the high bits, since a set MSB in the index byte zeros the corresponding output byte instead of indexing the other vector.

    So you'd still start with the same code you have (the standard way to separate nibbles) and do
    v0 = _mm_shuffle_epi8(sign_extend_lut, v0) and the same for v1.
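A minimal sketch of the lookup-table approach described above, assuming SSSE3 is available. The names unpack_nibbles_lut, check_unpack_nibbles_lut and TARGET_SSSE3 are illustrative and not part of the original code. The table maps each 4-bit index to its two's-complement value as a byte, and the check compares against a scalar reference.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang; enables SSSE3 per function so no -mssse3 flag is needed.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

// Hypothetical helper: splits 16 packed bytes into sign-extended low/high nibbles.
TARGET_SSSE3
inline void unpack_nibbles_lut(__m128i vec, __m128i* lo, __m128i* hi) {
    const __m128i mask = _mm_set1_epi8(0x0F);
    // Entry i is the 4-bit two's-complement value i, sign-extended to 8 bits.
    const __m128i sign_extend_lut = _mm_setr_epi8(
        0, 1, 2, 3, 4, 5, 6, 7, -8, -7, -6, -5, -4, -3, -2, -1);
    // Standard nibble separation; the AND also clears the index MSB for pshufb.
    const __m128i v0 = _mm_and_si128(vec, mask);
    const __m128i v1 = _mm_and_si128(_mm_srli_epi16(vec, 4), mask);
    *lo = _mm_shuffle_epi8(sign_extend_lut, v0);
    *hi = _mm_shuffle_epi8(sign_extend_lut, v1);
}

// Checks every possible byte value against a scalar reference.
TARGET_SSSE3
inline void check_unpack_nibbles_lut() {
    for (int b = 0; b < 256; ++b) {
        __m128i lo, hi;
        unpack_nibbles_lut(_mm_set1_epi8(static_cast<char>(b)), &lo, &hi);
        int8_t out_lo[16], out_hi[16];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out_lo), lo);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out_hi), hi);
        const int exp_lo = ((b & 0x0F) ^ 8) - 8;  // scalar sign extension
        const int exp_hi = ((b >> 4) ^ 8) - 8;
        assert(out_lo[0] == exp_lo && out_hi[0] == exp_hi);
    }
}
```

With the byte from the question, 01111011, this should give 11111011 for the low nibble and 00000111 for the high nibble.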


    That's probably your best bet vs. a bithack like (x ^ m) - m (2 instructions per half, using _mm_xor_si128 and _mm_sub_epi8), where m is 1U << 3, aka 8 (see "Sign extend a nine-bit number in C" and https://graphics.stanford.edu/~seander/bithacks.html#FixedSignExtend). The bithack also needs the upper bits zeroed ahead of time.

    The exception is if your surrounding code is very shuffle-heavy, especially if older Intel CPUs are important (Haswell through Skylake-family have only 1/clock shuffle throughput: https://uops.info). In that case, possibly consider the bithack.

    We can XOR the input with _mm_set1_epi8(0x88) before separating nibbles, which makes the bithack version only one uop more expensive than the pshufb version instead of two, although it then needs two different vector constants. (Thanks @chtz).
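A minimal sketch of the XOR-then-subtract variant, using only SSE2. The names unpack_nibbles_xor_sub and check_unpack_nibbles_xor_sub are illustrative and not part of the original code. Flipping bit 3 of both nibbles with one XOR up front leaves a single subtract per half after masking.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2

// Hypothetical helper: (x ^ 8) - 8 per nibble, with the XOR hoisted before the split.
inline void unpack_nibbles_xor_sub(__m128i vec, __m128i* lo, __m128i* hi) {
    const __m128i mask = _mm_set1_epi8(0x0F);
    const __m128i flip = _mm_set1_epi8(static_cast<char>(0x88));  // bit 3 of each nibble
    const __m128i bias = _mm_set1_epi8(8);
    const __m128i x = _mm_xor_si128(vec, flip);  // one XOR covers both halves
    *lo = _mm_sub_epi8(_mm_and_si128(x, mask), bias);
    *hi = _mm_sub_epi8(_mm_and_si128(_mm_srli_epi16(x, 4), mask), bias);
}

// Checks every possible byte value against a scalar reference.
inline void check_unpack_nibbles_xor_sub() {
    for (int b = 0; b < 256; ++b) {
        __m128i lo, hi;
        unpack_nibbles_xor_sub(_mm_set1_epi8(static_cast<char>(b)), &lo, &hi);
        int8_t out_lo[16], out_hi[16];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out_lo), lo);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out_hi), hi);
        const int exp_lo = ((b & 0x0F) ^ 8) - 8;  // scalar sign extension
        const int exp_hi = ((b >> 4) ^ 8) - 8;
        assert(out_lo[0] == exp_lo && out_hi[0] == exp_hi);
    }
}
```

This costs one XOR, one shift, two ANDs and two subtracts in total, and uses no shuffle-port instructions.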


    x86's narrowest SIMD shift is 16-bit, so arithmetic right shift isn't going to help, unfortunately.

    If you were later widening further to 16 or 32-bit, you could use vpmovsxbw or vpmovsxbd and then an arithmetic right shift (_mm_cvtepi8_epi16 / _mm_srai_epi16 or their _mm256 equivalents), for the upper nibbles at least.
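A minimal sketch of the widening route for 16-bit outputs, assuming SSE4.1 and handling only the low 8 bytes of the input. The names unpack_nibbles_to_epi16, check_unpack_nibbles_to_epi16 and TARGET_SSE41 are illustrative. The left-shift-then-arithmetic-right-shift used for the low nibbles is an addition here, not something the answer spells out.

```cpp
#include <cassert>
#include <cstdint>
#include <immintrin.h>

// Assumption: GCC/Clang; enables SSE4.1 per function so no -msse4.1 flag is needed.
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

// Hypothetical helper: widens the low 8 bytes of vec to 16-bit lanes, then
// produces the sign-extended high and low nibble of each byte.
TARGET_SSE41
inline void unpack_nibbles_to_epi16(__m128i vec, __m128i* lo, __m128i* hi) {
    const __m128i w = _mm_cvtepi8_epi16(vec);  // pmovsxbw: sign-extend bytes
    // High nibble: the arithmetic shift drops the low nibble and keeps the sign.
    *hi = _mm_srai_epi16(w, 4);
    // Low nibble: move it to the top of the lane, then shift back arithmetically.
    *lo = _mm_srai_epi16(_mm_slli_epi16(w, 12), 12);
}

// Checks every possible byte value against a scalar reference.
TARGET_SSE41
inline void check_unpack_nibbles_to_epi16() {
    for (int b = 0; b < 256; ++b) {
        __m128i lo, hi;
        unpack_nibbles_to_epi16(_mm_set1_epi8(static_cast<char>(b)), &lo, &hi);
        int16_t out_lo[8], out_hi[8];
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out_lo), lo);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out_hi), hi);
        const int exp_lo = ((b & 0x0F) ^ 8) - 8;  // scalar sign extension
        const int exp_hi = ((b >> 4) ^ 8) - 8;
        assert(out_lo[0] == exp_lo && out_hi[0] == exp_hi);
    }
}
```

No mask constant is needed on this path, since the shifts themselves discard the unwanted bits.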