c++ c simd intrinsics avx2

Understanding the practical application of Intel's _mm256_shuffle_epi8 definition


I have some questions concerning the rationale behind the definition of the _mm256_shuffle_epi8() function:

For a function that is supposed to use b to determine the position of a's values in r, its definition is rather convoluted, full of tests and special cases that make it hard to use for simply computing r[i] = a[b[i]] for all values 0 <= i < 32. I am wondering whether I am missing something about the practical usefulness of this definition in real-world scenarios.


Solution

  • Why does it have to set the result vector's bytes to zero?

    It makes the instruction more powerful. Instead of each output byte being limited to one of the 16 input bytes input[ 0 .. 15 ], it can also be a zero, whenever the high bit of the corresponding control byte is set.

    If that’s not what you want, note that bitwise instructions are very fast; it’s easy to mask away those 4 upper bits in each control byte with a single bitwise instruction, either AND or ANDNOT.
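    If you just want r[i] = a[b[i] & 15] with no zeroing, that masking looks like the following minimal sketch (assuming an AVX2 toolchain; `shuffle_no_zeroing` is a made-up name, not a standard intrinsic):

    ```cpp
    #include <immintrin.h>

    // Clear the upper 4 bits of every control byte so bit 7 (the zeroing
    // flag) can never be set; the low 4 bits are the in-lane index anyway.
    __m256i shuffle_no_zeroing(__m256i a, __m256i b) {
        const __m256i low4 = _mm256_set1_epi8(0x0F);
        return _mm256_shuffle_epi8(a, _mm256_and_si256(b, low4));
    }
    ```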

    What is the practical application of having this?

    Here’s one example
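    As an illustration of that kind of use (a sketch only; `keep_prefix` is a hypothetical helper, not from the original answer, and the SSSE3 16-byte version is used for brevity): the zeroing lets you blank the tail of a partial block in a single shuffle, a common step when handling a buffer whose length isn't a multiple of the vector width.

    ```cpp
    #include <immintrin.h>
    #include <cstdint>

    // Keep the first `len` bytes of a 16-byte block and zero the rest,
    // in one pshufb. Control byte i is i when i < len (copy that byte)
    // and 0x80 when i >= len (the set high bit forces a zero output).
    __m128i keep_prefix(__m128i block, int len) {
        alignas(16) uint8_t ctrl[16];
        for (int i = 0; i < 16; i++)
            ctrl[i] = (i < len) ? (uint8_t)i : 0x80;
        return _mm_shuffle_epi8(block, _mm_load_si128((const __m128i*)ctrl));
    }
    ```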

    Why is the definition non-symmetrical for the two 128-bit halves of the vector?

    This way, CPU designers can implement the instruction without 32-byte-wide vector units in the hardware: each 128-bit lane is shuffled independently.

    For example, the E-cores in Intel’s Alder Lake don’t have 32-byte vector units. On these cores, the _mm256_shuffle_epi8 instruction decodes into 2 micro-ops and takes 2 cycles of latency, while the 16-byte version, _mm_shuffle_epi8, decodes into 1 micro-op and takes 1 cycle of latency. Evidently, these cores split the 32-byte input vectors into two 16-byte pieces and apply the same algorithm to both pieces sequentially.

    The situation is very similar on first-generation AMD Zen processors: that instruction is also two micro-ops there, except that, unlike the slow Intel cores, Zen 1 has enough execution units to compute both pieces in parallel and deliver the complete result on the next cycle. Still, it’s two micro-ops.
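    The two-lane behavior described above can be modeled in scalar code (a sketch of the assumed semantics, matching Intel's intrinsics guide: indices never cross the 16-byte lane boundary, and a set high bit zeroes the output byte):

    ```cpp
    #include <cstdint>

    // Scalar model of _mm256_shuffle_epi8: the two 16-byte lanes are
    // shuffled independently with the same per-lane algorithm.
    void shuffle_epi8_model(const uint8_t a[32], const uint8_t b[32], uint8_t r[32]) {
        for (int lane = 0; lane < 2; lane++) {
            const uint8_t* src = a + 16 * lane;  // source bytes for this lane only
            for (int i = 0; i < 16; i++) {
                uint8_t ctrl = b[16 * lane + i];
                // Bit 7 set -> zero; otherwise low 4 bits index into the lane.
                r[16 * lane + i] = (ctrl & 0x80) ? 0 : src[ctrl & 0x0F];
            }
        }
    }
    ```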