c++ c simd intrinsics avx2

Understanding the practical application of Intel's _mm256_shuffle_epi8 definition


I have some questions concerning the rationale behind the definition of the _mm256_shuffle_epi8() function:

For a function that is supposed to use b to determine the position of a's values in r, its definition is rather convoluted, full of tests and special cases that make it hard to use for simply computing r[i] = a[b[i]] for all values 0 <= i < 32. I am wondering whether I am missing something about the practical usefulness of this definition in real-world scenarios.


Solution

  • Why does it have to set the result vector's bytes to zero?

    It makes the instruction more powerful. Instead of each output byte being limited to one of the 16 input bytes input[ 0 .. 15 ], it can also be a zero, whenever the high bit of the corresponding control byte is set.

    If that’s not what you want, note that bitwise instructions are very fast; it’s easy to mask away those 4 upper bits in each control byte with a single bitwise instruction, either AND or ANDNOT.
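    If you just want r[i] = a[b[i] & 15] with no zeroing, that masking looks like the following minimal sketch (assuming an AVX2 toolchain; `shuffle_no_zeroing` is a made-up name, not a standard intrinsic):

    ```cpp
    #include <immintrin.h>

    // Clear the upper 4 bits of every control byte so bit 7 (the zeroing
    // flag) can never be set; the low 4 bits are the in-lane index anyway.
    __m256i shuffle_no_zeroing(__m256i a, __m256i b) {
        const __m256i low4 = _mm256_set1_epi8(0x0F);
        return _mm256_shuffle_epi8(a, _mm256_and_si256(b, low4));
    }
    ```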

    What is the practical application of having this?

    Here’s one example
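    As an illustration of that kind of use (a sketch only; `keep_prefix` is a hypothetical helper, not from the original answer, and the SSSE3 16-byte version is used for brevity): the zeroing lets you blank the tail of a partial block in a single shuffle, a common step when handling a buffer whose length isn't a multiple of the vector width.

    ```cpp
    #include <immintrin.h>
    #include <cstdint>

    // Keep the first `len` bytes of a 16-byte block and zero the rest,
    // in one pshufb. Control byte i is i when i < len (copy that byte)
    // and 0x80 when i >= len (the set high bit forces a zero output).
    __m128i keep_prefix(__m128i block, int len) {
        alignas(16) uint8_t ctrl[16];
        for (int i = 0; i < 16; i++)
            ctrl[i] = (i < len) ? (uint8_t)i : 0x80;
        return _mm_shuffle_epi8(block, _mm_load_si128((const __m128i*)ctrl));
    }
    ```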

    Why is the definition non-symmetrical for the two 128-bit halves of the vector?

    This way, CPU designers can implement the instruction without 32-byte-wide vector units in the hardware: each 128-bit lane is shuffled independently.

    For example, the E-cores in Intel’s Alder Lake don’t have 32-byte vector units. On these cores, the _mm256_shuffle_epi8 instruction decodes into 2 micro-ops and takes 2 cycles of latency, while the 16-byte version, _mm_shuffle_epi8, decodes into 1 micro-op and takes 1 cycle of latency. Evidently, these cores split the 32-byte input vectors into two 16-byte pieces and apply the same algorithm to both pieces sequentially.

    The situation is very similar on first-generation AMD Zen processors: that instruction is also two micro-ops there, except that, unlike the slow Intel cores, Zen 1 has enough execution units to compute both pieces in parallel and deliver the complete result on the next cycle. Still, it’s two micro-ops.
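    The two-lane behavior described above can be modeled in scalar code (a sketch of the assumed semantics, matching Intel's intrinsics guide: indices never cross the 16-byte lane boundary, and a set high bit zeroes the output byte):

    ```cpp
    #include <cstdint>

    // Scalar model of _mm256_shuffle_epi8: the two 16-byte lanes are
    // shuffled independently with the same per-lane algorithm.
    void shuffle_epi8_model(const uint8_t a[32], const uint8_t b[32], uint8_t r[32]) {
        for (int lane = 0; lane < 2; lane++) {
            const uint8_t* src = a + 16 * lane;  // source bytes for this lane only
            for (int i = 0; i < 16; i++) {
                uint8_t ctrl = b[16 * lane + i];
                // Bit 7 set -> zero; otherwise low 4 bits index into the lane.
                r[16 * lane + i] = (ctrl & 0x80) ? 0 : src[ctrl & 0x0F];
            }
        }
    }
    ```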