Tags: c++, simd, intrinsics, avx2

Why does _mm256_unpacklo "jump" a double-word, and where does the documentation say so?


I find the _mm256_unpacklo_epi32 intrinsic a bit odd and cannot really reconcile its behavior with the documentation.

The instruction does the following:

#include <immintrin.h>
#include <iostream>

int main() {
    __m256i a = _mm256_set_epi32(8, 7, 6, 5, 4, 3, 2, 1);
    __m256i b = _mm256_set_epi32(16, 15, 14, 13, 12, 11, 10, 9);

    __m256i c = _mm256_unpacklo_epi32(a, b);

    int* values = (int*)&c;
    for (size_t i = 0; i < 8 - 1; i++) {
        std::cout << values[i] << ", ";
    }
    std::cout << values[7] << std::endl;
}

The output is:

1, 9, 2, 10, 5, 13, 6, 14

To me, it seems to "jump" the second lowest double-word in the two source values.


Solution

  • The doc you linked is so short it's useless if you don't already know what "The high-order data elements are ignored." means: the high half of each 128-bit lane, not the top 128 bits of the inputs.


    AVX2 versions of SSE shuffles are two 128-bit shuffles in the two lanes of the __m256i; no data moves across the 128-bit boundary. i.e. they're "in-lane" shuffles, like AVX1 vpermilps.

    vpunpackldq reads the low 64 bits of each 128-bit input. https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#techs=MMX,SSE_ALL,AVX_ALL,Other&ig_expand=7015&text=unpacklo_epi32 shows pseudocode that only reads the low 64 bits of each 128 bits of the sources:

    DEFINE INTERLEAVE_DWORDS(src1[127:0], src2[127:0]) {
        dst[31:0] := src1[31:0] 
        dst[63:32] := src2[31:0] 
        dst[95:64] := src1[63:32] 
        dst[127:96] := src2[63:32] 
        RETURN dst[127:0]   
    }
    dst[127:0] := INTERLEAVE_DWORDS(a[127:0], b[127:0])
    dst[255:128] := INTERLEAVE_DWORDS(a[255:128], b[255:128])
    dst[MAX:256] := 0
    

    Even the intrinsics guide isn't as good as the asm manual, which has a detailed Description section which explains in English (and sometimes with a diagram) what the instruction does, and sometimes hints at use-cases. For some instructions that's very helpful in understanding the pseudocode, for example shufps or pmaddubsw.

    And they often have different pseudocode, so if one seems opaque, try the other one. The asm manual also lists available intrinsics, but sometimes is out-of-date or wrong about intrinsic names.


    AVX2 has a few lane-crossing shuffles, like vpermd (dwords within one vector) and vperm2i128 (2-input 128-bit granularity, immediate control), but none that are 2-input with granularity smaller than 128-bit. For that you want AVX-512 vpermt2d (with a shuffle-control vector).


    Sometimes you don't actually need your data in a particular order within vectors, e.g. if you're going to shuffle it back later. Then widening by zeroing odd/even, or maybe for 2 vectors blending odd/even elements could mix data in a usable way. Or just unpack lo / hi - that does still get all the elements into vectors, just in a different sequence.

    (From a comment by the asker:) What's the reason for this? Is there an instruction that interleaves the lower 128-bit halves of the source vectors? The behavior doesn't seem that useful to me.

    Yeah, many of the AVX2 shuffles are basically useless, especially on their own. vpalignb is particularly bad. Not a great compromise between implementation cost vs. usefulness. AVX-512's valignd/q is lane-crossing and thus fit for the original purpose, as long as shifting by a multiple of 4 bytes works.

    The packs / unpacks at least can be combined with lane-crossing shuffles, or if you only unpacked within 128-bit lanes then the saturating pack instructions do reverse it. (If I recall correctly. But often it's best to widen by zeroing out the odd or even elements, then after narrowing, merge with shift / OR or blend.)


    If you do need a lane-crossing unpack, probably do 128-bit unpack lo/hi and vinserti128:

        __m256i a, b;
    
        __m128i lo = _mm_unpacklo_epi32(_mm256_castsi256_si128(a), _mm256_castsi256_si128(b));
        __m128i hi = _mm_unpackhi_epi32(_mm256_castsi256_si128(a), _mm256_castsi256_si128(b));
    
        __m256i c = _mm256_set_m128i(hi, lo);
        // or _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
    

    This is 3 shuffles for one result, so that sucks.

    Of course, if you want 2 vectors of results that include all your input, you'd use 256-bit unpacks and combine the lanes of the two unpack outputs (low lanes with vinserti128, high lanes with _mm256_permute2x128_si256), so 4 total shuffles (unpackhi/lo, vinserti128, vperm2i128) to get c and d. So then it's "only" twice as bad as with AVX-512 vpermt2d instead of 3x.


    int* values = (int*)&c; is strict-aliasing UB and can mis-compile in practice with GCC and/or Clang. Don't do that; store the vector to a plain int array with _mm256_storeu_si256 (or memcpy) and print from that. See print a __m128i variable.