After some processing I need to write my data and I wanted to optimize it with AVX2.
(An AVX-512 version is an optional extra; working fast with just AVX2 is the main goal.)
I have this destination format:
struct data
{
    uint32_t a, b, c, d, e;
    uint32_t f;
    uint32_t g;
};
In the source array, the first 5 elements are grouped, but the f and g elements are separate.
Overall, 7 AVX2 vectors (YMM registers / __m256i variables) hold 8 data structs, but I need to reshuffle/combine them to write to output. The way I'm planning to load my data, the first 5 regs contain packed abcde data (it was read from memory like abcdeabcdeabcde etc., abcde x8 times), then one AVX2 reg contains ffffffff, and one AVX2 reg contains gggggggg.
Registers basically contain this data, if I load 5 vectors from the abcde part and 1 vector each from the f and g arrays:
1st register: [a0,b0,c0,d0,e0, a1,b1,c1]
2nd register: [d1,e1, a2,b2,c2,d2,e2, a3]
3rd register: [b3,c3,d3,e3, a4,b4,c4,d4]
4th register: [e4, a5,b5,c5,d5,e5, a6,b6]
5th register: [c6,d6,e6, a7,b7,c7,d7,e7]
6th register: [f0,f1,f2,f3,f4,f5,f6,f7]
7th register: [g0,g1,g2,g3,g4,g5,g6,g7]
How can I reshuffle them to have abcdefg in each register?
1st register: [a0,b0,c0,d0,e0,f0,g0, a1], 2nd register: [b1,c1,d1,e1,f1,g1,a2,b2], and so on...
Overall, the goal is to rearrange the elements so they can be written out in order. The arrangement in vectors is actually an implementation detail; other ways to achieve the same result in memory are fine.
Actual code that I'm trying to speed up:
#include <cstdint>
#include <cassert>
#include <vector>

#pragma pack(push, 1)
struct Data
{
    uint32_t a, b, c, d, e;
    uint32_t f, g;
};
#pragma pack(pop)

static const int kItemsPerIteration = 1024;

void deinterleaveData(std::vector<Data>& dataOut, const std::vector<uint32_t>& dataIn)
{
    // size is in uint32_t elements (7 per item), not bytes; +1 for delta
    assert(dataIn.size() == kItemsPerIteration * (sizeof(Data) / sizeof(uint32_t)) + 1);
    dataOut.resize(kItemsPerIteration);
    uint32_t delta = dataIn[0];
    const uint32_t* a = dataIn.data() + 1;                          // abcde
    const uint32_t* f = &dataIn[0] + 1 + kItemsPerIteration * 5;    // f
    const uint32_t* g = &dataIn[0] + 1 + kItemsPerIteration * 6;    // g
    for (auto& x : dataOut)
    {
        x.a = delta + a[0];
        x.b = delta + a[1];
        x.c = delta + a[2];
        x.d = delta + a[3];
        x.e = delta + a[4];
        x.f = *f++;
        x.g = *g++;
        a += 5;
    }
}
Pavel's self-answer is the first step in the direction of the strategy I think is best with AVX2: vectorize the abcde add + copy, then store f and g.
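For reference, that first step looks roughly like this (a sketch of my reading of it, not Pavel's exact code; it reuses Data and kItemsPerIteration from the question):

#include <immintrin.h>

void deinterleaveData_avx2_baseline(Data* dataOut, const uint32_t* dataIn)
{
    uint32_t delta = dataIn[0];
    const uint32_t* a = dataIn + 1;                              // abcde
    const uint32_t* f = dataIn + 1 + kItemsPerIteration * 5;     // f
    const uint32_t* g = dataIn + 1 + kItemsPerIteration * 6;     // g
    const __m256i v_delta = _mm256_set1_epi32(delta);
    for (int i = 0; i < kItemsPerIteration; i++)
    {
        // An 8-element load covers abcde of this group plus 3 elements of the next;
        // the scalar f/g stores and the next iteration's vector store overwrite the 3 garbage lanes.
        __m256i abcde = _mm256_add_epi32(_mm256_loadu_si256((const __m256i*)(a + i*5)), v_delta);
        _mm256_storeu_si256((__m256i*)&dataOut[i], abcde);       // also writes 1 dword into dataOut[i+1]
        dataOut[i].f = f[i];                                     // scalar f/g copies
        dataOut[i].g = g[i];
    }
    // Note: the last iteration's load reads past the abcde block (into the f array, still inside
    // dataIn) and its store writes 1 dword past the end of dataOut, the same overrun issue discussed below.
}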
We can refine further:
Prepare a vector of fg pairs (with vpunpckldq/hdq), so we can do 64-bit stores of those elements.
Use vmovq and vmovhps, not vpextrq [mem], xmm, 1. vpextrq takes an ALU shuffle uop on both Intel and AMD CPUs. vmovhps mem, xmm is a pure store on Intel (e.g. ports 4/9 and 7/8 on the Ice Lake family), with no uops for ports 1 or 5 where the ALU shuffle execution units are located. (It looks like Zen 4 does use ALU uops; see https://uops.info/.) And vmovhps has no immediate operand, so it can micro-fuse into a single uop for the front-end and ROB (ReOrder Buffer). Also, vpextrq doesn't exist in 32-bit mode, which the OP apparently cares about.
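For concreteness, the pair of 64-bit stores looks like this with intrinsics (a sketch; the helper and pointer names are mine):

#include <immintrin.h>
#include <cstdint>

// Store the low fg pair with vmovq and the high fg pair with vmovhps,
// given an XMM holding [f0, g0, f1, g1] (e.g. from vpunpckldq).
static inline void store_two_fg_pairs(uint32_t* fg0_dst, uint32_t* fg1_dst, __m128i fg01)
{
    _mm_storeu_si64(fg0_dst, fg01);                          // vmovq  [mem], xmm
    _mm_storeh_pi((__m64*)fg1_dst, _mm_castsi128_ps(fg01));  // vmovhps [mem], xmm
}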
So that's 2 stores per abcdefg group (instead of 3), and many fewer loads, at a cost of a few SIMD shuffles.
The high 128-bit lane of a YMM vector is less convenient for storing 64-bit chunks to memory. Probably only vmaskmovps with the right address offset and a mask vector could get the 2 elements we want into the right place, but it's horribly slow on AMD. vpextrq r/m, ymm, imm doesn't exist, so you have to vextracti128 to get the high half into an XMM. (From there you can vmovq and vmovhps the low and high fg pair without further shuffle uops, though, so it's not a total disaster if you do it right.)
So for the last four fg pairs, we can instead shuffle them into place for vpblendd, to set up for a single 256-bit store where all 7 elements of the group (7 of the 8 stored dwords) are useful.
This means fewer total stores (which is probably a bottleneck on CPUs with 1/clock stores, like Intel before Ice Lake), but not too many extra instructions since we're not shuffling the low elements up to the top. Which we could do with vpermd; perhaps a better strategy on Skylake and earlier, and especially on Zen uarches that only do 1/clock vector stores (footnote 1) but have a wide pipeline with lots of vector ALU throughput.
#include <immintrin.h>
#include ...

void deinterleaveData_avx2_mixed_2store_and_blend(Data* dataOut, const uint32_t* dataIn)
{
    uint32_t delta = dataIn[0];
    const uint32_t* a = dataIn + 1;                              // abcde
    const uint32_t* f = &dataIn[0] + 1 + kItemsPerIteration * 5; // f
    const uint32_t* g = &dataIn[0] + 1 + kItemsPerIteration * 6; // g
    const __m256i v_delta = _mm256_set1_epi32(delta);

    for (size_t i = 0; i < kItemsPerIteration; i += 8)
    {
        // load and vpunpckldq fg lanes
        // these loads are always misaligned unless we do the first 7 groups separately,
        // since 1 + 5*size is misaligned if size is a large power of 2.
        __m256i v_f = _mm256_loadu_si256((const __m256i*)f);
        __m256i v_g = _mm256_loadu_si256((const __m256i*)g);
        __m256i v_fg_lo = _mm256_unpacklo_epi32(v_f, v_g);
        __m256i v_fg_hi = _mm256_unpackhi_epi32(v_f, v_g);

        auto ABCDE_step = [=](int j, __m256i fg, const int fg_lane)
        {   // capture everything by value instead of by reference, so we can reassign without
            // declaring temporaries, which apparently doesn't work inside switch cases.
            // fg_lane isn't constexpr enough for _mm_extract_epi64 with most compilers, only GCC with optimization enabled.
            // So we would need a CPP macro if we wanted to use extract, but it's slower and isn't available in 32-bit builds.
            auto& md = dataOut[i + j];
            __m256i abcde = _mm256_add_epi32(_mm256_loadu_si256((__m256i*)(a + j*5)), v_delta);
            switch (fg_lane) {
            case 0: // Store then overwrite FG with VMOVQ
                _mm256_storeu_si256((__m256i*)&md, abcde);
                _mm_storeu_si64(&md.f, _mm256_castsi256_si128(fg));
                break;
            case 1: // Store then overwrite FG with VMOVHPS (no ALU ports on Intel, unlike vpextrq mem, xmm, 1)
                _mm256_storeu_si256((__m256i*)&md, abcde);
                _mm_storeh_pi((__m64*)&md.f, _mm256_castps256_ps128(_mm256_castsi256_ps(fg)));
                break;
            case 2: // Shuffle and blend, then one store. VPSHUFD runs on p15 on Intel since Ice Lake
                fg = _mm256_shuffle_epi32(fg, _MM_SHUFFLE(3, 1, 0, 2));  // [f3 f2g2 g3] in high lane, low lane is don't-care
                abcde = _mm256_blend_epi32(abcde, fg, 0b0110'0000);      // VPBLENDD is cheap, any vector ALU port
                _mm256_storeu_si256((__m256i*)&md, abcde);               // [abcd | efg x]
                break;
            case 3: // Shuffle and blend, then one store
                fg = _mm256_shuffle_epi32(fg, _MM_SHUFFLE(1, 3, 2, 0));  // [f2 f3g3 g2]
                abcde = _mm256_blend_epi32(abcde, fg, 0b0110'0000);
                _mm256_storeu_si256((__m256i*)&md, abcde);
                break;
            }
        };

        ABCDE_step(0, v_fg_lo, 0);
        ABCDE_step(1, v_fg_lo, 1);
        ABCDE_step(2, v_fg_hi, 0);
        ABCDE_step(3, v_fg_hi, 1);

        ABCDE_step(4, v_fg_lo, 2);
        ABCDE_step(5, v_fg_lo, 3);
        ABCDE_step(6, v_fg_hi, 2);
        ABCDE_step(7, v_fg_hi, 3);

        f += 8;
        g += 8;
        a += 5*8;
    }
}
/*
design notes:
load and vpunpckldq / hdq to make
[fg0 fg1 | fg4 fg5]
[fg2 fg3 | fg6 fg7]
(or maybe vshufps? If only we could do something different in the low half vs. high half.)
then load ABCDE data (with memory-source VPADDD)
[abcd0 | e0 abc1 ] load
=[abcd0 | e0 xxx] store unmodified
= fg0 movq store
[abcd1 | e1 abc2 ] load
=[abcd1 | e1 xxx ] store unmodified, overlapping 1 element
= fg1 movhps store
=[abcd2 | e2 xxx ] load+store unmodified
= fg2 movq store from fg_high
=[abcd3 | e3 xxx ] load+store unmodified
= fg3 movhps store from fg_high
Then shuffle high half of fg_low/high to set up for blends:
[xxxx | f5 fg4 g5 ] vpshufd (fg_low)
=[abcd4 | e4 fg4 x ] load + VPBLENDD
[xxxx | f4 fg5 g4 ] vpshufd (fg_low)
=[abcd5 | e5 fg5 x ] load + VPBLENDD
# or maybe [abcd5 | e5 a6 fg5 ] load + VPBLENDD with original unpcklo fg
# then another shuffle like vpermilps? That's unfortunately port-5 only
Then repeat with fg_high for fg6 and fg7
*/
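Usage is straightforward; the only caveat is sizing the output so the 1-element overrun discussed below stays in bounds (a sketch, with my own choice of one padding struct):

#include <vector>

void example_usage(const std::vector<uint32_t>& dataIn)  // laid out as in the question: delta, abcde x1024, f x1024, g x1024
{
    // +1 padding struct so the final vector store's 1-dword overrun stays inside the allocation
    std::vector<Data> dataOut(kItemsPerIteration + 1);
    deinterleaveData_avx2_mixed_2store_and_blend(dataOut.data(), dataIn.data());
}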
Godbolt with GCC/Clang and MSVC, some of it 32-bit. Thanks to @Pavel for the code I started from to add this functionality.
If using std::vector, you'll want to change the lambda to capture at least the vector by reference, perhaps everything by reference. The lambda modifies fg and is expecting that change to be discarded, but that will still happen because __m256i fg is an explicit arg, not a capture.
Other possible style tweaks include just giving the lambda a pointer or reference to the element it should store to, instead of int j and using the captured output vector. See discussion in comments under Pavel's answer; he also has a version using a #define instead.
Note that the last vector will write 1 element past the end of the last group. If your real size is often 1024, it might be best to special-case the end instead of writing to a padding element that's in a separate 4K page. Outside the loop, ideally we still have the fg_high from the last iteration: vextracti128 of the top fg pair, a vmovq or vpaddd xmm, xmm, [mem] load of the de elements (plus trailing garbage is fine), and vpblendd to make defg. Store that and an overlapping abcd (prepared with another vpaddd x, x, [mem]) so we end up doing a 7-element store with 2x _mm_storeu_si128.
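That epilogue could look something like this (a hedged, untested sketch; the helper and argument names are mine; a is the start of the abcde block, last = kItemsPerIteration - 1, v_fg_hi is the fg_hi vector from the final loop iteration so its top 64 bits are the last group's [f, g], and v_delta_x is a 128-bit copy of the delta vector):

#include <immintrin.h>
#include <cstdint>

static inline void store_last_group(Data* dataOut, const uint32_t* a,
                                    __m256i v_fg_hi, __m128i v_delta_x, size_t last)
{
    __m128i fg_top = _mm256_extracti128_si256(v_fg_hi, 1);    // vextracti128: [f6 g6 f7 g7]
    // de + trailing garbage (the garbage lanes read into the f array, still inside dataIn)
    __m128i de_xx  = _mm_add_epi32(_mm_loadu_si128((const __m128i*)(a + last*5 + 3)), v_delta_x);
    __m128i defg   = _mm_blend_epi32(de_xx, fg_top, 0b1100);  // [d e f7 g7]
    __m128i abcd   = _mm_add_epi32(_mm_loadu_si128((const __m128i*)(a + last*5)), v_delta_x);
    Data* md = &dataOut[last];
    _mm_storeu_si128((__m128i*)&md->a, abcd);                  // a..d
    _mm_storeu_si128((__m128i*)&md->d, defg);                  // overlapping d..g, ends exactly at the struct's end
}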
We can handle the last 2 groups of each 8 differently, if we want. The top fg pair in each of fg_low and fg_high can be blended into a load+add result without overwriting any abcde data we want to keep. Then we just need a 1-input shuffle to reorder the elements. Like vpermilps with a control vector that keeps the low 128 bits in the same order? In the upper 128 bits, we can put the next group's a after this fg, setting up for a store where we use all 8 elements. But that's rather inconvenient, making the next iteration start on a c if we started on a. 2 and 5 are relatively prime to each other, so the starting element wouldn't repeat back to a for 5 iterations of different shuffle patterns.
But I think we can use this to solve the tail-overrun problem. If the second-last vector (j=6) has 8 useful elements, we can store it last, overwriting the last element of vec #5 and the first element of vec #7. Vec #7 can be loaded offset by -1 so it's [x abc | de fg], ending at the end of the set of 8 groups. (And not needing a shuffle, just a blend with fg_high.)
It's hopefully equal performance (or even a win) to do this inside the main loop, so it doesn't overwrite the end of the output for sizes that are a multiple of 8 groups. Unfortunately vpermilps can only run on port 5 on Intel, not on the second shuffle unit on port 1 in Ice Lake and later, but vpunpckl/hdq and vpshufd can, and of course vpblendd and vpaddd, so it should be fine. It does need a control vector constant to be loaded (outside the loop).
vpshufb would work, and does run on port 1. Shuffling every byte separately means we can't compress the shuffle mask with vpmovzxbd, but it's an integer shuffle in case that matters for bypass latency (it doesn't on Intel, last I checked).
FIXME: it's actually the j=5 vector that uses the top pair from fg_lo. fg6 and fg7 are both in fg_hi, where they can't both be the top element. This might defeat this idea, unless maybe we can do the initial fg unpack differently, e.g. with vshufps, to put fg6 at the top of one vector and fg7 at the top of the other. But vshufps can't make adjacent pairs: the low 2 elements of each 128-bit lane have to come from the first source, the high 2 from the second source. So 2x vshufps + 2x vpshufd? Or we could make [f7 f5g5 g7] directly with vshufps, avoiding a vpshufd we'd otherwise do later as part of the j=4-7 iterations. A vpshufd-immediate on that can make a vector with a useful low half for movq and movhps (I think), with [f5g5 f7g7] in the top.
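Concretely, that vshufps step could be something like this (untested exploratory sketch; the immediates are just my reading of the idea above):

#include <immintrin.h>

// Make [f7, f5, g5, g7] in the high lane directly with vshufps (low lane gets
// [f3, f1, g1, g3]), so fg5 already sits in YMM elements 5,6 for vpblendd.
static inline __m256i fg_via_vshufps(__m256i v_f, __m256i v_g)  // v_f = [f0..f7], v_g = [g0..g7]
{
    return _mm256_castps_si256(
        _mm256_shuffle_ps(_mm256_castsi256_ps(v_f), _mm256_castsi256_ps(v_g),
                          _MM_SHUFFLE(3, 1, 1, 3)));
}
// One vpshufd on that gives [f5 g5 f7 g7] in the top (and [f1 g1 f3 g3] in the
// low half, usable for the movq / movhps stores):
//   _mm256_shuffle_epi32(fg_via_vshufps(v_f, v_g), _MM_SHUFFLE(3, 0, 2, 1));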
Clang pessimizes _mm_storeh_pi((__m64*)&md.f, _mm256_castps256_ps128(_mm256_castsi256_ps(fg))); to vpshufd + vmovq instead of vmovhps. As usual, its shuffle optimizer defeats attempts at cleverness, improving bad code but sometimes worsening good code. GCC and MSVC compile it as written.
Skylake / Coffee Lake probably still bottlenecks on store throughput, but not as badly as if we'd used 2 stores for every output group.
Per loop iteration (8 output groups), counting from the code above, this is roughly 10 vector loads (2 for f/g plus 8 for abcde), 12 stores (8x 256-bit plus 4x 64-bit), 8 vpaddd, 6 shuffles (2 unpacks plus 4 vpshufd), and 4 vpblendd.
Compilers are avoiding indexed addressing modes (at least here; IDK about your std::vector version...) so port 7 can be used for store-address uops, allowing up to 2 loads + 1 store per clock, but we need less than 1 load per store with this strategy.
Footnote 1: Wikichip claims Zen 3 can do 2/clock stores if they're not 256-bit, but that appears to be wrong. https://uops.info/ and https://agner.org/optimize/ both found that Zen 3 and Zen 4 only do 1 store per clock from vector registers, even for movq [mem], xmm. It's only scalar integer stores like mov [mem], reg or imm that have 2/clock throughput, not narrow vector stores. Zen 3 and later do have multiple execution units that vector stores can run on, so maybe it's something about commit to L1d cache that's the bottleneck, rather than writing them to the store buffer? uops.info's test was to have every store to the same address, which you'd expect hardware to handle well.
So it might be best to do all 8 groups with shuffle + blend on Zen-family, and on Intel before Ice Lake. vpermd with a vector constant is the obvious choice for getting elements from the bottom half into position for a blend (replacing the _mm256_shuffle_epi32), but it's 2 uops on Zen 2 and 3.
Still 1/clock throughput so it's not a disaster, but on those CPUs, 2x vpshufd xmm (on fg_lo and fg_hi) + 2x vinserti128 (to duplicate the low half to the high) + 2x vpshufd ymm (to make the other arrangement in the high lane) would be only 6 total shuffle uops instead of 8.
(Using XMM shuffles and vinserti128 when possible is better on Zen 1, which only has 128-bit shuffle execution units. If tuning the whole algorithm for Zen 1 you might do something different, but I'm looking at Zen 1 optimizations that don't make it any worse for Zen 2 or 3. Intel Gracemont E-cores in Alder Lake and later also only have 128-bit vector execution units so would similarly benefit; vpermd and vpshufd are both 2 uops on Gracemont.)
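A sketch of that 2x vpshufd xmm + 2x vinserti128 + 2x vpshufd ymm idea, per source vector (the shuffle immediates are my choice; the high-lane fg pairs still use the existing vpshufd path):

#include <immintrin.h>

// For the two fg pairs in the low half of fg_lo (= [f0 g0 f1 g1 | f4 g4 f5 g5]):
// 1x vpshufd xmm + 1x vinserti128 + 1x vpshufd ymm instead of two lane-crossing vpermd.
static inline void low_fg_pairs_for_blend(__m256i fg_lo, __m256i* blend_src_fg0, __m256i* blend_src_fg1)
{
    __m128i lo  = _mm256_castsi256_si128(fg_lo);                            // [f0 g0 f1 g1]
    __m128i s   = _mm_shuffle_epi32(lo, _MM_SHUFFLE(3, 1, 0, 2));           // [f1 f0 g0 g1]
    __m256i dup = _mm256_inserti128_si256(_mm256_castsi128_si256(s), s, 1); // both lanes = [f1 f0 g0 g1]
    *blend_src_fg0 = dup;                                                   // YMM elements 5,6 = f0,g0
    *blend_src_fg1 = _mm256_shuffle_epi32(dup, _MM_SHUFFLE(2, 3, 0, 1));    // YMM elements 5,6 = f1,g1
}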
Or we could do one vpermd for each of fg_lo and fg_hi, creating [f1 f0 g0 g1] in the top half. Then [x f1 g1 x] is just one in-lane vpshufd away. That minimizes lane-crossing shuffles (just 2), and minimizes total shuffle uops (just 1 per blend, plus the 2 to combine fg data initially) across Intel and Zen. (Other than Zen 1 and Gracemont, and it's not bad there.)
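As a sketch (the control vector is my choice; I just keep the low half unchanged):

#include <immintrin.h>

// From fg_lo = [f0 g0 f1 g1 | f4 g4 f5 g5], one lane-crossing vpermd puts
// [f1 f0 g0 g1] in the top half (low half kept as-is), so YMM elements 5,6
// are already f0,g0 for the first blend.
static inline __m256i fg_lo_permute(__m256i fg_lo)
{
    const __m256i perm = _mm256_setr_epi32(0, 1, 2, 3,  2, 0, 1, 3);
    return _mm256_permutevar8x32_epi32(fg_lo, perm);
}
// The second blend source, [x f1 g1 x] in the top half, is then one in-lane vpshufd away:
//   _mm256_shuffle_epi32(fg_lo_permute(fg_lo), _MM_SHUFFLE(2, 3, 0, 1));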
The initial combining of f and g data could maybe be improved if we're never going to use those 64-bit-aligned fg pairs directly, only as setup for blends. Like perhaps ffff gggg with vinserti128, which doesn't need port 5.
We could load the abcde data so there's a garbage element at the start instead of the end, meaning we want an fg at the very top of the vector, so the fg is naturally aligned within a 64-bit vector element. (And so 2 of the fg pairs are already in the right place after unpack, saving those two vpshufd shuffles.)
To do overlapping stores with that, we'd have to work backwards, from the ends of the input and output arrays. Which is ok, especially if the data is hot in L1d cache. (vmaskmovps is also possible for looping forwards and ignoring first-element garbage, but it's slower on Intel, and a lot slower for stores on AMD, surprisingly even on Zen 4 and later which have good AVX-512 masked-store throughput.)