Tags: c++, simd, avx2, avx512

AVX2 repack an array of structs of 5 ints to structs of 7 ints, with the extra elements from other arrays? Shuffle/combine for 8 YMM registers?


After some processing I need to write my data and I wanted to optimize it with AVX2.
(An AVX-512 version is an optional extra; working fast with just AVX2 is the main goal.)

I have this destination format:

struct data
{
    uint32_t a, b, c, d, e;
    uint32_t f;
    uint32_t g;
};

In the source array, the first 5 elements are grouped, but the f and g elements are separate.

Overall, 7 AVX2 registers (YMM registers / __m256i variables) hold 8 data structs, but I need to reshuffle/combine them to write the output.

The way I'm planning to load my data, the first 5 regs contain packed abcde data (it was read from memory like abcdeabcdeabcde etc., abcde x8 times), then one AVX2 reg contains ffffffff, and one AVX2 reg contains gggggggg.

Registers basically contain this data, if I load 5 vectors from the abcde part and 1 vector each from the f and g arrays:

1st register: [a0,b0,c0,d0,e0, a1,b1,c1]

2nd register: [d1,e1, a2,b2,c2,d2,e2, a3]

3rd register: [b3,c3,d3,e3, a4,b4,c4,d4]

4th register: [e4, a5,b5,c5,d5,e5, a6,b6]

5th register: [c6,d6,e6, a7,b7,c7,d7,e7]

6th register: [f0,f1,f2,f3,f4,f5,f6,f7]

7th register: [g0,g1,g2,g3,g4,g5,g6,g7]
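
For concreteness, a minimal sketch of the loads that would produce this layout (assuming a, f and g point at the packed abcde data and the separate f / g arrays, as in the code further down):

__m256i r1 = _mm256_loadu_si256((const __m256i*)(a +  0)); // a0 b0 c0 d0 e0 a1 b1 c1
__m256i r2 = _mm256_loadu_si256((const __m256i*)(a +  8)); // d1 e1 a2 b2 c2 d2 e2 a3
__m256i r3 = _mm256_loadu_si256((const __m256i*)(a + 16)); // b3 c3 d3 e3 a4 b4 c4 d4
__m256i r4 = _mm256_loadu_si256((const __m256i*)(a + 24)); // e4 a5 b5 c5 d5 e5 a6 b6
__m256i r5 = _mm256_loadu_si256((const __m256i*)(a + 32)); // c6 d6 e6 a7 b7 c7 d7 e7
__m256i r6 = _mm256_loadu_si256((const __m256i*)f);        // f0 f1 f2 f3 f4 f5 f6 f7
__m256i r7 = _mm256_loadu_si256((const __m256i*)g);        // g0 g1 g2 g3 g4 g5 g6 g7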

How can I reshuffle them to have abcdefg in each register? 1st register: [a0,b0,c0,d0,e0,f0,g0, a1], 2nd register: [b1,c1,d1,e1,f1,g1,a2,b2] and so on ...

Overall, the goal is to rearrange the elements so they can be written out in order. The arrangement in vectors is actually an implementation detail; other ways to achieve the same result in memory are fine.


Actual code that I'm trying to speed up:

#include <cassert>
#include <cstdint>
#include <vector>

#pragma pack(push, 1)
struct Data
{
    uint32_t a,b,c,d,e;
    uint32_t f, g;
};
#pragma pack(pop)

static const int kItemsPerIteration = 1024;

void deinterleaveData(std::vector<Data>& dataOut, const std::vector<uint32_t>& dataIn)
{
    assert(dataIn.size() == kItemsPerIteration * (sizeof(Data) / sizeof(uint32_t)) + 1); // + 1 for delta
    dataOut.resize(kItemsPerIteration);
    uint32_t delta = dataIn[0];
    const uint32_t* a = dataIn.data() + 1; // abcde
    const uint32_t* f = &dataIn[0] + 1 + kItemsPerIteration * 5; // f
    const uint32_t* g = &dataIn[0] + 1 + kItemsPerIteration * 6; // g
    for (auto& x : dataOut)
    {
        x.a = delta + a[0];
        x.b = delta + a[1];
        x.c = delta + a[2];
        x.d = delta + a[3];
        x.e = delta + a[4];
        x.f = *f++;
        x.g = *g++;
        a += 5;
    }
}

https://godbolt.org/z/xr3aTeoo5


Solution

  • AVX2 - overlapping stores, some shuffle + blend is probably best

    Pavel's self-answer is the first step in the direction of the strategy I think is best with AVX2: vectorize the abcde add + copy, then store f and g separately.

    We can refine further:

    #include <immintrin.h>
    #include ...
    
    void deinterleaveData_avx2_mixed_2store_and_blend(Data* dataOut, const uint32_t* dataIn)
    {
        uint32_t delta = dataIn[0];
        const uint32_t* a = dataIn + 1; // abcde
        const uint32_t* f = &dataIn[0] + 1 + kItemsPerIteration * 5; // f
        const uint32_t* g = &dataIn[0] + 1 + kItemsPerIteration * 6; // g
        const __m256i v_delta = _mm256_set1_epi32(delta);
        for (size_t i = 0; i < kItemsPerIteration; i += 8)
        {
            // load and vpunpckldq fg lanes
            // these loads are always misaligned unless we do the first 7 groups separately,
            // since 1 + 5*size is misaligned if size is a large power of 2.
            __m256i v_f = _mm256_loadu_si256((const __m256i*)f);
            __m256i v_g = _mm256_loadu_si256((const __m256i*)g);
            __m256i v_fg_lo = _mm256_unpacklo_epi32(v_f, v_g);
            __m256i v_fg_hi = _mm256_unpackhi_epi32(v_f, v_g);
    
          auto ABCDE_step = [=](int j, __m256i fg, const int fg_lane)
          {         // capture everything by value instead of by reference, so we can reassign without declaring temporaries, which apparently doesn't work inside switch cases.
            // fg_lane isn't constexpr enough for _mm_extract_epi64 with most compilers; only GCC with optimization enabled accepts it.
            // So we would need a CPP macro if we wanted to use extract, but it's slower anyway and isn't available in 32-bit builds.
            auto& md = dataOut[i + j];
            __m256i abcde = _mm256_add_epi32(_mm256_loadu_si256((__m256i*)(a + j*5)), v_delta);
            switch(fg_lane){
                case 0:  // Store then overwrite FG with VMOVQ
                    _mm256_storeu_si256((__m256i*)&md, abcde);
                    _mm_storeu_si64(&md.f, _mm256_castsi256_si128(fg));
                    break;
    
                case 1:  // Store then overwrite FG with VMOVHPS (no ALU ports on Intel, unlike vpextrq mem, xmm, 1)
                    _mm256_storeu_si256((__m256i*)&md, abcde);
                    _mm_storeh_pi((__m64*)&md.f, _mm256_castps256_ps128(_mm256_castsi256_ps(fg)));
                    break;
    
                case 2: // Shuffle and blend, then one store. VPSHUFD runs on p15 on Intel since Ice Lake
                    fg = _mm256_shuffle_epi32(fg, _MM_SHUFFLE(3, 1,0, 2));   // [f3 f2g2 g3] in high lane, low lane is don't-care
                    abcde = _mm256_blend_epi32(abcde, fg, 0b0110'0000);  // VPBLENDD is cheap, any vector ALU port
                    _mm256_storeu_si256((__m256i*)&md, abcde);   // [abcd | efg x]
                    break;
    
                case 3: // Shuffle and blend, then one store
                    fg = _mm256_shuffle_epi32(fg, _MM_SHUFFLE(1, 3,2, 0));   // [f2 f3g3 g2]
                    abcde = _mm256_blend_epi32(abcde, fg, 0b0110'0000);
                    _mm256_storeu_si256((__m256i*)&md, abcde);
                    break;
    
            }
          };
    
            ABCDE_step(0, v_fg_lo, 0);
            ABCDE_step(1, v_fg_lo, 1);
            ABCDE_step(2, v_fg_hi, 0);
            ABCDE_step(3, v_fg_hi, 1);
            ABCDE_step(4, v_fg_lo, 2);
            ABCDE_step(5, v_fg_lo, 3);
            ABCDE_step(6, v_fg_hi, 2);
            ABCDE_step(7, v_fg_hi, 3);
    
            f += 8;
            g += 8;
            a += 5*8;
        }
    }
    
    
    
    /*
    design notes:
    load and vpunpckldq / hdq to make
    [fg0 fg1 | fg4 fg5]  
    [fg2 fg3 | fg6 fg7]
    (or maybe vshufps?  If only we could do something different in the low half vs. high half.)
    
    then load ABCDE data (with memory-source VPADDD)
     [abcd0 | e0 abc1 ] load
    =[abcd0 | e0 xxx]  store unmodified
               = fg0 movq store
    
     [abcd1 | e1 abc2 ] load
    =[abcd1 | e1 xxx ] store unmodified, overlapping 1 element
                = fg1 movhps store
    
    =[abcd2 | e2 xxx ] load+store unmodified
               = fg2 movq store from fg_high
    
    =[abcd3 | e3 xxx ] load+store unmodified
               = fg3 movhps store from fg_high
    
    Then shuffle high half of fg_low/high to set up for blends:
    
    [xxxx  | f5 fg4 g5 ] vpshufd (fg_low)
    =[abcd4 | e4 fg4 x ] load + VPBLENDD
    
     [xxxx  | f4 fg5 g4 ] vpshufd (fg_low)
    =[abcd5 | e5 fg5 x ] load + VPBLENDD
    
      # or maybe [abcd5 | e5 a6 fg5 ] load + VPBLENDD with original unpcklo fg
      # then another shuffle like vpermilps?  That's unfortunately port-5 only
    
    Then repeat with fg_high for fg6 and fg7
    */
    

    Godbolt with GCC/Clang and MSVC, including some 32-bit builds. Thanks to @Pavel for the code I started from when adding this function.

    If using std::vector, you'll want the lambda to capture at least the vector by reference, or perhaps everything by reference. The lambda modifies fg and expects that change to be discarded; that still happens because __m256i fg is an explicit arg, not a capture.
    Other possible style tweaks include just giving the lambda a pointer or reference to the element it should store to, instead of int j and using the captured output vector. See discussion in comments under Pavel's answer; he also has a version using a #define instead.
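
    For reference, one possible way to drive the kernel from the original std::vector interface (my own glue, not part of the code above). Note that the final 32-byte store writes 4 bytes past the last 28-byte struct, so the destination needs one element of slack (or the last group handled specially):

    void deinterleaveData_avx2(std::vector<Data>& dataOut, const std::vector<uint32_t>& dataIn)
    {
        dataOut.resize(kItemsPerIteration + 1);  // one element of padding for the final over-wide store
        deinterleaveData_avx2_mixed_2store_and_blend(dataOut.data(), dataIn.data());
        dataOut.resize(kItemsPerIteration);      // trim the padding; shrinking doesn't reallocate
    }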


    Footnote 1: 2/clock store throughput or not

    Wikichip claims Zen 3 can do 2/clock stores if they're not 256-bit, but that appears to be wrong.

    https://uops.info/ and https://agner.org/optimize/ both found that Zen 3 and Zen 4 only do 1 store per clock from vector registers, even for movq [mem], xmm. It's only scalar integer stores like mov [mem], reg or imm that have 2/clock throughput, not narrow vector stores.

    Zen 3 and later do have multiple execution units that vector stores can run on, so maybe it's something about commit to L1d cache that's the bottleneck, rather than writing them to the store buffer? uops.info's test was to have every store to the same address, which you'd expect hardware to handle well.

    So it might be best to do all 8 groups with shuffle + blend on Zen-family, and Intel before Ice Lake. vpermd with a vector constant is the obvious choice for getting elements from the bottom half into position for a blend (replacing the _mm256_shuffle_epi32), but it's 2 uops on Zen 2 and 3.
    Still 1/clock throughput so it's not a disaster, but on those CPUs 2x vpshufd xmm (on fg_lo and fg_hi) + 2x vinserti128 (to duplicate the low half to the high) + 2x vpshufd ymm (to make the other arrangement in the high lane) would be only 6 total shuffle uops instead of 8.
    (Using XMM shuffles and vinserti128 when possible is better on Zen 1, which only has 128-bit shuffle execution units. If tuning the whole algorithm for Zen 1 you might do something different, but I'm looking at Zen 1 optimizations that don't make it any worse for Zen 2 or 3. Intel Gracemont E-cores in Alder Lake and later also only have 128-bit vector execution units so would similarly benefit; vpermd and vpshufd are both 2 uops on Gracemont.)
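
    As a concrete illustration of that 6-shuffle arrangement (my own sketch, reusing the v_fg_lo naming from the code above; the same three shuffles repeat for v_fg_hi):

    __m128i lo   = _mm256_castsi256_si128(v_fg_lo);                 // [f0 g0 f1 g1]
    __m128i perm = _mm_shuffle_epi32(lo, _MM_SHUFFLE(3, 1, 0, 2));  // [f1, f0, g0, g1]
    __m256i dup  = _mm256_inserti128_si256(_mm256_castsi128_si256(perm), perm, 1); // duplicate into both lanes
    // dup has f0,g0 in dword positions 5..6: ready to vpblendd into group 0's abcde vector
    __m256i fg1  = _mm256_shuffle_epi32(dup, _MM_SHUFFLE(3, 3, 0, 0));
    // fg1 has f1,g1 in positions 5..6: ready to vpblendd into group 1's abcde vector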

    Or we could do one vpermd for each of fg_lo and fg_hi, creating [f1 f0 g0 g1] in the top half. Then [x f1 g1 x] is just one in-lane vpshufd away. That minimizes lane-crossing shuffles (just 2), and minimizes total shuffle uops (just 1 per blend, plus the 2 to combine fg data initially) across Intel and Zen. (Other than Zen 1 and Gracemont, and it's not bad there.)
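
    A sketch of that variant (again my own, with abcde0 / abcde1 standing in for the already delta-added loads of groups 0 and 1):

    const __m256i lo_pairs_up = _mm256_setr_epi32(0, 0, 0, 0,  2, 0, 1, 3);  // low half: don't-care
    __m256i fg01 = _mm256_permutevar8x32_epi32(v_fg_lo, lo_pairs_up);    // vpermd: [xxxx | f1 f0 g0 g1]
    __m256i out0 = _mm256_blend_epi32(abcde0, fg01, 0b0110'0000);        // fg0 already in place
    __m256i fg1x = _mm256_shuffle_epi32(fg01, _MM_SHUFFLE(3, 3, 0, 0));  // high lane: [f1 f1 g1 g1]
    __m256i out1 = _mm256_blend_epi32(abcde1, fg1x, 0b0110'0000);        // fg1 now in place

    The same permute applied to v_fg_hi covers fg2 / fg3, while fg4..fg7 keep the in-lane vpshufd path from the main loop above.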

    The initial combining of f and g data could maybe be improved if we're never going to use those 64-bit-aligned fg pairs directly, only as setup for blends. Like perhaps ffff gggg with vinserti128, which doesn't need port 5.
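
    For example (a sketch; on Intel a memory-source vinserti128 is just a load uop with no shuffle port needed, and compilers typically fold the 128-bit load into it):

    __m128i v_f_lo = _mm_loadu_si128((const __m128i*)f);    // f0 f1 f2 f3
    __m256i v_ffgg = _mm256_inserti128_si256(_mm256_castsi128_si256(v_f_lo),
                                             _mm_loadu_si128((const __m128i*)g), 1);
    // v_ffgg = [f0 f1 f2 f3 | g0 g1 g2 g3]: only useful as setup for later shuffles / blends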

    We could load the abcde data so there's a garbage element at the start instead of end, meaning we want an fg at the very top of the vector, so the fg is naturally aligned within a 64-bit vector element. (And so 2 of the fg pairs are already in the right place after unpack, saving those two vpshufd shuffles).
    To do overlapping stores with that we'd have to work backwards, from the ends of the input and output arrays. Which is ok, especially if data is hot in L1d cache. (vmaskmovps is also possible for looping forwards and ignoring first-element garbage, but it's slower on Intel, and a lot slower for stores on AMD, surprisingly even on Zen 4 and later which has good AVX-512 masked-store throughput.)
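
    For reference, the masked-store idea would look something like this (a sketch only; md is the current output struct as in the main loop, vec is assumed to hold garbage in element 0 and a..g of one struct in elements 1..7, and per the note above this is the slower option on most CPUs):

    const __m256i keep_upper7 = _mm256_setr_epi32(0, -1, -1, -1, -1, -1, -1, -1);
    // vpmaskmovd, the integer form of vmaskmovps: the masked-off garbage element isn't stored
    _mm256_maskstore_epi32((int*)&md - 1, keep_upper7, vec);  // store starts one dword early; element 0 is masked off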