Tags: c++, x86, intrinsics, avx2, avx512

Fallback implementation for conflict detection in AVX2


AVX512CD contains the intrinsic _mm512_conflict_epi32(__m512i a). It returns a vector where, for every element in a, a bit is set for each earlier (less significant) element that has the same value. Is there a way to do something similar in AVX2?

I'm not interested in the exact bits; I just need to know which elements are duplicates of the elements to their left (or right). I simply need to know if a scatter would conflict.

Basically I need an AVX2 equivalent for

__m256i detect_conflict(__m256i a) {
  __m256i cd = _mm256_conflict_epi32(a);
  return _mm256_cmpgt_epi32(cd, _mm256_set1_epi32(0));
}

The only way I could think of is to use _mm256_permutevar8x32_epi32(), shift each value right by 1 (across the lanes), then do seven compares, mask out the unused bits and then _mm256_or_si256() them together, which is horribly slow.
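
For reference, here's roughly what I mean (untested sketch; without the masking step it flags conflicts in both directions):

#include <immintrin.h>

// Rotate v by one element position 7 times, compare against the original each
// time, and OR the results: 7 lane-crossing shuffles + 7 compares + 7 ORs.
// conflicts[i] ends up all-ones if v[i] equals any other element.
__m256i detect_conflict_bruteforce(__m256i v) {
  __m256i rot1_idx  = _mm256_setr_epi32(1,2,3,4,5,6,7,0);  // rotate by one element
  __m256i rotated   = v;
  __m256i conflicts = _mm256_setzero_si256();
  for (int k = 1; k < 8; k++) {
    rotated   = _mm256_permutevar8x32_epi32(rotated, rot1_idx);
    conflicts = _mm256_or_si256(conflicts, _mm256_cmpeq_epi32(v, rotated));
  }
  return conflicts;
}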


Solution

  • TL:DR: Since full detection of which elements conflict is expensive, it's probably worth doing more fall-back work in exchange for cheaper detection. This depends on your conflict-handling options / strategies.

    I came up with a fairly efficient way to check for presence/absence of conflicts without finding their locations, like this answer for 64-bit integer elements. It's actually faster than Skylake-AVX512's micro-coded vpconflictd ymm, but of course it gives you much less information. (KNL has fast vpconflictd.)
    Zen 4 and 5 have fast 2-uop vpconflictd, and Zen 5 revived vp2intersectd from Tiger Lake (gone in Sapphire Rapids), so if you have AVX-512, those are probably useful for AMD uarches.

    You could use a fully-scalar fallback for all the elements if there are any conflicts. This would work well if conflicts are rare enough that branch-mispredicts don't kill performance. (AVX2 doesn't have scatter instructions in the first place, though, so I'm not sure exactly what you need this for.)
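
    A minimal sketch of that branching pattern, assuming a histogram-style update where conflicting indices would make a vectorized read-modify-write unsafe (process_vector() and process_scalar() are hypothetical placeholders, not part of this answer):

    void update_bins(int *bins, __m256i idx)
    {
        if (!any_conflicts32(idx)) {             // any_conflicts32() is defined below
            process_vector(bins, idx);           // all 8 indices unique: vectorized read-modify-write is safe
        } else {
            alignas(32) int i[8];
            _mm256_store_si256((__m256i*)i, idx);
            for (int k = 0; k < 8; k++)          // rare slow path: fully scalar, duplicates handled naturally
                process_scalar(bins, i[k]);      // hypothetical per-element helper
        }
    }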

    The only-left or only-right behaviour is hard, but my method can give you a mask of which elements have conflicts with any other element (e.g. v[0] == v[3] would result in both conflict[0] and conflict[3] being true). This costs only 1 extra shuffle, or maybe 0 with a redesign with this goal in mind.

    (I misread the question at first; I thought you wanted to check both directions, rather than talking about two different implementation options for most of what vpconflictd does. Actually at first I thought you just wanted a presence/absence check, like bool any_conflicts(__m256i).)


    Finding presence/absence of any conflicts: bool any_conflicts32(__m256i)

    8 choose 2 is 28 total scalar comparisons. That's 3.5 vectors of packed comparisons. We should aim to do it with 4 vector compares, which leaves room for some redundancy.

    Creating inputs for those compares will require shuffles, and some of those will have to be lane-crossing. 4 unique comparisons require at least 4 vectors (including the initial unshuffled copy), since 3 choose 2 is only 3.

    Ideally as few as possible of the shuffles are lane-crossing, and there is lots of ILP for the compares and ORing of compare results. Also nice if the shuffles don't need a vector shuffle-control, just an imm8. Also good if they're not slow on AMD Zen 1, where 256b instructions are decoded into multiple 128b uops. (Some shuffles are worse than others for this, e.g. vperm2i128 is very bad; much worse than vpermq for swapping the high and low halves of a single vector. Unfortunately clang gets this wrong even with -mtune=znver1, and compiles _mm256_permute4x64_epi64 into vperm2i128 whenever it can. Zen 2 and later have 256-bit execution units, but still some shuffles are multi-uop until Zen 3 or 4. https://uops.info/)

    I found a solution pretty early that achieves most of these goals: 3 shuffles, 4 compares. One of the shuffles is in-lane. All of them use an immediate control byte instead of a vector.

    // returns a 0 or non-zero truth value
    int any_conflicts32(__m256i v)
    {
        __m256i hilo       = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2));  // vpermq is much more efficient than vperm2i128 on Zen 1 and KNL, same on HSW/SKL.
        __m256i inlane_rotr1 = _mm256_shuffle_epi32(v, _MM_SHUFFLE(0,3,2,1));
        __m256i full_rotl2 = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(2,1,0,3));
    
        __m256i v_ir1 = _mm256_cmpeq_epi32(v, inlane_rotr1);
        __m256i v_hilo= _mm256_cmpeq_epi32(v, hilo);           // only really needs to be a 128b operation on the low lane, with leaving the upper lane zero.
                                                               // But there's no ideal way to express that with intrinsics, since _mm256_castsi128_si256 technically leaves the high lane undefined
                                                               // It's extremely likely that casting down and back up would always compile to correct code, though (using the result in a zero-extended register).
        __m256i hilo_ir1 = _mm256_cmpeq_epi32(hilo, inlane_rotr1);
        __m256i v_fl2 = _mm256_cmpeq_epi32(v, full_rotl2);
    
        __m256i t1 = _mm256_or_si256(v_ir1, v_hilo);
        __m256i t2 = _mm256_or_si256(t1, v_fl2);
        __m256i conflicts = _mm256_or_si256(t2, hilo_ir1);    // A serial dep chain instead of a tree is probably good because of resource conflicts from limited shuffle throughput
    
        // if you're going to branch on this, movemask/test/jcc is more efficient than ptest/jcc
    
        unsigned conflict_bitmap = _mm256_movemask_epi8(conflicts);  // With these shuffles, positions in the bitmap aren't actually meaningful
        return conflict_bitmap;       // 0 or non-zero; cast to bool if you want a strict 0/1
    }
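
    A quick sanity check of the function above might look like this (hypothetical test values; the element-0 == element-7 duplicate is the "ah" pair, which the hilo == lrotr1 compare covers, per the design notes below):

    #include <assert.h>

    void test_any_conflicts32() {
        __m256i no_dup  = _mm256_setr_epi32(0,1,2,3,4,5,6,7);
        __m256i has_dup = _mm256_setr_epi32(9,1,2,3,4,5,6,9);   // element 0 == element 7
        assert(!any_conflicts32(no_dup));
        assert( any_conflicts32(has_dup));
    }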
    

    How I designed this:

    I made a table of all the element-pairs that needed to be checked, and made columns for which shuffled operands could take care of that requirement.

    I started with a few shuffles that could be done cheaply, and it turned out my early guesses worked well enough.

    My design notes:

        // 7 6 5 4 | 3 2 1 0
    
        // h g f e | d c b a
        // e h g f | a d c b    // inlanerotr1 = vpshufd(v)
        // f e d c | b a h g    // fullrotl2 = vpermq(v)
    
        // d c b a | h g f e    // hilo = vperm2i128(v) or vpermq.  v:hilo has lots of redundancy.  The low half has all the information.
    
     /* which compare covers each element pair:
           v:lrotr1     v:frotl2     lrotr1:frotl2   hilo:lrotr1 / hilo:frotl2   v:hilo      (incomplete)
     * ab   [0]v:lrotr1                 [3]lr1:fl2
     * ac                  [2]v:frotl2
     * ad   [3]v:lrotr1                 [2]lr1:fl2
     * ae                                                                           [0,4]v:hilo
     * af                                           [4]hilo:lrotr1
     * ag                  [0]v:frotl2
     * ah                                           [3]hilo:lrotr1
     
     * bc   [1]v:lrotr1
     * bd                  [3]v:frotl2                               [5]hilo:frotl2
     * be                                           [0]hilo:lrotr1
     * bf                                                                           [1,5]v:hilo
     * bg                               [0]lr1:fl2  [5]hilo:lrotr1
     * bh                  [1]v:frotl2
    
     * cd   [2]v:lrotr1
     * ce                  [4]v:frotl2  [4]lr1:fl2
     * cf                                           [1]hilo:lrotr1
     * cg                                                                           [2,6]v:hilo
     * ch                               [1]lr1:fl2  [6]hilo:lrotr1
    
     * de                                           [7]hilo:lrotr1
     * df                  [5]v:frotl2                               [7]hilo:frotl2
     * dg                               [5]lr1:fl2  [2]hilo:lrotr1
     * dh                                                                           [3,7]v:hilo
    
     * ef   [4]v:lrotr1                 [7]lr1:fl2
     * eg                  [6]v:frotl2
     * eh   [7]v:lrotr1                 [6]lr1:fl2
    
     * fg   [5]v:lrotr1
     * fh                  [7]v:frotl2
    
     * gh   [6]v:lrotr1
    
     */
    

    It turns out that in-lane rotr1 == full rotl2 has a lot of redundancy, so it's not worth using. It also turns out that having all the allowed redundancy in v==hilo works fine.

    If you care about which result is in which element (rather than just checking for presence/absence), then v == swap_hilo(lrotr1) could work instead of lrotr1 == hilo. But we also need swap_hilo(v), so this would mean an extra shuffle.

    We could instead shuffle after hilo==lrotr1, for better ILP. Or maybe there's a different set of shuffles that gives us everything. Maybe if we consider VPERMD with a vector shuffle-control...


    Compiler asm output vs. optimal asm

    Haswell has only one shuffle unit (on port 5), so the three shuffles compete for that port.

    gcc6.3 -O3 -march=haswell produces:

       # assume ymm0 ready on cycle 0
        vpermq  ymm2, ymm0, 78     # hilo ready on cycle 3 (execution started on cycle 0)
        vpshufd ymm3, ymm0, 57     # lrotr1 ready on cycle 2  (started on cycle 1)
        vpermq  ymm1, ymm0, 147    # frotl2 ready on cycle 5  (started on 2)
        vpcmpeqd  ymm4, ymm2, ymm0  # starts on 3, ready on 4
        vpcmpeqd  ymm1, ymm1, ymm0  # starts on 5, ready on 6
        vpcmpeqd  ymm2, ymm2, ymm3  # starts on 3, ready on 4
        vpcmpeqd  ymm0, ymm0, ymm3  # starts on 2, ready on 3
        vpor    ymm1, ymm1, ymm4    # starts on 6, ready on 7
        vpor    ymm0, ymm0, ymm2    # starts on 4, ready on 5
        vpor    ymm0, ymm1, ymm0    # starts on 7, ready on 8
             # a different ordering of VPOR merging could have saved a cycle here.  /scold gcc
        vpmovmskb       eax, ymm0
        vzeroupper
        ret
    

    So the best-case latency is 8 cycles to have a single vector ready, given resource conflicts from other instructions in this sequence but assuming no conflicts with past instructions still in the pipeline. (It should have been 7 cycles, but gcc re-ordered the dependency structure of my intrinsics, making more of the merging dependent on the compare result of the last shuffle.)

    This is faster than Skylake-AVX512's vpconflictd ymm, which has 17c latency, one per 10c throughput. (Of course, that gives you much more information, and @harold's emulation of it takes many more instructions).

    Fortunately gcc didn't re-order the shuffles and introduce a potential write-back conflict. (e.g. putting the vpshufd last would mean that dispatching the shuffle uops to port5 in oldest-first order would have the vpshufd ready in the same cycle as the first vpermq (1c latency vs. 3c).) gcc did this for one version of the code (where I compared the wrong variable), so it seems that gcc -mtune=haswell doesn't take this into account. (Maybe it's not a big deal, I haven't measured to see what the real effect on latency is. I know the scheduler is smart about picking uops from the Reservation Station to avoid actual write-back conflicts, but IDK how smart it is, i.e. whether it would run the vpshufd ahead of a later vpermq to avoid a write-back conflict, since it would have to look-ahead to even see the upcoming writeback conflict. More likely it would just delay the vpshufd for an extra cycle before dispatching it.)

    Anyway, this is why I put the _mm256_shuffle_epi32 in the middle of the C source, where it makes things easy for OOO execution.

    Clang 4.0 goes berserk and packs each compare result down to 128b vectors (with vextracti128 / vpacksswb), then expands back to 256b after three vpor xmm before pmovmskb. I thought at first it was doing this because of -mtune=znver1, but it does it with -mtune=haswell as well. It does this even if we return a bool, which would let it just pmovmskb / test on the packed vector. /facepalm. It also pessimizes the hilo shuffle to vperm2i128, even with -mtune=znver1 (Zen 1), where vperm2i128 is 8 uops but vpermq is 3. (Agner Fog's insn tables for some reason missed those, so I took those numbers from the FP equivalents vperm2f128 and vpermpd.)

    @harold says that using add instead of or stops clang from packing/unpacking, but vpaddd has lower throughput than vpor on Intel pre-Skylake.
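
    For what it's worth, here's a sketch of that vpaddd-based merge (untested; tree-ordered rather than the serial chain used above). Since each compare result element is 0 or -1, the per-element sum of all four is zero only if every compare was false, so the presence/absence check still works:

    int any_conflicts32_add(__m256i v)
    {
        __m256i hilo         = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2));
        __m256i inlane_rotr1 = _mm256_shuffle_epi32(v, _MM_SHUFFLE(0,3,2,1));
        __m256i full_rotl2   = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(2,1,0,3));

        __m256i v_ir1    = _mm256_cmpeq_epi32(v, inlane_rotr1);
        __m256i v_hilo   = _mm256_cmpeq_epi32(v, hilo);
        __m256i hilo_ir1 = _mm256_cmpeq_epi32(hilo, inlane_rotr1);
        __m256i v_fl2    = _mm256_cmpeq_epi32(v, full_rotl2);

        // each element of the sum is in [-4, 0]; it's zero only if all four compares were false
        __m256i merged = _mm256_add_epi32(_mm256_add_epi32(v_ir1, v_hilo),
                                          _mm256_add_epi32(hilo_ir1, v_fl2));
        return _mm256_movemask_epi8(merged);   // non-zero iff any conflicts
    }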

    Even better for Zen 1, the v == hilo compare can do only the low half. (i.e. use vpcmpeqd xmm2, xmm2, xmm3, which is only 1 uop instead of 2). We still need the full hilo for hilo == lrot1, though. So we can't just use vextracti128 xmm2, xmm0, 1 instead of the vpermq shuffle. vextracti128 has excellent performance on Zen 1: 1 uop, 1c latency, 0.33c throughput (can run on any of P0/1/3).

    Since we're ORing everything together, it's fine to have zeros instead of redundant compare results in the high half.

    As I noted in comments, IDK how to safely write this with intrinsics. The obvious way would be to use _mm256_castsi128_si256 (_mm_cmpeq_epi32(v, hilo)), but that technically leaves the high lane undefined, rather than zero. There's no sane way a compiler would do anything other than use the full-width ymm register that contains the xmm register with the 128b compare result, but it would be legal according to Intel's docs for a Deathstation-9000 compiler to put garbage there. Any explicit way of getting zeros in the high half would depend on the compiler optimizing it away. Maybe _mm256_setr_m128i(cmpresult, _mm_setzero_si128());.
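
    Something along these lines might express it (untested sketch; the helper name is made up, and whether it really compiles down to a single 128-bit vpcmpeqd with the zeroing optimized away is up to the compiler). It would replace the full-width v_hilo compare in any_conflicts32() above; since hilo is the half-swapped copy of v, its low lane already holds v's high lane:

    static inline __m256i cmpeq_lowhalf_zerohigh(__m256i v, __m256i hilo)
    {
        __m128i cmp_lo = _mm_cmpeq_epi32(_mm256_castsi256_si128(v),
                                         _mm256_castsi256_si128(hilo));  // 128-bit vpcmpeqd: 1 uop on Zen 1
        return _mm256_setr_m128i(cmp_lo, _mm_setzero_si128());           // high half explicitly zero, safe to OR in
    }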


    There are no current CPUs with AVX512F but not AVX512CD. But if that combo is interesting or relevant, clang makes some interesting asm from my code with -mavx512f -mavx512vl. It uses EVEX vpcmpeqd into mask registers, and korw to merge them. But then it expands that back into a vector to set up for vpmovmaskb, instead of just optimizing away the movemask and using the korw result. /facepalm.
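
    With AVX512VL intrinsics one can write the compares straight into mask registers, so the movemask step disappears entirely. A minimal untested sketch for that AVX512F-without-CD case (needs -mavx512f -mavx512vl):

    int any_conflicts32_avx512vl(__m256i v)
    {
        __m256i hilo         = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2));
        __m256i inlane_rotr1 = _mm256_shuffle_epi32(v, _MM_SHUFFLE(0,3,2,1));
        __m256i full_rotl2   = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(2,1,0,3));

        __mmask8 m = _mm256_cmpeq_epi32_mask(v, inlane_rotr1)
                   | _mm256_cmpeq_epi32_mask(v, hilo)
                   | _mm256_cmpeq_epi32_mask(hilo, inlane_rotr1)
                   | _mm256_cmpeq_epi32_mask(v, full_rotl2);
        return m;   // non-zero iff any conflicts; bit positions aren't meaningful
    }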

    vpconflictd isn't particularly fast on Intel CPUs, though, like 23 uops for YMM in Alder Lake / Sapphire Rapids, 37 uops for ZMM. It is fast in Zen 4 and 5: 2 uops for YMM in Zen 4. https://uops.info/

    Fun fact: vp2intersect was Tiger Lake only and slow, but AMD brought it back for Zen 5 where it's fast. (http://users.atw.hu/instlatx64/AuthenticAMD/AuthenticAMD0B40F40_K20_GraniteRidge_02_512b_InstLatX64.txt)