Tags: c, performance, sorting, compare, sorting-network

Comparing two pairs of 4 variables and returning the number of matches?


Given the following struct:

#include <stdint.h>

struct four_points {
    uint32_t a, b, c, d;
};

What would be the absolute fastest way to compare two such structures and return the number of variables that match (in any position)?

For example:

four_points s1 = {0, 1, 2, 3};
four_points s2 = {1, 2, 3, 4};

I'd be looking for a result of 3, since three numbers match between the two structs. However, given the following:

four_points s1 = {1, 0, 2, 0};
four_points s2 = {0, 1, 9, 7};

Then I'd expect a result of only 2, because only two variables match between either struct (despite there being two zeros in the first).

I've figured out a few rudimentary systems for performing the comparison, but this is something that is going to be called a couple million times in a short time span and needs to be relatively quick. My current best attempt was to use a sorting network to sort all four values for either input, then loop over the sorted values and keep a tally of the values that are equal, advancing the current index of either input accordingly.
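
Roughly, that attempt looks like the following (a simplified sketch, using the four_points struct above; cswap, sort4 and match4_sorted are just placeholder names):

#include <stdint.h>

static inline void cswap(uint32_t *x, uint32_t *y) {
    if (*y < *x) { uint32_t t = *x; *x = *y; *y = t; }
}

static void sort4(uint32_t v[4]) {   // standard 5-comparator sorting network
    cswap(&v[0], &v[1]);  cswap(&v[2], &v[3]);
    cswap(&v[0], &v[2]);  cswap(&v[1], &v[3]);
    cswap(&v[1], &v[2]);
}

static int match4_sorted(const four_points *p1, const four_points *p2) {
    uint32_t v1[4] = { p1->a, p1->b, p1->c, p1->d };
    uint32_t v2[4] = { p2->a, p2->b, p2->c, p2->d };
    sort4(v1);
    sort4(v2);

    int matches = 0, i = 0, j = 0;
    while (i < 4 && j < 4) {
        if (v1[i] < v2[j]) {
            i++;
        } else if (v2[j] < v1[i]) {
            j++;
        } else {                    // equal: count the value once
            uint32_t v = v1[i];
            matches++;
            while (i < 4 && v1[i] == v) i++;   // skip duplicates in both inputs
            while (j < 4 && v2[j] == v) j++;
        }
    }
    return matches;
}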

Is there any kind of technique that might be able to perform better than a sort & iteration?


Solution

  • On modern CPUs, sometimes brute force applied properly is the way to go. The trick is writing code that isn't limited by instruction latencies, just throughput.


    Are duplicates common? If they're very rare, or have a pattern, using a branch to handle them makes the common case faster. If they're really unpredictable, it's better to do something branchless. I was thinking about using a branch to check for duplicates between the positions where they're rare, and going branchless where they're more common.

    Benchmarking is tricky because a version with branches will shine when tested with the same data a million times, but will have lots of branch mispredicts in real use.


    I haven't benchmarked anything yet, but I have come up with a version that skips duplicates by using OR instead of addition to combine found-matches. It compiles to nice-looking x86 asm that gcc fully unrolls. (no conditional branches, not even loops).

    Here it is on godbolt. (g++ is dumb and uses 32bit operations on the output of x86 setcc, which only sets the low 8 bits. This partial-register access will produce slowdowns. And I'm not even sure it ever zeroes the upper 24bits at all... Anyway, the code from gcc 4.9.2 looks good, and so does clang on godbolt)

    // 8-bit types used because x86's setcc instruction only sets the low 8 of a register
    // leaving the other bits unmodified.
    // Doing a 32bit add from that creates a partial register slowdown on Intel P6 and Sandybridge CPU families
    // Also, compilers like to insert movzx (zero-extend) instructions
    // because I guess they don't realize the previous high bits are all zero.
    // (Or they're tuning for pre-sandybridge Intel, where the stall is worse than SnB inserting the extra uop itself).
    
    // The return type is 8bit because otherwise clang decides it should generate
    // things as 32bit in the first place, and does zero-extension -> 32bit adds.
    int8_t match4_ordups(const four_points *s1struct, const four_points *s2struct)
    {
        const int32_t *s1 = &s1struct->a; // TODO: check if this breaks aliasing rules
        const int32_t *s2 = &s2struct->a;
        // ignore duplicates by combining with OR instead of addition
        int8_t matches = 0;
    
        for (int j=0 ; j<4 ; j++) {
            matches |= (s1[0] == s2[j]);
        }
    
        for (int i=1; i<4; i++) { // i=0 iteration is broken out above
            uint32_t s1i = s1[i];
    
            int8_t notdup = 1; // stays 1 only if s1[i] is NOT a duplicate of s1[0..i-1]
            for (int j=0 ; j<i ; j++) {
                notdup &= (uint8_t) (s1i != s1[j]);  // like dup |= (s1i == s1[j]); but saves a NOT
            }
    
            int8_t mi = // match this iteration?
                (s1i == s2[0]) |
                (s1i == s2[1]) |
                (s1i == s2[2]) |
                (s1i == s2[3]);
        // gcc and clang insist on doing 3 dependent OR insns regardless of parens, not that it matters
    
            matches += mi & notdup;
        }
        return matches;
    }
    
    // see the godbolt link for a main() simple test harness.
    

    On a machine with 128b vectors that can work with 4 packed 32bit integers (e.g. x86 with SSE2), you can broadcast each element of s1 to its own vector, deduplicate, and then do 4 packed-compares. icc does something like this to autovectorize my match4_ordups function (check it out on godbolt.)

    Store the compare results back to integer registers with movemask, to get a bitmap of which elements compared equal. Popcount those bitmaps, and add the results.
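
    Something like this, for instance (an untested sketch in the same spirit; match4_broadcast is just my name for it, it uses the same four_points struct as the rest of this answer, and it handles duplicates in s1 by ORing the per-element compare results together rather than the masking icc generates):

    #include <immintrin.h>

    // Broadcast each element of s1, compare it against all of s2, and OR the
    // results so duplicates within s1 only count once.  Needs SSE2 + popcnt.
    static inline int match4_broadcast(const four_points *s1p, const four_points *s2p)
    {
        __m128i s1 = _mm_loadu_si128((const __m128i*)s1p);
        __m128i s2 = _mm_loadu_si128((const __m128i*)s2p);

        __m128i a = _mm_shuffle_epi32(s1, _MM_SHUFFLE(0, 0, 0, 0));  // broadcast s1.a
        __m128i b = _mm_shuffle_epi32(s1, _MM_SHUFFLE(1, 1, 1, 1));  // broadcast s1.b
        __m128i c = _mm_shuffle_epi32(s1, _MM_SHUFFLE(2, 2, 2, 2));  // broadcast s1.c
        __m128i d = _mm_shuffle_epi32(s1, _MM_SHUFFLE(3, 3, 3, 3));  // broadcast s1.d

        __m128i m = _mm_cmpeq_epi32(a, s2);
        m = _mm_or_si128(m, _mm_cmpeq_epi32(b, s2));
        m = _mm_or_si128(m, _mm_cmpeq_epi32(c, s2));
        m = _mm_or_si128(m, _mm_cmpeq_epi32(d, s2));
        // m now says, per position, whether that s2 element matches anything in s1

        int matchmask = _mm_movemask_ps(_mm_castsi128_ps(m));
        return _mm_popcnt_u32(matchmask);   // note: still ignores duplicates within s2
    }

    That's 4 shuffles plus 4 compares. The rotation trick below needs one less shuffle, because the unshuffled copy of s1 can be compared against s2 directly.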


    This led me to a better idea: getting all the compares done with only 3 shuffles, using element-wise rotation:

    { 1d 1c 1b 1a }
      == == == ==   packed-compare with
    { 2d 2c 2b 2a }
    
    { 1a 1d 1c 1b }
      == == == ==   packed-compare with
    { 2d 2c 2b 2a }
    
    { 1b 1a 1d 1c }  # if dups didn't matter: do this shuffle on s2
      == == == ==   packed-compare with
    { 2d 2c 2b 2a }
    
    { 1c 1b 1a 1d } # if dups didn't matter: this result from { 1a ... }
      == == == ==   packed-compare with
    { 2d 2c 2b 2a }                                           { 2b ...
    

    That's only 3 shuffles, and still does all 16 comparisons. The trick is combining them with ORs where we need to merge duplicates, and then being able to count them efficiently. A packed-compare outputs a vector with each element = zero or -1 (all bits set), based on the comparison between the two elements in that position. It's designed to make a useful operand to AND or XOR to mask off some vector elements, e.g. to make v1 += v2 & mask conditional on a per-element basis. It also works as just a boolean truth value.
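
    In code, that masking pattern looks something like this (illustrative only, not part of the match function; add_where_equal is a made-up name):

    #include <immintrin.h>

    // Adds v2[i] into v1[i] only in the positions where x[i] == y[i].
    static inline __m128i add_where_equal(__m128i v1, __m128i v2, __m128i x, __m128i y)
    {
        __m128i mask = _mm_cmpeq_epi32(x, y);               // -1 (all ones) or 0 per element
        return _mm_add_epi32(v1, _mm_and_si128(v2, mask));  // v1 += v2 & mask, per element
    }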

    All 16 compares with only 2 shuffles is possible by rotating one vector by two, and the other vector by one, and then comparing between the four shifted and unshifted vectors. This would be great if we didn't need to eliminate dups, but since we do, it matters which results are where. We're not just adding all 16 comparison results.

    OR together the packed-compare results down to one vector. Each element will be set based on whether that element of s2 had any matches in s1. Use _mm_movemask_ps (on the result cast to __m128) to turn the vector into a bitmap, then popcount the bitmap. (Nehalem or newer CPU required for popcnt, otherwise fall back to a version with a 4-bit lookup table.)
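
    That fallback can be as simple as indexing a 16-entry table with the 4-bit mask (popcount4 is just an illustrative name):

    #include <stdint.h>

    // Popcount of a 4-bit movemask result, for SSE2-only CPUs without popcnt.
    static inline int popcount4(unsigned mask4)
    {
        static const uint8_t table[16] = { 0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4 };
        return table[mask4 & 0xF];
    }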

    The vertical ORs take care of duplicates in s1, but duplicates in s2 are a less obvious extension and would take more work. I did eventually think of a way that was less than twice as slow (see below).

    #include <stdint.h>
    #include <immintrin.h>
    
    typedef struct four_points {
        int32_t a, b, c, d;
    } four_points;
    //typedef uint32_t four_points[4];
    
    // small enough to inline, only 62B of x86 instructions (gcc 4.9.2)
    static inline int match4_sse_noS2dup(const four_points *s1pointer, const four_points *s2pointer)
    {
        __m128i s1 = _mm_loadu_si128((__m128i*)s1pointer);
        __m128i s2 = _mm_loadu_si128((__m128i*)s2pointer);
        __m128i s1b= _mm_shuffle_epi32(s1, _MM_SHUFFLE(0, 3, 2, 1));
        // no shuffle needed for first compare
        __m128i match = _mm_cmpeq_epi32(s1 , s2);  //{s1.d==s2.d?-1:0, 1c==2c, 1b==2b, 1a==2a }
        __m128i s1c= _mm_shuffle_epi32(s1, _MM_SHUFFLE(1, 0, 3, 2));
        s1b = _mm_cmpeq_epi32(s1b, s2);
        match = _mm_or_si128(match, s1b);  // merge dups by ORing instead of adding
    
        // note that we shuffle the original vector every time
        // multiple short dependency chains are better than one long one.
        __m128i s1d= _mm_shuffle_epi32(s1, _MM_SHUFFLE(2, 1, 0, 3));
        s1c = _mm_cmpeq_epi32(s1c, s2);
        match = _mm_or_si128(match, s1c);
        s1d = _mm_cmpeq_epi32(s1d, s2);
    
        match = _mm_or_si128(match, s1d);    // match = { s2.a in s1?,  s2.b in s1?, etc. }
    
        // turn the high bit of each 32bit element into a bitmap of s2 elements that have matches anywhere in s1
        // use float movemask because integer movemask does 8bit elements.
        int matchmask = _mm_movemask_ps (_mm_castsi128_ps(match));
    
        return _mm_popcnt_u32(matchmask);  // or use a 4b lookup table for CPUs with SSE2 but not popcnt
    }
    

    See the version that eliminates duplicates in s2 for the same code with lines in a more readable order. I tried to schedule instructions in case the CPU was only just barely decoding instructions ahead of what was executing, but gcc puts the instructions in the same order regardless of what order you put the intrinsics in.

    This is extremely fast, if there isn't a store-forwarding stall in the 128b loads. If you just wrote the struct with four 32bit stores, running this function within the next several clock cycles will produce a stall when it tries to load the whole struct with a 128b load. See Agner Fog's site. If calling code already has many of the 8 values in registers, the scalar version could be a win, even though it'll be slower for a microbenchmark test that only reads the structs from memory.

    I got lazy on cycle-counting for this, since dup-handling isn't done yet. IACA says Haswell can run it with a throughput of one iteration per 4.05 clock cycles, and latency of 17 cycles (Not sure if that's including the memory latency of the loads. There's a lot of instruction-level parallelism available, and all the instructions have single-cycle latency, except for movmsk(2) and popcnt(3)). It's slightly slower without AVX, because gcc chooses a worse instruction ordering, and still wastes a movdqa instruction copying a vector register.

    With AVX2, this could do two match4 operations in parallel, in 256b vectors. AVX2 usually works as two 128b lanes rather than full 256b vectors. Setting up your code to be able to take advantage of 2 or 4 (AVX-512) match4 operations in parallel will give you gains when you can compile for those CPUs. It's not essential for the s1s or the s2s to be stored contiguously so a single 32B load can get two structs; AVX2 has a fairly fast load of 128b into the upper lane of a register.
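
    A rough sketch of what that could look like (untested; match4x2_avx2 is my name, and it assumes for simplicity that the two s1 structs, and the two s2 structs, are adjacent so one 32B load grabs both; otherwise build the __m256i with _mm256_inserti128_si256. Like match4_sse_noS2dup, it doesn't handle duplicates in s2):

    #include <immintrin.h>

    // Two independent match4 problems, one per 128-bit lane.
    static inline void match4x2_avx2(const four_points *s1pair, const four_points *s2pair,
                                     int out[2])
    {
        __m256i s1 = _mm256_loadu_si256((const __m256i*)s1pair);  // two s1 structs
        __m256i s2 = _mm256_loadu_si256((const __m256i*)s2pair);  // two s2 structs

        // _mm256_shuffle_epi32 shuffles within each 128b lane, which is exactly
        // the per-struct rotation the 128b version uses.
        __m256i s1b = _mm256_shuffle_epi32(s1, _MM_SHUFFLE(0, 3, 2, 1));
        __m256i s1c = _mm256_shuffle_epi32(s1, _MM_SHUFFLE(1, 0, 3, 2));
        __m256i s1d = _mm256_shuffle_epi32(s1, _MM_SHUFFLE(2, 1, 0, 3));

        __m256i match = _mm256_cmpeq_epi32(s1, s2);
        match = _mm256_or_si256(match, _mm256_cmpeq_epi32(s1b, s2));
        match = _mm256_or_si256(match, _mm256_cmpeq_epi32(s1c, s2));
        match = _mm256_or_si256(match, _mm256_cmpeq_epi32(s1d, s2));

        int mask = _mm256_movemask_ps(_mm256_castsi256_ps(match)); // low lane in bits 0-3
        out[0] = _mm_popcnt_u32(mask & 0xF);
        out[1] = _mm_popcnt_u32((unsigned)mask >> 4);
    }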


    Handling duplicates in s2

    Maybe compare s2 to a shifted instead of rotated version of itself.

    #### comparing S2 with itself to mask off duplicates
    {  0 2d 2c 2b }
    { 2d 2c 2b 2a }     == == ==
    
    {  0  0 2d 2c }
    { 2d 2c 2b 2a }        == ==
    
    {  0  0  0 2d }
    { 2d 2c 2b 2a }           ==
    

    Hmm, if zero can occur as a regular element, we may need to byte-shift after the compare as well, to turn potential false positives into zeros. If there was a sentinel value that couldn't appear in s1, you could shift in elements of that, instead of 0. (SSE has PALIGNR, which gives you any contiguous 16B window you want of the contents of two registers appended. Named for the use-case of simulating an unaligned load from two aligned loads. So you'd have a constant vector of that element.)


    update: I thought of a nice trick that avoids the need for an identity element. We can actually get all 6 necessary s2 vs. s2 comparisons to happen with just two vector compares, and then combine the results.

    Checking for duplicates is a big slowdown (esp. in throughput, not so much in latency). So you're still best off arranging for a sentinel value in s2 that will never match any s1 element, which you say is possible. I only present this because I thought it was interesting. (And gives you an option in case you need a version that doesn't require sentinels sometime.)

    static inline
    int match4_sse(const four_points *s1pointer, const four_points *s2pointer)
    {
        // IACA_START
        __m128i s1 = _mm_loadu_si128((__m128i*)s1pointer);
        __m128i s2 = _mm_loadu_si128((__m128i*)s2pointer);
        // s1a = unshuffled = s1.a in the low element
        __m128i s1b= _mm_shuffle_epi32(s1, _MM_SHUFFLE(0, 3, 2, 1));
        __m128i s1c= _mm_shuffle_epi32(s1, _MM_SHUFFLE(1, 0, 3, 2));
        __m128i s1d= _mm_shuffle_epi32(s1, _MM_SHUFFLE(2, 1, 0, 3));
    
        __m128i match = _mm_cmpeq_epi32(s1 , s2);  //{s1.d==s2.d?-1:0, 1c==2c, 1b==2b, 1a==2a }
        s1b = _mm_cmpeq_epi32(s1b, s2);
        match = _mm_or_si128(match, s1b);  // merge dups by ORing instead of adding
    
        s1c = _mm_cmpeq_epi32(s1c, s2);
        match = _mm_or_si128(match, s1c);
        s1d = _mm_cmpeq_epi32(s1d, s2);
        match = _mm_or_si128(match, s1d);
        // match = { s2.a in s1?,  s2.b in s1?, etc. }
    
        // s1 vs s2 all done, now prepare a mask for it based on s2 dups
    
    /*
     * d==b   c==a   b==a  d==a   #s2b
     * d==c   c==b   b==a  d==a   #s2c
     *    OR together -> s2bc
     *  d==abc     c==ba    b==a    0  pshufb(s2bc) (packed as zero or non-zero bytes within each element)
     * !(d==abc) !(c==ba) !(b==a)  !0   pcmpeq setzero -> AND mask for s1_vs_s2 match
     */
        __m128i s2b = _mm_shuffle_epi32(s2, _MM_SHUFFLE(1, 0, 0, 3));
        __m128i s2c = _mm_shuffle_epi32(s2, _MM_SHUFFLE(2, 1, 0, 3));
        s2b = _mm_cmpeq_epi32(s2b, s2);
        s2c = _mm_cmpeq_epi32(s2c, s2);
    
        __m128i s2bc= _mm_or_si128(s2b, s2c);
        s2bc = _mm_shuffle_epi8(s2bc, _mm_set_epi8(-1,-1,0,12,  -1,-1,-1,8, -1,-1,-1,4,  -1,-1,-1,-1));
        __m128i dupmask = _mm_cmpeq_epi32(s2bc, _mm_setzero_si128());
        // see below for alternate insn sequences that can go here.
    
        match = _mm_and_si128(match, dupmask);
        // turn the high bit of each 32bit element into a bitmap of s2 matches
        // use float movemask because integer movemask does 8bit elements.
        int matchmask = _mm_movemask_ps (_mm_castsi128_ps(match));
    
        int ret = _mm_popcnt_u32(matchmask);  // or use a 4b lookup table for CPUs with SSE2 but not popcnt
        // IACA_END
        return ret;
    }
    

    This requires SSSE3 for pshufb. It and a pcmpeq (and a pxor to generate a constant) are replacing a shuffle (bslli(s2bc, 12)), an OR, and an AND.

    d==bc  c==ab b==a a==d = s2b|s2c
    d==a   0     0    0    = byte-shift-left(s2b) = s2d0
    d==abc c==ab b==a a==d = s2abc
    d==abc c==ab b==a 0    = mask(s2abc).  Maybe use PBLENDW or MOVSS from s2d0 (which we know has zeros) to save loading a 16B mask.
    
    __m128i s2abcd = _mm_or_si128(s2b, s2c);
    //s2bc = _mm_shuffle_epi8(s2bc, _mm_set_epi8(-1,-1,0,12,  -1,-1,-1,8, -1,-1,-1,4,  -1,-1,-1,-1));
    //__m128i dupmask = _mm_cmpeq_epi32(s2bc, _mm_setzero_si128());
    __m128i s2d0 = _mm_bslli_si128(s2b, 12);  // d==a  0  0  0
    s2abcd = _mm_or_si128(s2abcd, s2d0);
    __m128i dupmask = _mm_blend_epi16(s2abcd, s2d0, 0 | (2 | 1));
    //__m128i dupmask = _mm_and_si128(s2abcd, _mm_set_epi32(-1, -1, -1, 0));
    
    match = _mm_andnot_si128(dupmask, match);  // ~dupmask & match;  first arg is the one that's inverted
    

    I can't recommend MOVSS; it will incur extra latency on AMD because it runs in the FP domain. PBLENDW is SSE4.1. popcnt is available on AMD K10, but PBLENDW isn't (some Barcelona-core PhenomII CPUs are probably still in use). Actually, K10 doesn't have PSHUFB either, so just require SSE4.1 and POPCNT, and use PBLENDW. (Or use the PSHUFB version, unless it's going to cache-miss a lot.)

    Another option to avoid loading a vector constant from memory is to movemask s2bc and use integer ops instead of vector ops. However, it looks like that'll be slower, because the extra movemask isn't free and integer ANDN isn't usable: BMI1 didn't appear until Haswell, and even Skylake Celerons and Pentiums won't have it. (Very annoying, IMO. It means compilers can't start using BMI for even longer.)

    unsigned int dupmask = _mm_movemask_ps(_mm_castsi128_ps(s2bc));
    dupmask |= dupmask << 3;  // bit3 = d==abc.  garbage in bits 4-6, careful if using AVX2 to do two structs at once
            // only 2 instructions.  compiler can use lea r2, [r1*8] to copy and scale
    dupmask &= ~1;  // clear the low bit
    
    unsigned int matchmask = _mm_movemask_ps(_mm_castsi128_ps(match));
    matchmask &= ~dupmask;   // ANDN is in BMI1 (Haswell), so this will take 2 instructions
    return _mm_popcnt_u32(matchmask);
    

    AMD XOP's VPPERM (pick bytes from any element of two source registers) would let the byte-shuffle replace the OR that merges s2b and s2c as well.

    Hmm, pshufb isn't saving me as much as I thought, because it requires a pcmpeqd, and a pxor to zero a register. It's also loading its shuffle mask from a constant in memory, which can miss in the D-cache. It is the fastest version I've come up with, though.

    If inlined into a loop, the same zeroed register could be used, saving one instruction. However, OR and AND can run on port0 (Intel CPUs), which can't run shuffle or compare instructions. The PXOR doesn't use any execution ports, though (on Intel SnB-family microarchitecture).

    I haven't run real benchmarks of any of these, only IACA.

    The PBLENDW and PSHUFB versions have the same latency (22 cycles, compiled for non-AVX), but the PSHUFB version has better throughput (one per 7.1c, vs. one per 7.4c, because PBLENDW needs the shuffle port, and there's already a lot of contention for it.) IACA says the version using PANDN with a constant instead of PBLENDW is also one-per-7.4c throughput, disappointingly. Port0 isn't saturated, so IDK why it's as slow as PBLENDW.


    Old ideas that didn't pan out

    Leaving them in for the benefit of people looking for things to try when using vectors for related things.

    Dup-checking s2 with vectors is more work than checking s2 vs. s1, because one compare is as expensive as 4 if done with vectors. The shuffling or masking needed after the compare, to remove false positives if there's no sentinel value, is annoying.

    Ideas so far: