rust, simd, avx, avx512

SIMD algorithm to check if an integer block is "consecutive."


How do you check if an aligned chunk of 16 u32s is consecutive (and increasing)?

For example: [100, 101, 102, ..., 115] is. And [100, 99, 3, ...] is not.

I'm on AVX512f. This is what I have so far:

Algo A:

* predefine DECREASE_U32, a u32x16 of [15,14,13,...,0]
* let a = input + DECREASE_U32 // wrapping is OK
* compare a to u32x16::splat(first_item(a))
* Return whether all true

Alternative (Algo B):

* let b = copy of a
* rotate the elements of b by one position
* let b = a - b
* Is b all 1s (except for the 1st position)?

I'm doing this in Rust with the packed_simd crate, but any language/pseudocode is fine. (I wish there were a SIMD operation to subtract adjacent items.)


Solution

  • I think your first idea is probably best if done inside a loop that can amortize the cost of loading a vector constant. AVX-512 can do that efficiently.

    Either with a vector load and then separately broadcast the low element with vpbroadcastd, or with a vector load and a broadcast-load. e.g. vpaddd zmm16, zmm31, [rdi]{1to16} / vpcmpeqd k1, zmm16, [rdi].

    Hmm, but then checking for all elements being true, I guess perhaps kaddw with a constant 1 and check that the low 16 bits are zero with kortest? Or just kmov to an integer register for a compare against 0xffff like we'd do with SSE/AVX pmovmskb. I tried that, and clang had a better idea: compare for not-equal, and check that the mask is all zero. (i.e. check that every element is equal by checking that they aren't not-equal.) That allows kortest on the mask itself. I applied clang's idea to my intrinsics so GCC could make better asm as well.

    In C++:

    #include <immintrin.h>
    
    // compare for not-equal, checking the mask for 0
    bool check_contig(int *p)
    {
        __m512i bcast_first = _mm512_set1_epi32(*p);
        __m512i desired = _mm512_add_epi32(bcast_first, _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15));  // desired[i] = p[0] + i
    
        __m512i v = _mm512_loadu_si512(p);
        __mmask16 cmp = _mm512_cmpneq_epi32_mask(desired, v);
        return cmp == 0;
    }
    

    Godbolt - asm from GCC and clang:

    # GCC
    check_contig(int*):
            vmovdqa32       zmm0, ZMMWORD PTR .LC0[rip]
            vpaddd  zmm0, zmm0, DWORD PTR [rdi]{1to16}
            vpcmpd  k0, zmm0, ZMMWORD PTR [rdi], 4
            kortestw        k0, k0
            sete    al
            vzeroupper
            ret
    
    # clang
    check_contig(int*):
            vpbroadcastd    zmm0, dword ptr [rdi]
            vpaddd  zmm0, zmm0, zmmword ptr [rip + .LCPI0_0]
            vpcmpneqd       k0, zmm0, zmmword ptr [rdi]
            kortestw        k0, k0
            sete    al
            vzeroupper
            ret
    

    So they both choose to load from memory twice instead of using vpbroadcastd zmm1, xmm0, at least when not in a loop, where the vector constant also has to get loaded from .rodata.

    Perhaps if I wrote it differently, as _mm512_broadcastd_epi32( _mm512_castsi512_si128(v)), they'd prefer one load, at the cost of an extra shuffle uop. (That's probably worse when you have 512-bit uops in flight, since Intel CPUs then shut down the vector ALU on port 1, leaving only ports 0 and 5; see https://agner.org/optimize/ and https://uops.info/.)
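
    Something like this, for example (a sketch only; the name check_contig_oneload is mine, and it assumes the same includes/file as check_contig above):

    // One vector load; broadcast the low element of that vector with an
    // extra shuffle (vpbroadcastd zmm, xmm) instead of a second memory access.
    bool check_contig_oneload(int *p)
    {
        __m512i v = _mm512_loadu_si512(p);
        __m512i bcast_first = _mm512_broadcastd_epi32(_mm512_castsi512_si128(v));
        __m512i desired = _mm512_add_epi32(bcast_first,
                              _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15));
        __mmask16 cmp = _mm512_cmpneq_epi32_mask(desired, v);
        return cmp == 0;    // no element compared not-equal => block is consecutive
    }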


    Algo B - avoiding a non-trivial vector constant

    Maybe your second way could also be done efficiently with valignd to rotate the vector; the only vector constant it needs is all-ones, which can be generated somewhat more cheaply (vpternlogd) rather than loaded.

    Checking the compare-mask would probably require a kmov to an integer register for an and + cmp to check all but one bit, unless we can use the same trick clang did and arrange things so that the mask should be all-zero in the lanes we care about. In that case, test eax, imm32 can check the bits we want while ignoring the one we don't.
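
    As a rough sketch of that approach (the function name check_contig_rotate and the 0xFFFE mask constant are my own choices, not from compiler output), using valignd via _mm512_alignr_epi32:

    #include <immintrin.h>

    // Algo B sketch: rotate by one lane, subtract, and require every adjacent
    // difference (except the meaningless wrap-around one in lane 0) to be 1.
    bool check_contig_rotate(int *p)
    {
        __m512i v = _mm512_loadu_si512(p);
        // valignd zmm, zmm, zmm, 15: rotated[i] = v[i-1], lane 0 wraps to v[15]
        __m512i rotated = _mm512_alignr_epi32(v, v, 15);
        // For a consecutive increasing block, rotated - v == -1 in lanes 1..15.
        // All-ones (-1) is the constant a compiler can materialize with vpternlogd.
        __m512i diff = _mm512_sub_epi32(rotated, v);
        __mmask16 neq = _mm512_cmpneq_epi32_mask(diff, _mm512_set1_epi32(-1));
        // Ignore mask bit 0 (the wrap-around lane); everything else must have
        // compared equal, i.e. the not-equal mask must be zero in bits 1..15.
        return (neq & 0xFFFE) == 0;    // e.g. kmov to eax, then test eax, 0xFFFE
    }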