How do you check if an aligned chunk of 16 u32s is consecutive (and increasing)?
For example, [100, 101, 102, ..., 115] is, and [100, 99, 3, ...] is not.
I'm on AVX-512F. This is what I have so far:
Algo A:
* predefine DECREASE_U32, a u32x16 of [15,14,13,...0]
* let a = input + DECREASE_U32 // wrapping is OK
* compare a to u32x16::splat(first_item(a))
* Return whether all true
Alternative (Algo B):
* let b = a copy of the input
* rotate the elements of b by one position
* let b = input - b
* check whether b is all 1's (except for the 1st position)
I'm doing this in Rust with the packed_simd crate, but any language/pseudocode is fine. (I wish there were a SIMD operation to subtract adjacent items.)
I think your first idea is probably best if done inside a loop that can amortize the cost of loading a vector constant. AVX-512 can do that efficiently.
Either with a vector load and then separately broadcasting the low element with vpbroadcastd, or with a vector load and a broadcast-load, e.g. vpaddd zmm16, zmm31, [rdi]{1to16} / vpcmpeqd k1, zmm16, [rdi].
Hmm, but then checking for all elements being true: I guess perhaps kaddw with a constant 1 and check that the low 16 bits are zero with kortest? Or just kmov to an integer register for a compare against 0xffff, like we'd do with SSE/AVX pmovmskb. I tried that, and clang had a better idea: compare for not-equal, and check that the mask is all zero. (i.e. check that every element is equal by checking that they aren't not-equal.) That allows kortest on the mask itself. I applied clang's idea to my intrinsics so GCC could make better asm as well.
In C++:
#include <immintrin.h>
// compare for not-equal, checking the mask for 0
bool check_contig(int *p)
{
    __m512i bcast_first = _mm512_set1_epi32(*p);           // broadcast the first element
    __m512i desired = _mm512_add_epi32(bcast_first,        // first, first+1, ..., first+15
                          _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15));
    __m512i v = _mm512_loadu_si512(p);
    __mmask16 cmp = _mm512_cmpneq_epi32_mask(desired, v);  // one mask bit per mismatched element
    return cmp == 0;
}
Godbolt - asm from GCC and clang:
# GCC
check_contig(int*):
        vmovdqa32 zmm0, ZMMWORD PTR .LC0[rip]
        vpaddd zmm0, zmm0, DWORD PTR [rdi]{1to16}
        vpcmpd k0, zmm0, ZMMWORD PTR [rdi], 4
        kortestw k0, k0
        sete al
        vzeroupper
        ret
# clang
check_contig(int*):
        vpbroadcastd zmm0, dword ptr [rdi]
        vpaddd zmm0, zmm0, zmmword ptr [rip + .LCPI0_0]
        vpcmpneqd k0, zmm0, zmmword ptr [rdi]
        kortestw k0, k0
        sete al
        vzeroupper
        ret
So they both choose to load twice instead of vpbroadcastd zmm1, xmm0, at least when not in a loop, so the vector constant also has to get loaded from .rodata.
Perhaps if I wrote it differently, as _mm512_broadcastd_epi32(_mm512_castsi512_si128(v)), they'd prefer one load, at the cost of an extra shuffle uop. (Which is probably worse when you have 512-bit uops in flight, so Intel CPUs shut down the vector ALU on port 1, leaving only ports 0 and 5. https://agner.org/optimize/ and https://uops.info/)
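For reference, a minimal sketch of that single-load variant (my own illustration, not compiler output; the name check_contig_oneload is made up):
#include <immintrin.h>
// Illustrative single-load variant: reuse the already-loaded vector and broadcast
// its low element in-register (vpbroadcastd zmm, xmm) instead of reloading from memory.
bool check_contig_oneload(int *p)
{
    __m512i v = _mm512_loadu_si512(p);
    __m512i bcast_first = _mm512_broadcastd_epi32(_mm512_castsi512_si128(v));
    __m512i desired = _mm512_add_epi32(bcast_first,
                          _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15));
    __mmask16 cmp = _mm512_cmpneq_epi32_mask(desired, v);
    return cmp == 0;
}
Whether that's a win depends on whether the saved load is worth the extra shuffle uop, as noted above.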
Maybe your second way could also be done efficiently with valignd to rotate the vector; the only vector constant it needs is all-ones, which can be generated somewhat more cheaply (vpternlogd) instead of loaded.
Checking the compare-mask would probably require a kmov to integer for an and + cmp to check all but one bit, unless we can use the same trick clang did and arrange things so we actually want the mask to be all-zero in the places we want. In that case, test eax, imm32 can check the bits we want while ignoring the one we don't.
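A rough sketch of that rotate-and-subtract version (my own adaptation, untested; the name check_contig_rotate is made up, and with this rotation direction the junk lane ends up last rather than first): subtracting each element's successor gives -1 in every lane except the wraparound lane, so the comparison constant is the cheap all-ones vector and the wraparound lane is simply not tested in the mask.
#include <immintrin.h>
// Illustrative sketch: rotate with valignd so each lane holds its successor, subtract,
// and expect -1 everywhere except lane 15 (the wraparound), which we mask off.
bool check_contig_rotate(int *p)
{
    __m512i v = _mm512_loadu_si512(p);
    __m512i rot = _mm512_alignr_epi32(v, v, 1);            // rot[i] = v[(i+1) % 16]  (valignd)
    __m512i diff = _mm512_sub_epi32(v, rot);               // v[i] - v[i+1], should be -1 if increasing by 1
    __m512i all_ones = _mm512_set1_epi32(-1);              // compilers can materialize this with vpternlogd
    __mmask16 neq = _mm512_cmpneq_epi32_mask(diff, all_ones);
    return (neq & 0x7FFF) == 0;                            // ignore bit 15, the wraparound lane
}
The (neq & 0x7FFF) == 0 check is the kmov + test eax, imm32 idea: the lanes we care about must all have their not-equal bits clear, and the wraparound lane's bit is ignored.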