I cannot see this in the Intel Intrinsics Guide, but perhaps I have missed it.
If I have two 512-bit registers a and b, I'd like to treat these as having four 128-bit elements and then perform:
a[0] + b[0]
a[1] + b[1]
a[2] + b[2]
a[3] + b[3]
Does such an AVX-512 instruction exist?
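For reference, the scalar version of what I'm after (using GCC/clang's unsigned __int128 extension, and assuming ordinary wrapping addition) would be something like:

void add_u128x4(unsigned __int128 dst[4],
                const unsigned __int128 a[4],
                const unsigned __int128 b[4])
{
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];   // four independent 128-bit adds, wrapping mod 2^128
}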
No, the widest element size for any math ops is 64-bit. AVX-512 has unsigned integer compares, so you could easily generate a mask of elements that had carry-out (carry = (a+b) < a).
Then maybe shift the mask (kshiftlb) and use it for a merge-masked add (or sub of -1) to increment the high 64 bits of elements where the low half had carry-out. (With _mm512_setr_epi64(0, -1, 0, -1, 0, -1, 0, -1), carry-out from high halves adds 0 to the low half of the next 128-bit element, so you don't have to worry about shifting carry across element boundaries.) set1(-1) is cheap to generate, but compilers might not be clever about materializing that 0, -1 pattern with vpternlogq / vpslldq zmm, zmm, 8, so you might just use add with a 0, 1 pattern.
Or maybe some other trickery, perhaps to generate a 0 or -1 in a vector reg and shuffle instead of going through mask regs? But saturating subtract is only available with 8- or 16-bit elements.
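For what it's worth, here's a sketch of how that vector-reg version might look (my own illustration, not necessarily a win: it's one instruction more than the mask-register version below, and vpmovm2q / _mm512_movm_epi64 needs AVX-512DQ):

#include <immintrin.h>
__m512i add_epi128_vecmask(__m512i a, __m512i b)
{
    __m512i sum = _mm512_add_epi64(a, b);
    __mmask8 carry = _mm512_cmplt_epu64_mask(sum, b);  // carry-out of each 64-bit add
    __m512i carry_vec = _mm512_movm_epi64(carry);      // vpmovm2q: 0 / -1 per qword
    // vpslldq by 8 bytes within each 128-bit lane: moves the low half's -1 up
    // into the high-half slot and throws away the high half's carry-out.
    __m512i propagate = _mm512_bslli_epi128(carry_vec, 8);
    return _mm512_sub_epi64(sum, propagate);           // subtracting -1 adds 1
}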
The key difference from a single BigInt add is that each carry-out has to propagate only one step, not ripple all the way to the end, so there isn't a long serial dependency.
#include <immintrin.h>
__m512i add_epi128(__m512i a, __m512i b)
{
__m512i sum = _mm512_add_epi64(a, b);
__mmask8 carry = _mm512_cmplt_epu64_mask(sum, b); // a or b doesn't matter, but compilers don't realize that.
// For the standalone function, using b lets them overwrite a in ZMM0
// Of course in reality you want this function to inline.
//carry = _kshiftli_mask8(carry, 1); // or actually just kaddb is more compact, but compilers miss the optimization
carry = _kadd_mask8(carry, carry);
//carry += carry;
const __m512i high_ones = _mm512_setr_epi64(0, -1, 0, -1, 0, -1, 0, -1);
// Carry-propagation into the high half of each u128, with merge-masking
sum = _mm512_mask_sub_epi64(sum, carry, sum, high_ones);
return sum;
}
kaddb and kshiftlb both have 4-cycle latency on Intel, 1 on Zen 4 (https://uops.info/), but kaddb is one byte shorter (no immediate). I had to use an intrinsic to get GCC and clang to emit it instead of kshiftlb. (Godbolt)
# clang 18 -O3 -march=x86-64-v4
.LCPI0_0:
.quad 0
.quad 1
...
add_epi128(long long vector[8], long long vector[8]):
vpaddq zmm0, zmm1, zmm0
vpcmpltuq k0, zmm0, zmm1
kaddb k1, k0, k0
vpaddq zmm0 {k1}, zmm0, zmmword ptr [rip + .LCPI0_0]
ret
Assuming this inlines and keeps the vector constant in a register, this is 4 uops for the front end and for the back-end execution units. And critical-path latency is about 10 cycles on Intel from when both vector inputs are ready. (1 cycle each for the vpaddq instructions, even with merge-masking. 4 or 3 for compare-into-mask, vs. 5 on Zen 4. 4 for kaddb on Intel, vs. 1 on Zen 4.)
If you're doing a lot of this in parallel, e.g. on an array, out-of-order exec should have little trouble finding instruction-level parallelism and keeping execution ports 0 and 5 busy every cycle, for 2-cycle throughput. But the latency sucks, so you'd want to use multiple accumulator vectors if you're e.g. summing an array.
Or in a case like that, accumulate the carry-out separately and add it back into the high half at the end so you don't have any long loop-carried dep chains.
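A rough sketch of that deferred-carry idea (my own illustration; it produces four per-lane partial sums, and the final horizontal combine into a single u128 is left out):

#include <immintrin.h>
#include <stddef.h>

// Sum an array of unsigned 128-bit values, four per ZMM vector, deferring
// carry propagation until after the loop. Both loop-carried dep chains
// (sum and carries) are a single 1-cycle add/sub per iteration.
__m512i sum_u128_lanes(const __m512i *src, size_t n_vecs)
{
    __m512i sum = _mm512_setzero_si512();
    __m512i carries = _mm512_setzero_si512();   // per-qword count of carry-outs
    for (size_t i = 0; i < n_vecs; i++) {
        __m512i v = _mm512_loadu_si512(&src[i]);
        __m512i newsum = _mm512_add_epi64(sum, v);
        __mmask8 c = _mm512_cmplt_epu64_mask(newsum, v);   // carry-out per qword
        carries = _mm512_mask_sub_epi64(carries, c, carries, _mm512_set1_epi64(-1)); // +1 where carry
        sum = newsum;
    }
    // Each low-half carry is worth 2^64: shift the low-half counts up into the
    // high-half slots (high-half carries are mod-2^128 overflow and get dropped)
    // and add them in once.
    return _mm512_add_epi64(sum, _mm512_bslli_epi128(carries, 8));
}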