optimization, x86, intel, simd, avx512

AVX-512 intrinsics to add 512-bit vectors of 128-bit elements


I cannot see this in the Intel Intrinsics Guide, but perhaps I have missed it.

If I have two 512-bit registers a and b, I'd like to treat these as having four 128-bit elements and then perform:

a[0] + b[0]

a[1] + b[1]

a[2] + b[2]

a[3] + b[3]

Does such an AVX-512 instruction exist?


Solution

  • No, the widest element size for any math ops is 64-bit. But AVX-512 has unsigned integer compares, so you can easily generate a mask of elements that had carry-out (carry = (a+b) < a).

    Then maybe shift the mask (kshiftlb) and use it for a merge-masked add (or sub of -1) to increment the high 64 bits of elements whose low half had carry-out. (Use _mm512_setr_epi64(0, -1, 0, -1, 0, -1, 0, -1) so that carry-out from a high half adds 0 to the low half of the next 128-bit element; that way you don't have to worry about a carry crossing an element boundary.) set1(-1) is cheap to generate, but compilers might not be clever enough to build that 0, -1 pattern with vpternlogq / vpslldq zmm, zmm, 8, so you might just use add with a 0, 1 pattern instead.

    Or maybe some other trickery, perhaps to generate a 0 or -1 in a vector reg and shuffle instead of going through mask regs? But saturating subtract is only available with 8 or 16-bit elements.
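
    For example, keeping the carries in a vector reg after the compare (a hypothetical add_epi128_vec, untested sketch assuming AVX-512DQ and AVX-512BW for _mm512_movm_epi64 / _mm512_bslli_epi128): vpmovm2q expands the mask to 0 / -1 qwords, and a byte shift within each 128-bit lane lines each low-half carry up with its own high half.

    #include <immintrin.h>
    
    // Hypothetical all-vector variant (untested sketch, not the version below)
    __m512i add_epi128_vec(__m512i a, __m512i b)
    {
       __m512i sum = _mm512_add_epi64(a, b);
       // 0 / -1 in each 64-bit lane whose add carried out
       __m512i carry = _mm512_movm_epi64(_mm512_cmplt_epu64_mask(sum, b));
       // vpslldq shifts within each 128-bit lane: each low-half carry moves up into
       // its high half, and high-half carries fall off the end of their own element
       carry = _mm512_bslli_epi128(carry, 8);
       // subtracting -1 adds 1 to the high halves that need a carry
       return _mm512_sub_epi64(sum, carry);
    }

    That's 5 instructions instead of 4, but with no memory constant; whether it wins would depend on vpmovm2q and shuffle latencies on the target microarchitecture.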

    The key difference from a single BigInt add is that each carry-out has to propagate only one step, not ripple all the way to the end, so there isn't a long serial dependency.

    #include <immintrin.h>
    
    __m512i add_epi128(__m512i a, __m512i b)
    {
       __m512i sum = _mm512_add_epi64(a, b);
       __mmask8 carry = _mm512_cmplt_epu64_mask(sum, b);  // a or b doesn't matter, but compilers don't realize that. 
                                                         // For the standalone function, using b lets them overwrite a in ZMM0
                                                         // Of course in reality you want this function to inline.
    
       //carry = _kshiftli_mask8(carry, 1);  // or actually just kaddb is more compact, but compilers miss the optimization
       carry = _kadd_mask8(carry, carry);
       //carry += carry;
    
       const __m512i high_ones = _mm512_setr_epi64(0, -1, 0, -1, 0, -1, 0, -1);
       // Carry-propagation into the high half of each u128, with merge-masking
       sum = _mm512_mask_sub_epi64(sum, carry, sum, high_ones);
       return sum;
    }
    

    kaddb and kshiftlb both have 4-cycle latency on Intel, 1 on Zen 4 (https://uops.info/), but kaddb is one byte shorter (no immediate). I had to use an intrinsic to get GCC and clang to emit it instead of kshiftlb. (Godbolt)

    # clang 18 -O3 -march=x86-64-v4
    .LCPI0_0:
            .quad   0
            .quad   1
              ...
    add_epi128(long long vector[8], long long vector[8]):
            vpaddq  zmm0, zmm1, zmm0
            vpcmpltuq       k0, zmm0, zmm1
            kaddb   k1, k0, k0
            vpaddq  zmm0 {k1}, zmm0, zmmword ptr [rip + .LCPI0_0]
            ret
    

    Assuming this inlines and keeps the vector constant in a register, this is 4 uops for the front end and for the back-end execution units. Critical-path latency is about 10 cycles on Intel from when both vector inputs are ready: 1 cycle each for the two vpaddq instructions, even with merge-masking, plus 4 (or 3) for compare-into-mask vs. 5 on Zen 4, plus 4 for kaddb on Intel vs. 1 on Zen 4.

    If you're doing a lot of these in parallel, e.g. over an array, out-of-order exec should have little trouble finding instruction-level parallelism and keeping execution ports 0 and 5 busy every cycle, for one result per 2 cycles of throughput. But the latency sucks, so if you're summing an array you want to use multiple accumulator vectors.

    Or in a case like that, accumulate the carry-out separately and add it back into the high half at the end so you don't have any long loop-carried dep chains.
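
    A minimal untested sketch of that idea (hypothetical sum_u128_array; a real loop would also want multiple accumulators, as above): keep wrapping 64-bit lane sums plus a separate vector that counts low-half carry-outs, and fold the carries into the high halves once after the loop.

    #include <immintrin.h>
    #include <stddef.h>
    
    // Hypothetical example: wrapping sum of n vectors of four u128 elements each,
    // with carry propagation deferred until after the loop
    __m512i sum_u128_array(const __m512i *p, size_t n)
    {
       __m512i acc   = _mm512_setzero_si512();   // running 64-bit lane sums
       __m512i carry = _mm512_setzero_si512();   // deferred carries for the high halves
       const __m512i one_hi = _mm512_setr_epi64(0, 1, 0, 1, 0, 1, 0, 1);
    
       for (size_t i = 0; i < n; i++) {
          __m512i x = _mm512_loadu_si512(&p[i]);
          acc = _mm512_add_epi64(acc, x);
          __mmask8 c = _mm512_cmplt_epu64_mask(acc, x);  // lanes that wrapped
          c = _kadd_mask8(c, c);                         // aim each carry at its high half
          carry = _mm512_mask_add_epi64(carry, c, carry, one_hi);
       }
       // one carry propagation at the end instead of one per iteration
       return _mm512_add_epi64(acc, carry);
    }

    The only loop-carried dependency chains are the two vector adds (1 cycle each, even the merge-masked one); the compare and kaddb latency is bridged by out-of-order exec instead of lengthening a loop-carried chain.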