I want to perform eight parallel additions of 16-bit values using AVX SIMD. Addition with overflow handling is required, i.e. 'add with carry' like the old x86 "adc" mnemonic performs.
I have implemented half of the AVX solution myself; the carry handling, which should also be performed with AVX instructions, is still missing. My current solution:
#include <stdint.h>
#include <stdio.h>
#include <immintrin.h>

typedef union _uint16vec uint16vec, *uint16vec_ptr;

union __attribute__((aligned(16))) _uint16vec
{
    __m128i  x;
    uint16_t y[8];
};
__m128i parallel_add_with_carry ( __m128i n1, __m128i n2 )
{
    volatile uint16vec res, stupid_carry;
    uint32_t i;

    stupid_carry.x = n1; /* load 8x uint16_t for carry adjustment below */

    __asm__
    (
        "movdqa  %1, %%xmm0               \n\t"
        "movdqa  %2, %%xmm1               \n\t"
        "vpaddw  %%xmm0, %%xmm1, %%xmm0   \n\t"
        "movdqa  %%xmm0, %0               \n\t"
        : "=m" (res.x)          /* output */
        : "m" (n1), "m" (n2)    /* inputs */
        : "xmm0", "xmm1"        /* GCC, please clobber XMM0 and XMM1 */
    );

    /* if any of the eight uint16_t in the result is less than
     * the corresponding previous value, then we have the overflow situation...
     */
    for (i = 0; i < 8; i++)
        res.y[i] += (res.y[i] < stupid_carry.y[i]) ? 1 : 0;

    return res.x;
}
void test ( void )
{
    uint16vec v1 = {0}, v2 = {0}, res;

    v1.y[0] = 0x000A; v2.y[0] = 0x0014; /* 10+20 = 30 (0x001E), no overflow */
    v1.y[1] = 0xFFF0; v2.y[1] = 0x0013; /* 0xFFF0 + 0x0013 = 0x0003 -> overflow -> 0x0004 */

    res.x = parallel_add_with_carry(v1.x, v2.x);
    fprintf(stdout, "%04X | %04X\n", res.y[0], res.y[1]);
}
The object code GCC emits for the function's epilogue is terrible (even with -O3). My question: is there a better AVX-supported solution for the 'add with carry' problem?
My idea was to provide a 128-bit vector { 0x0001, 0x0001, ..., 0x0001 } as a temporary 'carry vector' and add it to the eight uint16_t's if and only if the preceding compare operation resulted in 'less than' for the corresponding uint16_t in the 128-bit vector.
I browsed the Intel documentation and found nice instructions that copy parts of the source vector only if a condition is met.
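In intrinsics, I imagine the idea would look roughly like this (an untested sketch of mine; since SSE/AVX have no unsigned 16-bit compare, I flip the sign bits and use the signed compare instead):

__m128i parallel_add_with_carry_idea(__m128i n1, __m128i n2)
{
    const __m128i ones = _mm_set1_epi16(1);              /* the {0x0001,...} carry vector */
    const __m128i bias = _mm_set1_epi16((short)0x8000);  /* sign-bit flip for the unsigned compare */
    __m128i sum = _mm_add_epi16(n1, n2);                 /* wrapping 16-bit add */
    /* per lane: sum < n1 (unsigned) <=> the addition overflowed */
    __m128i lt = _mm_cmpgt_epi16(_mm_xor_si128(n1, bias),
                                 _mm_xor_si128(sum, bias));
    /* add the carry vector only in the lanes where the compare said 'less than' */
    return _mm_add_epi16(sum, _mm_and_si128(lt, ones));
}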
Any help with this 'AVX thing' is highly appreciated. Thanks.
One simple idea is to leverage unsigned saturation: the result of a saturating addition differs from the wrapping addition exactly in the elements where overflow happens.
__m128i parallel_add_with_carry(__m128i n1, __m128i n2)
{
    // Make an all-ones vector. This is virtually free and does not
    // consume an execution unit.
    const __m128i mm_all_ones = _mm_cmpeq_epi32(n1, n1);
    __m128i mm_sum = _mm_add_epi16(n1, n2);
    // Since there is no "not equal" comparison that produces
    // a vector mask, we compare for "equal" and get a negated
    // carry mask.
    __m128i mm_carry = _mm_cmpeq_epi16(_mm_adds_epu16(n1, n2), mm_sum);
    // Negate the mask, which turns elements that were 0 to -1 and vice versa
    mm_carry = _mm_xor_si128(mm_carry, mm_all_ones);
    // Add carry (i.e. subtract -1 or 0)
    mm_sum = _mm_sub_epi16(mm_sum, mm_carry);
    return mm_sum;
}
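For completeness, since the question is about AVX: the same trick carries over unchanged to 256-bit vectors on AVX2, processing sixteen uint16_t at a time. A sketch (not benchmarked; the function name is just for illustration):

__m256i parallel_add_with_carry_avx2(__m256i n1, __m256i n2)
{
    const __m256i mm_all_ones = _mm256_cmpeq_epi32(n1, n1);
    __m256i mm_sum = _mm256_add_epi16(n1, n2);
    // The saturating add differs from the wrapping add exactly in the lanes that overflowed
    __m256i mm_carry = _mm256_cmpeq_epi16(_mm256_adds_epu16(n1, n2), mm_sum);
    // Negate the mask, then add the carry (i.e. subtract -1 or 0)
    mm_carry = _mm256_xor_si256(mm_carry, mm_all_ones);
    return _mm256_sub_epi16(mm_sum, mm_carry);
}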
The 128-bit version above is compatible with SSE2 and later. Using AVX-512 and opmask registers is possible, but unlikely to be any faster. Here are some estimates (asm code generated by gcc 14.2 with -O3 -march=rocketlake):
If you're processing moderate amounts of data that are likely to fit in cache, it may be beneficial to optimize this code further by unrolling the loop and keeping separate accumulators for the sum and the carry, adding them together only at the end, once all data has been processed. This improves instruction-level parallelism (ILP): the separate accumulations form independent dependency chains and can proceed in parallel, provided the CPU has enough arithmetic execution units and the algorithm isn't bottlenecked elsewhere (e.g. by memory bandwidth).
__m128i parallel_add_with_carry_loop(const unsigned char* data, size_t size)
{
    __m128i mm_sum1 = _mm_setzero_si128(), mm_sum2 = _mm_setzero_si128();
    __m128i mm_carry1 = _mm_setzero_si128(), mm_carry2 = _mm_setzero_si128();
    const __m128i mm_all_ones = _mm_cmpeq_epi32(mm_sum1, mm_sum1);

    for (size_t i = 0u, n = size & (size_t)(-32); i < n; i += 32)
    {
        __m128i mm1 = _mm_loadu_si128((const __m128i*)(data + i));
        __m128i mm2 = _mm_loadu_si128((const __m128i*)(data + i + 16));

        __m128i mm_new_sum1 = _mm_add_epi16(mm_sum1, mm1);
        __m128i mm_new_sum2 = _mm_add_epi16(mm_sum2, mm2);

        __m128i mm_neg_carry1 = _mm_cmpeq_epi16(
            _mm_adds_epu16(mm_sum1, mm1), mm_new_sum1);
        __m128i mm_neg_carry2 = _mm_cmpeq_epi16(
            _mm_adds_epu16(mm_sum2, mm2), mm_new_sum2);

        mm_sum1 = mm_new_sum1;
        mm_sum2 = mm_new_sum2;

        mm_carry1 = _mm_sub_epi16(mm_carry1,
            _mm_xor_si128(mm_neg_carry1, mm_all_ones));
        mm_carry2 = _mm_sub_epi16(mm_carry2,
            _mm_xor_si128(mm_neg_carry2, mm_all_ones));
    }

    // If size is not a multiple of 32, process the tail bytes here
    // ...

    // Combine the accumulated sums and carries. Note that adding the sums
    // may overflow, and we need to account for the carry from this addition.
    mm_sum1 = parallel_add_with_carry(mm_sum1, mm_sum2);
    mm_carry1 = _mm_add_epi16(mm_carry1, mm_carry2);
    mm_sum1 = _mm_add_epi16(mm_sum1, mm_carry1);
    return mm_sum1;
}
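A quick usage sketch (the buffer and its size are made up for illustration): lane-wise summation of an array of uint16_t, reading back the eight per-lane totals through the uint16vec union from the question:

void sum_lanes_example(void)
{
    static uint16_t samples[4096];  /* hypothetical input; byte size is a multiple of 32 */
    uint16vec totals;

    totals.x = parallel_add_with_carry_loop((const unsigned char*)samples,
                                            sizeof(samples));
    for (unsigned i = 0; i < 8; i++)
        printf("lane %u: %04X\n", i, totals.y[i]);
}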