Tags: c, performance, intel, avx2, fma

How to use fused multiply-add in AVX2 for 16-bit packed integers


I know it is possible to do multiply-and-add using a single instruction in AVX2. I want to use a multiply-and-add instruction where each 256-bit AVX2 variable is packed with sixteen 16-bit variables. For instance, consider the example below:

res=a0*b0+a1*b1+a2*b2+a3*b3

Here, each of res, a0, a1, a2, a3, b0, b1, b2, b3 is a 16-bit variable. I have closely followed the discussion. Please find below my code to calculate the example shown above:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>   // for rand() and srand()
#include <immintrin.h>
#include <time.h>
#include "cpucycles.c"

#pragma STDC FP_CONTRACT ON

#define AVX_LEN 16

inline __m256i mul_add(__m256i a, __m256i b, __m256i c) { 
    return _mm256_add_epi16(_mm256_mullo_epi16(a, b), c);
}

void fill_random(int16_t *a, int32_t len){  //to fill up the random array

    int32_t i;

    for(i=0;i<len;i++){     
        a[i]=(int16_t)rand()&0xffff;
    }
}

int main(void){


    int16_t a0[16*AVX_LEN], b0[16*AVX_LEN];
    int16_t a1[16*AVX_LEN], b1[16*AVX_LEN];
    int16_t a2[16*AVX_LEN], b2[16*AVX_LEN];
    int16_t a3[16*AVX_LEN], b3[16*AVX_LEN];
    int16_t res[16*AVX_LEN];


    __m256i a0_avx[AVX_LEN], b0_avx[AVX_LEN];
    __m256i a1_avx[AVX_LEN], b1_avx[AVX_LEN];
    __m256i a2_avx[AVX_LEN], b2_avx[AVX_LEN];
    __m256i a3_avx[AVX_LEN], b3_avx[AVX_LEN];

    __m256i res_avx[AVX_LEN];

    int16_t res_avx_check[16*AVX_LEN];
    int32_t i,j;

    uint64_t mask_ar[4]; //for unloading AVX variables
    mask_ar[0]=~(0UL);mask_ar[1]=~(0UL);mask_ar[2]=~(0UL);mask_ar[3]=~(0UL);
    __m256i mask;
    mask = _mm256_loadu_si256 ((__m256i const *)mask_ar);

    time_t t;
    srand((unsigned) time(&t));


    int32_t repeat=100000;

    uint64_t clock1, clock2, fma_clock;

    clock1=clock2=fma_clock=0;

    for(j=0;j<repeat;j++){
        printf("j : %d\n",j);

        fill_random(a0,16*AVX_LEN);// Generate random data
        fill_random(a1,16*AVX_LEN);
        fill_random(a2,16*AVX_LEN);
        fill_random(a3,16*AVX_LEN);

        fill_random(b0,16*AVX_LEN);
        fill_random(b1,16*AVX_LEN);
        fill_random(b2,16*AVX_LEN);
        fill_random(b3,16*AVX_LEN);


        for(i=0;i<AVX_LEN;i++){ //Load values in AVX variables

            a0_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&a0[i*16]));
            a1_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&a1[i*16]));
            a2_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&a2[i*16]));
            a3_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&a3[i*16]));

            b0_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&b0[i*16]));
            b1_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&b1[i*16]));
            b2_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&b2[i*16]));
            b3_avx[i] = _mm256_loadu_si256 ((__m256i const *) (&b3[i*16]));
        }

        for(i=0;i<AVX_LEN;i++){
            res_avx[i]= _mm256_set_epi64x(0, 0, 0, 0);
        }

        //to calculate a0*b0 + a1*b1 + a2*b2 + a3*b3

        //----standard calculation----
        for(i=0;i<16*AVX_LEN;i++){
            res[i]=a0[i]*b0[i] + a1[i]*b1[i] + a2[i]*b2[i] + a3[i]*b3[i];
        }


        //-----AVX-----

        clock1=cpucycles();


        for(i=0;i<AVX_LEN;i++){ //simple approach

            a0_avx[i]=_mm256_mullo_epi16(a0_avx[i], b0_avx[i]);
            res_avx[i]=_mm256_add_epi16(a0_avx[i], res_avx[i]);

            a1_avx[i]=_mm256_mullo_epi16(a1_avx[i], b1_avx[i]);
            res_avx[i]=_mm256_add_epi16(a1_avx[i], res_avx[i]);

            a2_avx[i]=_mm256_mullo_epi16(a2_avx[i], b2_avx[i]);
            res_avx[i]=_mm256_add_epi16(a2_avx[i], res_avx[i]);

            a3_avx[i]=_mm256_mullo_epi16(a3_avx[i], b3_avx[i]);
            res_avx[i]=_mm256_add_epi16(a3_avx[i], res_avx[i]);

        }

        /*
        for(i=0;i<AVX_LEN;i++){ //FMA approach

            res_avx[i]=mul_add(a0_avx[i], b0_avx[i], res_avx[i]);

            res_avx[i]=mul_add(a1_avx[i], b1_avx[i], res_avx[i]);
            res_avx[i]=mul_add(a2_avx[i], b2_avx[i], res_avx[i]);

            res_avx[i]=mul_add(a3_avx[i], b3_avx[i], res_avx[i]);

        }
        */

        clock2=cpucycles();
        fma_clock = fma_clock + (clock2-clock1);

        //-----Check----

        for(i=0;i<AVX_LEN;i++){ //store avx results for comparison
            _mm256_maskstore_epi64 ((long long *)(res_avx_check + i*16), mask, res_avx[i]); //all-ones mask: plain 256-bit store
        }

        for(i=0;i<16*AVX_LEN;i++){
            if(res[i]!=res_avx_check[i]){

                printf("\n--ERROR--\n");
                return 1;
            }   

        }
    }


    printf("Total time taken is :%llu\n", fma_clock/repeat);

}

The cpucycles code is from ECRYPT and is given below:

#include "cpucycles.h"

long long cpucycles(void)
{
  unsigned long long result;
  /* .byte 15; .byte 49 is the raw encoding of rdtsc (0x0F 0x31) */
  asm volatile(".byte 15;.byte 49;shlq $32,%%rdx;orq %%rdx,%%rax"
    : "=a" (result) ::  "%rdx");
  return result;
}

My gcc --version returns:

gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-36)

I am using

 Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz

When I run this on my computer, I get the following cycle counts for the FMA approach and the simple approach, respectively:

FMA approach : Total time taken is :109
Simple approach : Total time taken is :141

As you can see, the FMA approach is slightly faster, but I expected it to be even faster. I understand that there are many memory accesses in my sample code, which might be the reason for the poor performance. But:

  1. When I dump the assembly, I see almost identical instructions for both approaches, and I do not see any FMA instructions in the FMA version. I don't understand why. Is it because of the _mm256_mullo_epi16 instructions?

  2. Is my approach correct?

  3. Can you please help me to fix this?

I am new to AVX2 programming, so it is quite possible that I have done something non-standard, but I will be glad to clarify anything that is unclear. I thank you all for your help in advance.


Solution

  • x86 doesn't have SIMD-integer FMA / MAC (multiply-accumulate) other than the horizontal pmaddubsw / pmaddwd instructions, which add horizontal pairs into wider elements. (Until AVX512IFMA _mm_madd52lo_epu64 or AVX512_4VNNIW _mm512_4dpwssd_epi32(__m512i, __m512ix4, __m128i *).)
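
    (For comparison, here is a minimal sketch of what that horizontal form gives you through the AVX2 intrinsic _mm256_madd_epi16; the helper name dot_pairs_epi16 is only illustrative. It multiplies corresponding signed 16-bit elements and then adds adjacent pairs of the 32-bit products, so it widens the elements instead of producing the per-element 16-bit result wanted here.)

    #include <immintrin.h>

    // vpmaddwd: multiply corresponding signed 16-bit elements of a and b,
    // then add each adjacent pair of 32-bit products, giving eight 32-bit
    // sums per __m256i. Handy for dot products that need a wider
    // accumulator, but not a per-element 16-bit multiply-add.
    static inline __m256i dot_pairs_epi16(__m256i a, __m256i b)
    {
        return _mm256_madd_epi16(a, b);
    }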

    FP-contract and -ffast-math options have nothing to do with SIMD-integer stuff; integer math is always exact.


    I think your "simple" approach is slower because you're also modifying the input arrays and this doesn't get optimized away, e.g.

    a0_avx[i] = _mm256_mullo_epi16(a0_avx[i], b0_avx[i]);
    

    as well as updating res_avx[i].

    If the compiler doesn't optimize that away, those extra stores might be exactly why it's slower than your mul_add function. rdtsc without a serializing instruction doesn't even have to wait for earlier instructions to execute, let alone retire or commit stores to L1d cache, but extra uops for the front-end are still more to chew through. At only 1 store per clock throughput, that could easily be a new bottleneck.
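
    If you want the counter reads to be ordered more tightly around the timed region, one common pattern (just a sketch, assuming GCC/Clang where __rdtsc() and _mm_lfence() are available via x86intrin.h) is to put lfence on both sides of the rdtsc:

    #include <stdint.h>
    #include <x86intrin.h>   // __rdtsc(), _mm_lfence()

    // Read the TSC with lfence on either side so the read can't be
    // reordered ahead of, or past, the instructions being measured.
    static inline uint64_t cycles_fenced(void)
    {
        _mm_lfence();
        uint64_t t = __rdtsc();
        _mm_lfence();
        return t;
    }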


    FYI, you don't need to copy your inputs to arrays of __m256i. Normally you'd just use SIMD loads on regular data. That's no slower than indexing arrays of __m256i. Your arrays are too big for the compiler to fully unroll and keep everything in registers (like it would for scalar __m256i variables).

    If you'd just used __m256i a0 = _mm256_loadu_si256(...) inside the loop then you could have updated a0 without slowing your code down because it would just be a single local variable that can be kept in a YMM reg.

    But I find it's good style to use new named tmp vars for most steps to make code more self-documenting. Like __m256i ab = ... or sum = .... You could reuse the same sum temporary as you accumulate each aN*bN product.

    You might also use a temporary for the result vector instead of making the compiler optimize away the updates of memory at res_avx[i] until the final one.

    You can use alignas(32) int16_t a0[...]; to make plain arrays aligned for _mm256_load instead of loadu.
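
    Putting those points together, a minimal sketch of how the kernel might look without the arrays of __m256i (it reuses the array names from the question; the function name mul_add_arrays and the LEN constant are only illustrative, and the unrolling is one possible arrangement):

    #include <immintrin.h>
    #include <stdalign.h>   // alignas (C11)
    #include <stdint.h>

    #define LEN (16 * 16)   // total number of int16_t elements

    alignas(32) int16_t a0[LEN], b0[LEN], a1[LEN], b1[LEN];
    alignas(32) int16_t a2[LEN], b2[LEN], a3[LEN], b3[LEN];
    alignas(32) int16_t res[LEN];

    void mul_add_arrays(void)
    {
        for (int i = 0; i < LEN; i += 16) {          // 16 x int16_t per __m256i
            // aligned loads are fine because of alignas(32)
            __m256i va0 = _mm256_load_si256((const __m256i *)&a0[i]);
            __m256i vb0 = _mm256_load_si256((const __m256i *)&b0[i]);
            __m256i va1 = _mm256_load_si256((const __m256i *)&a1[i]);
            __m256i vb1 = _mm256_load_si256((const __m256i *)&b1[i]);
            __m256i va2 = _mm256_load_si256((const __m256i *)&a2[i]);
            __m256i vb2 = _mm256_load_si256((const __m256i *)&b2[i]);
            __m256i va3 = _mm256_load_si256((const __m256i *)&a3[i]);
            __m256i vb3 = _mm256_load_si256((const __m256i *)&b3[i]);

            // named temporaries stay in YMM registers; only the final
            // sum is stored back to memory once per vector of results
            __m256i sum = _mm256_mullo_epi16(va0, vb0);
            sum = _mm256_add_epi16(sum, _mm256_mullo_epi16(va1, vb1));
            sum = _mm256_add_epi16(sum, _mm256_mullo_epi16(va2, vb2));
            sum = _mm256_add_epi16(sum, _mm256_mullo_epi16(va3, vb3));

            _mm256_store_si256((__m256i *)&res[i], sum);
        }
    }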


    Your cpucycles() RDTSC function doesn't need to use inline asm. Use __rdtsc() instead.
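
    For example, a drop-in replacement (a sketch assuming GCC/Clang, where __rdtsc() is provided by x86intrin.h) could be:

    #include <x86intrin.h>   // __rdtsc()

    long long cpucycles(void)
    {
      return (long long)__rdtsc();   // the compiler emits rdtsc itself; no inline asm needed
    }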