[SOLVED] Segmentation fault due to data alignment issue on MIC

I have two arrays, 'fa' and 'tempxyz'. I need to subtract one from the other and store the result in a third array. I am using streaming stores, so the accesses have to be aligned. I aligned both arrays and the third array as well, but I still get a segmentation fault. For a streaming store, the arrays should be 64-byte aligned. Does this mean that every element of the array should be 64 bytes apart, so that every element's address is a multiple of 64? My code snippet is below. Kindly help me out.

main()
{
 double *force = ( double * ) _mm_malloc ( (nd * np )* sizeof ( double ),64);  
                  // np can be any number (np=1000, 2000, etc.)
                  // nd = 3
 __declspec(align(64)) double array[np*nd];
 compute (force, array);
}

void compute (double *f, double array[np*nd])
{
  __declspec(align(64)) double fa[8], tempxyz[8];

   for(k=0;k<np;k++)
   {   

   __assume_aligned(f,64);
   __assume((k*nd) % 8 == 0);

   for ( i = 0; i < nd; i++ )
   {
    f[i+k*nd] = 0.0;      
   }

   // Doing some computation on array and storing it in fa.

   fa[0] = array[k*nd+0];
   fa[1] = array[k*nd+1];
   fa[2] = array[k*nd+2];

   __m512d y1, y2, y3;

   __assume_aligned(&fa,64);
   __assume_aligned(&tempxyz,64);

   // Want to load 3 elements at a time, subtract all the three 
   // and store it at a memory location.

   y1 = _mm512_load_pd(fa);
   y2 = _mm512_load_pd(tempxyz);
   y3 = _mm512_sub_pd(y1,y2); 

   __assume_aligned(f,64);
   __assume((k*nd) % 8 == 0);    // Here nd=3 and k is loop index variable.    
   _mm512_storenr_pd((f+k*nd), y3);  // streaming store instruction 
                                     //   --- GIVING SEG. FAULT !!!

  } // end of k loop

}// end of compute function

Solution

The array 'force' is 64-byte aligned, but that only guarantees that its base address is a multiple of 64. Every address passed to an aligned 512-bit access (_mm512_load_pd, _mm512_storenr_pd) must itself be a multiple of 64, and each such access moves 8 doubles (64 bytes) at once. (f + k * nd) points to element index 3 when k=1, to index 6 when k=2, and so on. Element index 3 starts at byte offset 3 * 8 = 24, which is not a multiple of 64, and that is why the streaming store segfaults (similarly for other k values; with nd=3, only values of k that are multiples of 8 land on a 64-byte boundary). The statement __assume((k*nd) % 8 == 0) does not change this: it only tells the compiler to assume a condition that is false here. So the individual elements do not need to be 64 bytes apart, but the formula (f + k * nd) must be changed so that every address it produces is a multiple of 64.
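
One way to change the formula, sketched below under stated assumptions, is to pad each particle to a full cache line so that the stride becomes 8 doubles instead of nd=3. The sketch is plain portable C, so it only checks the address arithmetic and does not issue the MIC intrinsics. The names CACHE_LINE, PAD_STRIDE, is_aligned64, count_aligned and alloc_padded are hypothetical helpers, not part of the original code, and C11 aligned_alloc stands in for _mm_malloc.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed names for illustration; not from the original post. */
#define CACHE_LINE 64                               /* bytes per aligned 512-bit access */
#define PAD_STRIDE (CACHE_LINE / sizeof(double))    /* 8 doubles = one line per particle */

/* Returns 1 if p lies on a 64-byte boundary, 0 otherwise. */
static int is_aligned64(const void *p)
{
    return ((uintptr_t)p % CACHE_LINE) == 0;
}

/* Counts how many of the addresses base + k*stride, k = 0..np-1,
 * are 64-byte aligned. The caller must make sure that all of these
 * addresses lie inside the allocation that base points to. */
static size_t count_aligned(const double *base, size_t np, size_t stride)
{
    size_t n = 0;
    assert(is_aligned64(base));   /* the base itself must be aligned */
    for (size_t k = 0; k < np; k++)
        n += (size_t)is_aligned64(base + k * stride);
    return n;
}

/* Allocates storage for np particles, each padded to one cache line.
 * The size is a multiple of 64, as C11 aligned_alloc requires.
 * Returns NULL on failure; release with free(). */
static double *alloc_padded(size_t np)
{
    return aligned_alloc(CACHE_LINE, np * PAD_STRIDE * sizeof(double));
}
```

With a buffer from alloc_padded(16), count_aligned(f, 16, 3) finds only two aligned addresses (k=0 and k=8), while count_aligned(f, 16, PAD_STRIDE) finds all 16. In the original loop this would correspond to storing at (f + k * 8) instead of (f + k * nd), at the cost of 8 doubles of memory per particle instead of 3. A side benefit is that each 8-double store then covers only its own particle's line, whereas with a stride of 3 it would also overwrite the neighbouring particles' values.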