I have two arrays, 'fa' and 'tempxyz'. I need to subtract one from the other and store the result in a third array. I am using streaming stores, so the accesses need to be aligned. I aligned both arrays and the third array as well, but I am still getting a seg. fault. For a streaming store, the arrays should be 64-byte aligned. Does this mean that every element of the array should be 64 bytes apart, so that every element's address is a multiple of 64? My code snippet is below. Kindly help me out.
main()
{
    double *force = ( double * ) _mm_malloc ( (nd * np )* sizeof ( double ),64);
    // np can be any number (np=1000, 2000, etc.)
    // nd = 3
    __declspec(align(64)) double array[np*nd];
    compute (force, array);
}
void compute (double *f, double array[np*nd])
{
    __declspec(align(64)) double fa[8], tempxyz[8];
    for(k=0;k<np;k++)
    {
        __assume_aligned(f,64);
        __assume((k*nd) % 8 == 0);
        for ( i = 0; i < nd; i++ )
        {
            f[i+k*nd] = 0.0;
        }
        // Doing some computation on array and storing it in fa.
        fa[0] = array[k*nd+0];
        fa[1] = array[k*nd+1];
        fa[2] = array[k*nd+2];
        __m512d y1, y2, y3;
        __assume_aligned(&fa,64);
        __assume_aligned(&tempxyz,64);
        // Want to load 3 elements at a time, subtract all the three
        // and store it at a memory location.
        y1 = _mm512_load_pd(fa);
        y2 = _mm512_load_pd(tempxyz);
        y3 = _mm512_sub_pd(y1,y2);
        __assume_aligned(f,64);
        __assume((k*nd) % 8 == 0); // Here nd=3 and k is loop index variable.
        _mm512_storenr_pd((f+k*nd), y3); // streaming store instruction
        // --- GIVING SEG. FAULT !!!
    } // end of k loop
} // end of compute function
The array 'force' is 64-byte aligned, but that only guarantees that its first element starts at an address that is a multiple of 64. Each 512-bit load or store moves 8 doubles (64 bytes) at once, and it is the address passed to that instruction that must be a multiple of 64, not the address of every individual element. (f + k * nd) points to element 3 when k=1, to element 6 when k=2, and so on. Element 3 starts at byte offset 24 and element 6 at byte offset 48, neither of which is a multiple of 64, and that is why the segfault occurs (similarly for every other k that is not a multiple of 8). So the formula (f + k * nd) itself should be changed so that every address it produces is a multiple of 64.
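
The offset arithmetic can be checked with a few lines of plain C. This is only a sketch, not part of the original code: the helper names (store_byte_offset, store_is_aligned, padded_index) are made up for illustration, and the padded layout is just one possible way to change the formula, assuming it is acceptable to give each k a full 8-double slot and to allocate np * 8 doubles for 'force' instead of np * nd.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_BYTES   64                              /* one 512-bit vector      */
#define VEC_DOUBLES (VEC_BYTES / sizeof(double))    /* 8 doubles per vector    */

/* Byte offset of element (k * stride) from the start of a double array. */
size_t store_byte_offset(size_t k, size_t stride)
{
    return k * stride * sizeof(double);
}

/* Returns 1 if a 64-byte store at base + k * stride is aligned, assuming
   base itself is 64-byte aligned; returns 0 otherwise. */
int store_is_aligned(size_t k, size_t stride)
{
    return store_byte_offset(k, stride) % VEC_BYTES == 0;
}

/* Hypothetical padded layout: every k owns a full slot of 8 doubles, so
   each slot starts on a 64-byte boundary. Component i of entry k lives at
   the returned index; only the first nd components carry real data. */
size_t padded_index(size_t k, size_t i)
{
    assert(i < VEC_DOUBLES);    /* a slot holds at most 8 components */
    return k * VEC_DOUBLES + i;
}
```

With a stride of nd = 3, store_is_aligned(1, 3) is 0 because the offset is 24 bytes, whereas with a stride of 8 every k is aligned. The price of the padded layout is memory: 8 doubles are reserved per entry where only 3 are used.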