I want to parallelize a particle-method-based fluid flow code written in C using OpenACC. I am quite new to OpenACC and am trying to understand its basics while applying it to the code on a multicore machine; later I plan to offload it to a GPU. I have added some #pragmas to the for loops in the code. In one portion of the code, when I compile without -fast it builds without any problems but parallelizes only the outer loop; when I compile with -fast, the compiler reports data dependency messages and the inner loop(s) are not parallelized. I have tried many of the suggestions in the available literature, including declaring the pointers with restrict and using atomic and routine directives, but so far nothing has worked. An abridged version of that part of the code is below:
// the code intends to compute the total number of neighbour particles of "iParticle" in
// particle.numberOfNeighborParticles[iParticle] and saves the list of these neighbour particles in
// particle.neighborTable[iParticle][Neigh]
int iX, iY;
#pragma acc parallel loop private(iX, iY)   // line 98
for (iParticle = 0; iParticle < particle.totalNumber; iParticle++)
{
    BUCKET_findBucketWhereParticleIsStored(&iX, &iY, iParticle, position);
    #pragma acc loop seq   // line 133
    for (jX = iX - 1; jX <= iX + 1; jX++)
    {
        .....
        #pragma acc loop seq   // line 179
        for (jY = iY - 1; jY <= iY + 1; jY++)
        {
            ......
            #pragma acc loop   // line 186
            for (iStoredParticle = 0; iStoredParticle < domain.bucket[jX][jY].count; iStoredParticle++)
            {
                jParticle = domain.bucket[jX][jY].list[iStoredParticle];
                xij = (position[XDIM][jParticle] - position[XDIM][iParticle]);
                distanceIJ_squared = xij * xij;
                yij = (position[YDIM][jParticle] - position[YDIM][iParticle]);
                distanceIJ_squared += yij * yij;
                if (distanceIJ_squared > parameter.maxRadius_squared)
                    continue;
                NEIGH_addParticleInNeighborTable(iParticle, jParticle,
                    particle.numberOfNeighborParticles, particle.neighborTable);
            }
        }
    }
}
The NEIGH_addParticleInNeighborTable() function is as follows:
void
NEIGH_addParticleInNeighborTable(
     int iParticle
    ,int jParticle
    ,int *restrict numberOfNeighborParticles
    ,int **restrict neighborTable
){
    int iNeigh;
    iNeigh = numberOfNeighborParticles[iParticle];
    neighborTable[iParticle][iNeigh] = jParticle;
    #pragma acc atomic
    numberOfNeighborParticles[iParticle]++;
}
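For reference, the routine directive I mentioned trying takes roughly this form (a sketch, not my exact code):

// Declare the helper as a sequential device routine so it can be called
// from inside the parallel region; the directive must also be visible
// in the compilation unit where the function is defined.
#pragma acc routine seq
void NEIGH_addParticleInNeighborTable(
     int iParticle
    ,int jParticle
    ,int *restrict numberOfNeighborParticles
    ,int **restrict neighborTable
);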
EDIT:
I have added pseudocode below, which is quite similar to my problem, to elaborate the issue:
// This pseudocode intends to find the contiguous states, from a given list, for each state of the US
count = 0;
// state[] is a list of all the states of the US
#pragma acc parallel loop gang
for (i = 0; i < no_of_states_in_US; i++)
{
    iState = state[i];
    #pragma acc loop vector
    for (j = 0; j < no_of_states_to_check_from_for[iState]; j++) { // no_of_states_to_check_from_for[iState] may be 5
        jState = names_of_states_to_check_for_iState[j]; // for KS the given states to check from may be CO, NE, CA, UT and OK
        // some logic to check whether jState is contiguous to iState
        if (jState is_NOT_adjacent_to_iState) continue;
        // race condition occurs below if the inner loop is vectorized, but no race condition if only the outer loop is parallelized
        // Any suggestion / workaround to vectorize the inner loop here and avoid the race condition would be helpful
        contiguous_state[iState][count] = jState;
        #pragma acc atomic   // ?? does not seem to work
        count++;
    }
}
I am interested in vectorizing the inner loop because this portion of the code is among the computationally intensive parts and is executed several times in the code. I am using PGI 19.4 Community Edition on Windows 10. Any help in this regard would be appreciated. Thanks in advance.
For the second, edited question: so long as the order in which count is updated doesn't matter, you can use an atomic capture instead. This captures count's value into a local variable so you don't need to worry about it changing between the read and the update. Something like:
int cnt;
#pragma acc atomic capture
{
    cnt = count;
    count++;
}
contiguous_state[iState][cnt] = jState;
If the order of count does matter, then the loop is not parallelizable.
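The same capture pattern should apply to the neighbor-table code from the first part of the question, assuming the order of entries within each row of neighborTable doesn't matter either. A minimal sketch of the helper rewritten that way:

void
NEIGH_addParticleInNeighborTable(
     int iParticle
    ,int jParticle
    ,int *restrict numberOfNeighborParticles
    ,int **restrict neighborTable
){
    int iNeigh;
    // Read the current count and increment it in a single atomic step so that
    // concurrent iterations of the inner loop each receive a distinct slot.
    #pragma acc atomic capture
    {
        iNeigh = numberOfNeighborParticles[iParticle];
        numberOfNeighborParticles[iParticle]++;
    }
    neighborTable[iParticle][iNeigh] = jParticle;
}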