I am trying to parallelize matrix multiplication. I came across a Fox algorithm which can break the whole thing into 3 step. You can look up the algorithm. I have attached an image.enter image description here The Code:
void matmul_fox(int *matrixA, int *matrixB, int *zeroMatrix, int n){
#pragma acc kernels copyin(matrixA[0:n*n], matrixB[0:n*n]) copy(zeroMatrix[0:n*n])
{
// Operation 1
#pragma acc loop independent
for (int i = 0; i < n; i++){ // Extract the diagonal from arr1
#pragma acc loop independent
for (int j = 0; j < n; j++){ // Add arr3 and multiplication of diag(broadcasted)*arr2
#pragma acc loop independent reduction (+: zeroMatrix[0:n*n])
for (int k = 0; k < n; k++){
zeroMatrix[j * n + k] += matrixA[j * n + j] * matrixB[j * n + k];
}
}
// Operation 2
// Shift-up on the rows of arr2
#pragma acc loop independent
for (int a = 0; a < n; a++) {
int temp1 = matrixB[0 * n + a];
#pragma acc loop independent
for (int j = 0; j < n - 1; j++) {
matrixB[j * n + a] = matrixB[(j + 1) * n + a];
}
matrixB[(n - 1) * n + a] = temp1;
}
// Operation 3
// Shift-left on the rows of arr1
#pragma acc loop independent
for (int b = 0; b < n; b++){
int temp2 = matrixA[b * n + 0]; // Store the first element of the row
#pragma acc loop independent
for (int j = 0; j < n - 1; j++){
matrixA[b * n + j] = matrixA[b * n + (j+1)]; // Shift elements to the left
}
matrixA[b * n + (n - 1)] = temp2; // Place the first element at the end
}
} }
}
Since Operation 2 and Operation 3 have dependencies, I am not able to parallelize these 3 operations all together. However I read somewhere, that we can use clauses like async, wait and update to tackle this issue. Maybe Operation 2 and Operation 3 can wait till Operation 1 is done, and once it's done Operation 2 and Operation 3 can be done parallelly as they have no dependencies between them.
I tried without using wait clauses, but since there are visible dependencies, it's giving the wrong result. Without using OpenACC, the code works properly, but the execution time is more than the traditional sequential algorithm. If the Fox algorithm can work parallelly, I can expect better performance.
The outer "i" loop isn't parallizable nor are the inner "j" loops in Operations 2 and 3. You'll need to only parallelize the middle loops using three offload regions. Given Operations 2 and 3 will be rather small, I'd also launch these two offload regions concurrently via the OpenACC async clauses. The additional wait clauses ensure Operation 1 is complete before launching 2 and 3.
For example:
// Parallel implementation using Fox algorithm
void matmul_fox(int *matrixA, int *matrixB, int *zeroMatrix, int n){
#pragma acc data copyin(matrixA[0:n*n], matrixB[0:n*n]) copy(zeroMatrix[0:n*n])
{
for (int i = 0; i < n; i++){ // Extract the diagonal from matrixA
#pragma acc parallel loop collapse(2) async(1) wait(2,3)
for (int j = 0; j < n; j++){ // Add zeroMatrix and multiplication of diag(matrixA)*matrixB
for (int k = 0; k < n; k++){
zeroMatrix[j * n + k] += matrixA[j * n + j] * matrixB[j * n + k];
}
}
// Cyclic Shift-up on the rows of matrixB
#pragma acc parallel loop async(2) wait(1)
for (int a = 0; a < n; a++) {
int temp1 = matrixB[0 * n + a];
for (int j = 0; j < n - 1; j++) {
matrixB[j * n + a] = matrixB[(j + 1) * n + a];
}
matrixB[(n - 1) * n + a] = temp1;
}
// Cyclic Shift-left on the rows of matrixA
#pragma acc parallel loop async(3) wait(1)
for (int b = 0; b < n; b++){
int temp2 = matrixA[b * n + 0]; // Store the first element of the row
for (int j = 0; j < n - 1; j++){
matrixA[b * n + j] = matrixA[b * n + (j+1)]; // Shift elements to the left
}
matrixA[b * n + (n - 1)] = temp2; // Place the first element at the end
}
}
#pragma acc wait
}
}