I am using OpenACC to compare the execution time of parallelized vs. non-parallelized matrix multiplication on the CPU, using PGI Community Edition 19.10 on Windows. The code I am using is:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
int main()
{
    // seed the random number generator
    srand(42);
    // Pick some arbitrary constraints to make the problem harder
    const int SIZE_XY = 1000;
    const int MIN_VAL = 5000;
    const int MAX_VAL = 7000000;
    int i, j, k; // iterators
    double time_spent = 0.0;
    clock_t begin = clock();
    // Generate two 2D arrays to be filled with random numbers
    // and an array, c, with all 0s
    int *a[SIZE_XY];
    int *b[SIZE_XY];
    int *c[SIZE_XY];
    for (i = 0; i < SIZE_XY; i++)
    {
        a[i] = (int *)malloc(SIZE_XY * sizeof(int));
        b[i] = (int *)malloc(SIZE_XY * sizeof(int));
        c[i] = (int *)malloc(SIZE_XY * sizeof(int));
    }
    #pragma acc kernels
    {
        for (i = 0; i < SIZE_XY; i++)
        {
            for (j = 0; j < SIZE_XY; j++)
            {
                a[i][j] = (rand() % MAX_VAL) + MIN_VAL;
                b[i][j] = (rand() % MAX_VAL) + MIN_VAL;
                c[i][j] = 0;
            }
        }
    }
    printf("Array A allocated and filled with random numbers ...\n");
    printf("Array B allocated and filled with random numbers ...\n");
    printf("Array C initialized ...\n");
    // Dot product the two arrays together into c
    #pragma acc kernels //HERE
    {
        for (i = 0; i < SIZE_XY; i++)
        {
            for (j = 0; j < SIZE_XY; j++)
            {
                for (k = 0; k < SIZE_XY; k++)
                {
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
                }
            }
        }
    }
    printf("Matrices multiplied ...\n");
    printf("The first three values of A x B are %d, %d, %d\n", c[0][0], c[0][1], c[0][2]);
    clock_t end = clock();
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Time elpased is %f seconds", time_spent);
}
When I run the following command in the PGI command prompt:
pgcc -acc -ta=multicore -Minfo=all,accel matrixACC.c
I receive the following:
59, Complex loop carried dependence of a->,c->,b-> prevents parallelization
62, Complex loop carried dependence of a->,c->,b-> prevents parallelization
64, Complex loop carried dependence of a->,c->,b-> prevents parallelization
Loop carried dependence due to exposed use of c[i1][i2] prevents parallelization
Could I please get some help understanding why this is occurring, and how I could parallelize the loops that calculate the matrix multiplication?
Thank you
The compiler is unable to determine if your 3 pointer variables (a, b, c) might ever alias one another. If they were to alias one another in some way, then the independence of calculating any particular c[i][j] could not be ascertained, and it would not be able to properly parallelize (any of) the loops.
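For example (a hypothetical situation, not something your code actually does), nothing in the declarations rules out two of the row pointers referring to the same storage:

    int *row = (int *)malloc(SIZE_XY * sizeof(int));
    a[1] = row;
    c[0] = row; // now c[0][j] and a[1][j] are the same memory
    // writing c[0][j] in one i-iteration would change the a[1][k] values
    // read in another i-iteration, so the iterations would not be independent

The compiler must assume such a situation is possible unless you tell it otherwise.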
One possible method to address this is to inform the compiler that you, as the programmer, guarantee that (for example) the first loop represents independent activity (amongst its various iterations). You can do this by placing #pragma acc loop independent immediately before the first for-loop statement. For the matrix sizes you have chosen here (and a multicore target), that will give you plenty of exposed parallelism. (The compiler will still emit Minfo messages about the non-parallelization of the other loops, but that is OK, most likely. For a multicore target, having 1000 parallel work items should be plenty to get good performance.)
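For reference, this is the placement being described (just a sketch; the loop bodies are unchanged from your code):

    #pragma acc kernels
    {
        #pragma acc loop independent // programmer asserts the i iterations are independent
        for (i = 0; i < SIZE_XY; i++)
        {
            // j and k loops exactly as before
        }
    }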
Note that your calculations will easily overflow int storage for the initialization range you have chosen. You'll get nonsense results.
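To see the magnitude of the problem: each element is at least 5000 and can be as large as about 7,005,000, so a single product a[i][k] * b[k][j] can be on the order of 5e13, while INT_MAX is only about 2.1e9, and 1000 such products are then summed per output element. A hypothetical alternative to dropping the random initialization (as the code below does) would be to accumulate and store in a 64-bit type, along these lines (sketch only; c would need long long elements and a %lld format):

    // Sketch, not part of the original code: 64-bit accumulation avoids the overflow.
    long long sum = 0;
    for (k = 0; k < SIZE_XY; k++)
        sum += (long long)a[i][k] * b[k][j]; // promote before multiplying
    c[i][j] = sum; // requires c to hold long long values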
The following code has possible ways to address the above items:
$ cat t1.c
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
int main()
{
    // seed the random number generator
    srand(42);
    // Pick some arbitrary constraints to make the problem harder
    const int SIZE_XY = 1000;
    const int MIN_VAL = 5000;
    const int MAX_VAL = 7000000;
    int i, j, k; // iterators
    double time_spent = 0.0;
    clock_t begin = clock();
    // Generate two 2D arrays to be filled with random numbers
    // and an array, c, with all 0s
    int * restrict a[SIZE_XY];
    int * restrict b[SIZE_XY];
    int * restrict c[SIZE_XY];
    for (i = 0; i < SIZE_XY; i++)
    {
        a[i] = (int *)malloc(SIZE_XY * sizeof(int));
        b[i] = (int *)malloc(SIZE_XY * sizeof(int));
        c[i] = (int *)malloc(SIZE_XY * sizeof(int));
    }
    #pragma acc kernels
    {
        for (i = 0; i < SIZE_XY; i++)
        {
            for (j = 0; j < SIZE_XY; j++)
            {
                a[i][j] = 1; //(rand() % MAX_VAL) + MIN_VAL;
                b[i][j] = 1; //(rand() % MAX_VAL) + MIN_VAL;
                c[i][j] = 0;
            }
        }
    }
    printf("Array A allocated and filled with random numbers ...\n");
    printf("Array B allocated and filled with random numbers ...\n");
    printf("Array C initialized ...\n");
    // Dot product the two arrays together into c
    #pragma acc kernels //HERE
    {
        #pragma acc loop independent
        for (i = 0; i < SIZE_XY; i++)
        {
            for (j = 0; j < SIZE_XY; j++)
            {
                for (k = 0; k < SIZE_XY; k++)
                {
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
                }
            }
        }
    }
    printf("Matrices multiplied ...\n");
    printf("The first three values of A x B are %d, %d, %d\n", c[0][0], c[0][1], c[0][2]);
    clock_t end = clock();
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Time elpased is %f seconds", time_spent);
}
$ gcc -o t1 t1.c -std=c99
$ pgcc -acc -ta=multicore -Minfo=all,accel t1.c -o t1p
"t1.c", line 21: warning: use of a const variable in a constant expression is
nonstandard in C
int * restrict a[SIZE_XY];
^
"t1.c", line 22: warning: use of a const variable in a constant expression is
nonstandard in C
int * restrict b[SIZE_XY];
^
"t1.c", line 23: warning: use of a const variable in a constant expression is
nonstandard in C
int * restrict c[SIZE_XY];
^
"t1.c", line 11: warning: variable "MIN_VAL" was declared but never referenced
const int MIN_VAL = 5000;
^
"t1.c", line 12: warning: variable "MAX_VAL" was declared but never referenced
const int MAX_VAL = 7000000;
^
main:
33, Loop is parallelizable
Generating Multicore code
33, #pragma acc loop gang
35, Loop is parallelizable
52, Loop is parallelizable
Generating Multicore code
52, #pragma acc loop gang
54, Complex loop carried dependence of a->,c->,b-> prevents parallelization
56, Complex loop carried dependence of a->,c->,b-> prevents parallelization
Loop carried dependence of c-> prevents parallelization
Loop carried backward dependence of c-> prevents vectorization
$ time ./t1
Array A allocated and filled with random numbers ...
Array B allocated and filled with random numbers ...
Array C initialized ...
Matrices multiplied ...
The first three values of A x B are 1000, 1000, 1000
Time elpased is 9.010000 seconds
real 0m9.079s
user 0m9.019s
sys 0m0.061s
$ time ./t1p
Array A allocated and filled with random numbers ...
Array B allocated and filled with random numbers ...
Array C initialized ...
Matrices multiplied ...
The first three values of A x B are 1000, 1000, 1000
Time elpased is 20.140000 seconds
real 0m0.563s
user 0m20.053s
sys 0m0.132s
$
On my machine, the code compiled with gcc takes about 9 seconds of wall-clock time, whereas the code compiled with the PGI OpenACC compiler takes about 0.5 seconds. (The internal clock() measurement in the multicore build reports about 20 seconds because clock() sums CPU time across all threads; the wall-clock time reported by time is the relevant number here.)
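If you want the program itself to report wall-clock time rather than summed CPU time, one option (a sketch assuming a POSIX system, unlike the Windows setup in the question) is clock_gettime:

    #include <time.h>
    struct timespec ts0, ts1;
    clock_gettime(CLOCK_MONOTONIC, &ts0);
    // ... the work being timed ...
    clock_gettime(CLOCK_MONOTONIC, &ts1);
    double wall = (ts1.tv_sec - ts0.tv_sec) + (ts1.tv_nsec - ts0.tv_nsec) / 1e9;
    printf("Wall-clock time is %f seconds\n", wall);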
As an aside, I personally would generally avoid the array allocation method you have chosen, because the various malloc operations are not guaranteed to result in adjacent/contiguous allocations. However, for a multicore target the code works fine as-is.
To address this concern, I'd recommend some changes to your code, like this:
$ cat t1.c
#include <time.h>
#include <stdlib.h>
#include <stdio.h>
typedef int mt;
#define SIZE_XY 1000
typedef mt mat[SIZE_XY];
int main()
{
    // seed the random number generator
    srand(42);
    // Pick some arbitrary constraints to make the problem harder
    int i, j, k; // iterators
    double time_spent = 0.0;
    clock_t begin = clock();
    // Generate two 2D arrays to be filled with random numbers
    // and an array, c, with all 0s
    mat * restrict a;
    mat * restrict b;
    mat * restrict c;
    a = (mat *)malloc(SIZE_XY*SIZE_XY * sizeof(mt));
    b = (mat *)malloc(SIZE_XY*SIZE_XY * sizeof(mt));
    c = (mat *)malloc(SIZE_XY*SIZE_XY * sizeof(mt));
    #pragma acc kernels
    {
        for (i = 0; i < SIZE_XY; i++)
        {
            for (j = 0; j < SIZE_XY; j++)
            {
                a[i][j] = 1; //(rand() % MAX_VAL) + MIN_VAL;
                b[i][j] = 1; //(rand() % MAX_VAL) + MIN_VAL;
                c[i][j] = 0;
            }
        }
    }
    printf("Array A allocated and filled with random numbers ...\n");
    printf("Array B allocated and filled with random numbers ...\n");
    printf("Array C initialized ...\n");
    // Dot product the two arrays together into c
    #pragma acc kernels
    {
        for (i = 0; i < SIZE_XY; i++)
        {
            for (j = 0; j < SIZE_XY; j++)
            {
                for (k = 0; k < SIZE_XY; k++)
                {
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
                }
            }
        }
    }
    printf("Matrices multiplied ...\n");
    printf("The first three values of A x B are %d, %d, %d\n", c[0][0], c[0][1], c[0][2]);
    clock_t end = clock();
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
    printf("Time elpased is %f seconds", time_spent);
}
$ gcc -o t1 t1.c -std=c99 -O3
$ pgcc -acc -ta=multicore -Minfo=all,accel t1.c -o t1p
main:
32, Loop is parallelizable
Generating Multicore code
32, #pragma acc loop gang
34, Loop is parallelizable
51, Loop is parallelizable
Generating Multicore code
51, #pragma acc loop gang
53, Loop is parallelizable
55, Complex loop carried dependence of c-> prevents parallelization
Loop carried dependence of c-> prevents parallelization
Loop carried backward dependence of c-> prevents vectorization
$ time ./t1
Array A allocated and filled with random numbers ...
Array B allocated and filled with random numbers ...
Array C initialized ...
Matrices multiplied ...
The first three values of A x B are 1000, 1000, 1000
Time elpased is 0.650000 seconds
real 0m0.708s
user 0m0.663s
sys 0m0.047s
$ time ./t1p
Array A allocated and filled with random numbers ...
Array B allocated and filled with random numbers ...
Array C initialized ...
Matrices multiplied ...
The first three values of A x B are 1000, 1000, 1000
Time elpased is 17.510000 seconds
real 0m0.499s
user 0m17.466s
sys 0m0.093s
$
(gcc 4.8.5, pgcc 20.5-0, Xeon E5-2690 v2, 40 cores total)
There are a few advantages:
- We use the restrict keyword to communicate our intent to the compiler, without the use of an extra pragma.
- There is a single, contiguous allocation each for a, for b, and for c, resulting in a simpler behavior if you decide to switch from multicore to an accelerator target.
- The host-compiled build (gcc) emits code that runs almost as fast as the OpenACC code (~0.7s vs. ~0.5s).
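As a side note on the first point: restrict is a promise from the programmer that, within the pointer's scope, the pointed-to data is accessed only through that pointer, so the compiler does not have to assume aliasing. A minimal standalone illustration (not taken from the code above):

    // With restrict, the compiler may assume out and in never overlap,
    // so it can vectorize/parallelize the loop without an overlap check.
    void scale(int n, float * restrict out, const float * restrict in)
    {
        for (int i = 0; i < n; i++)
            out[i] = 2.0f * in[i];
    }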