cparallel-processingopenaccpgi

OpenACC - Complex loop carried dependence of a->,c->,b-> prevents parallelization


I am using OpenACC to compare the execution time of a parallelized vs non-parallelized matrix multiplication operation on the CPU using PGI Community Edition 19.10 (on Windows). The code I am using is:

#include <time.h>
#include <stdlib.h>

int main()
{
    // seed the random number generator
    srand(42);

    // Pick some arbitrary constraints to make the problem harder
    const int SIZE_XY = 1000;
    const int MIN_VAL = 5000;   
    const int MAX_VAL = 7000000;

    int i, j, k; // iterators

    double time_spent = 0.0;
    clock_t begin = clock();

    // Generate two 2D arrays to be filled with random numbers
    // and an array, c, with all 0s
    int *a[SIZE_XY];
    int *b[SIZE_XY];
    int *c[SIZE_XY];
    for (i = 0; i < SIZE_XY; i++)
    {
        a[i] = (int *)malloc(SIZE_XY * sizeof(int));
        b[i] = (int *)malloc(SIZE_XY * sizeof(int));
        c[i] = (int *)malloc(SIZE_XY * sizeof(int));
    }

    #pragma acc kernels
    {
        for (i = 0; i < SIZE_XY; i++)
        {
            for (j = 0; j < SIZE_XY; j++)
            {
                a[i][j] = (rand() % MAX_VAL) + MIN_VAL;
                b[i][j] = (rand() % MAX_VAL) + MIN_VAL;
                c[i][j] = 0;
            }
        }
    }

    printf("Array A allocated and filled with random numbers ...\n");
    printf("Array B allocated and filled with random numbers ...\n");
    printf("Array C initialized ...\n");

    // Dot product the two arrays together into c
    #pragma acc kernels //HERE
    {
        for (i = 0; i < SIZE_XY; i++)
        {
            for (j = 0; j < SIZE_XY; j++)
            {
                for (k = 0; k < SIZE_XY; k++)
                {
                    c[i][j] = c[i][j] + a[i][k] * b[k][j];
                }
            }
        }
    }

    printf("Matrices multiplied ...\n");
    printf("The first three values of A x B are %d, %d, %d\n", c[0][0], c[0][1], c[0][2]);

    clock_t end = clock();

    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;

    printf("Time elpased is %f seconds", time_spent);
}

When I run the following command in the PGI CMD: pgcc -acc -ta=multicore -Minfo=all,accel matrixACC.c I receive the following:

59, Complex loop carried dependence of a->,c->,b-> prevents parallelization
62, Complex loop carried dependence of a->,c->,b-> prevents parallelization
64, Complex loop carried dependence of a->,c->,b-> prevents parallelization
    Loop carried dependence due to exposed use of c[i1][i2] prevents parallelization 

Could I please get some help understanding why this is occurring and how I would be able to parallelize the loops calculating the matrix multiplication.

Thank you


Solution

  • The compiler is unable to determine if your 3 pointer variables (a, b, c) might ever alias one another. If they were to alias one another in some way, then the independence of calculating any particular c[i][j] could not be ascertained, and it would not be able to properly parallelize (any of) the loops.

    One possible method to address this is to inform the compiler that you guarantee as the programmer that (for example) the first loop represents independent activity (amongst its various iterations). You can do this by placing #pragma acc loop independent immediately before the first for-loop statement. For the matrix sizes you have chosen here (and multicore target), that will give you plenty of exposed parallelism. (The compiler will still emit Minfo messages about the non-parallelization of other loops, but that is OK, most likely. For a multicore target, having 1000 parallel work items should be plenty to get good performance).

    Note that your calculations will easily overflow int storage for the initialization range you have chosen. You'll get nonsense results.

    The following code has possible ways to address the above items:

    $ cat t1.c
    #include <time.h>
    #include <stdlib.h>
    #include <stdio.h>
    int main()
    {
        // seed the random number generator
        srand(42);
    
        // Pick some arbitrary constraints to make the problem harder
        const int SIZE_XY = 1000;
        const int MIN_VAL = 5000;
        const int MAX_VAL = 7000000;
    
        int i, j, k; // iterators
    
        double time_spent = 0.0;
        clock_t begin = clock();
    
        // Generate two 2D arrays to be filled with random numbers
        // and an array, c, with all 0s
        int * restrict a[SIZE_XY];
        int * restrict b[SIZE_XY];
        int * restrict c[SIZE_XY];
        for (i = 0; i < SIZE_XY; i++)
        {
            a[i] = (int *)malloc(SIZE_XY * sizeof(int));
            b[i] = (int *)malloc(SIZE_XY * sizeof(int));
            c[i] = (int *)malloc(SIZE_XY * sizeof(int));
        }
    
        #pragma acc kernels
        {
            for (i = 0; i < SIZE_XY; i++)
            {
                for (j = 0; j < SIZE_XY; j++)
                {
                    a[i][j] = 1; //(rand() % MAX_VAL) + MIN_VAL;
                    b[i][j] = 1; //(rand() % MAX_VAL) + MIN_VAL;
                    c[i][j] = 0;
                }
            }
        }
    
        printf("Array A allocated and filled with random numbers ...\n");
        printf("Array B allocated and filled with random numbers ...\n");
        printf("Array C initialized ...\n");
    
        // Dot product the two arrays together into c
        #pragma acc kernels //HERE
        {
            #pragma acc loop independent
            for (i = 0; i < SIZE_XY; i++)
            {
                for (j = 0; j < SIZE_XY; j++)
                {
                    for (k = 0; k < SIZE_XY; k++)
                    {
                        c[i][j] = c[i][j] + a[i][k] * b[k][j];
                    }
                }
            }
        }
    
        printf("Matrices multiplied ...\n");
        printf("The first three values of A x B are %d, %d, %d\n", c[0][0], c[0][1], c[0][2]);
    
        clock_t end = clock();
    
        time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
    
        printf("Time elpased is %f seconds", time_spent);
    }
    $ gcc -o t1 t1.c -std=c99
    $ pgcc -acc -ta=multicore -Minfo=all,accel t1.c -o t1p
    "t1.c", line 21: warning: use of a const variable in a constant expression is
              nonstandard in C
          int * restrict a[SIZE_XY];
                           ^
    
    "t1.c", line 22: warning: use of a const variable in a constant expression is
              nonstandard in C
          int * restrict b[SIZE_XY];
                           ^
    
    "t1.c", line 23: warning: use of a const variable in a constant expression is
              nonstandard in C
          int * restrict c[SIZE_XY];
                           ^
    
    "t1.c", line 11: warning: variable "MIN_VAL" was declared but never referenced
          const int MIN_VAL = 5000;
                    ^
    
    "t1.c", line 12: warning: variable "MAX_VAL" was declared but never referenced
          const int MAX_VAL = 7000000;
                    ^
    
    main:
         33, Loop is parallelizable
             Generating Multicore code
             33, #pragma acc loop gang
         35, Loop is parallelizable
         52, Loop is parallelizable
             Generating Multicore code
             52, #pragma acc loop gang
         54, Complex loop carried dependence of a->,c->,b-> prevents parallelization
         56, Complex loop carried dependence of a->,c->,b-> prevents parallelization
             Loop carried dependence of c-> prevents parallelization
             Loop carried backward dependence of c-> prevents vectorization
    $ time ./t1
    Array A allocated and filled with random numbers ...
    Array B allocated and filled with random numbers ...
    Array C initialized ...
    Matrices multiplied ...
    The first three values of A x B are 1000, 1000, 1000
    Time elpased is 9.010000 seconds
    real    0m9.079s
    user    0m9.019s
    sys     0m0.061s
    $ time ./t1p
    Array A allocated and filled with random numbers ...
    Array B allocated and filled with random numbers ...
    Array C initialized ...
    Matrices multiplied ...
    The first three values of A x B are 1000, 1000, 1000
    Time elpased is 20.140000 seconds
    real    0m0.563s
    user    0m20.053s
    sys     0m0.132s
    $
    

    On my machine, the code compiled with gcc takes about 9 seconds whereas the code compiled with the PGI OpenACC compiler takes about 0.5 seconds.

    As an aside, I personally would generally avoid the array allocation method you have chosen, because the various malloc operations are not guaranteed to result in adjacent/contiguous allocations. However for a multicore target the code works fine as-is.

    To address this concern, I'd recommend some changes to your code like this:

    $ cat t1.c
    #include <time.h>
    #include <stdlib.h>
    #include <stdio.h>
    
    typedef int mt;
    #define SIZE_XY 1000
    typedef mt mat[SIZE_XY];
    
    int main()
    {
        // seed the random number generator
        srand(42);
    
        // Pick some arbitrary constraints to make the problem harder
    
        int i, j, k; // iterators
    
        double time_spent = 0.0;
        clock_t begin = clock();
    
        // Generate two 2D arrays to be filled with random numbers
        // and an array, c, with all 0s
        mat * restrict a;
        mat * restrict b;
        mat * restrict c;
        a = (mat *)malloc(SIZE_XY*SIZE_XY * sizeof(mt));
        b = (mat *)malloc(SIZE_XY*SIZE_XY * sizeof(mt));
        c = (mat *)malloc(SIZE_XY*SIZE_XY * sizeof(mt));
    
        #pragma acc kernels
        {
            for (i = 0; i < SIZE_XY; i++)
            {
                for (j = 0; j < SIZE_XY; j++)
                {
                    a[i][j] = 1; //(rand() % MAX_VAL) + MIN_VAL;
                    b[i][j] = 1; //(rand() % MAX_VAL) + MIN_VAL;
                    c[i][j] = 0;
                }
            }
        }
    
        printf("Array A allocated and filled with random numbers ...\n");
        printf("Array B allocated and filled with random numbers ...\n");
        printf("Array C initialized ...\n");
    
        // Dot product the two arrays together into c
        #pragma acc kernels
        {
            for (i = 0; i < SIZE_XY; i++)
            {
                for (j = 0; j < SIZE_XY; j++)
                {
                    for (k = 0; k < SIZE_XY; k++)
                    {
                        c[i][j] = c[i][j] + a[i][k] * b[k][j];
                    }
                }
            }
        }
    
        printf("Matrices multiplied ...\n");
        printf("The first three values of A x B are %d, %d, %d\n", c[0][0], c[0][1], c[0][2]);
    
        clock_t end = clock();
    
        time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
    
        printf("Time elpased is %f seconds", time_spent);
    }
    $ gcc -o t1 t1.c -std=c99 -O3
    $ pgcc -acc -ta=multicore -Minfo=all,accel t1.c -o t1p
    main:
         32, Loop is parallelizable
             Generating Multicore code
             32, #pragma acc loop gang
         34, Loop is parallelizable
         51, Loop is parallelizable
             Generating Multicore code
             51, #pragma acc loop gang
         53, Loop is parallelizable
         55, Complex loop carried dependence of c-> prevents parallelization
             Loop carried dependence of c-> prevents parallelization
             Loop carried backward dependence of c-> prevents vectorization
    $ time ./t1
    Array A allocated and filled with random numbers ...
    Array B allocated and filled with random numbers ...
    Array C initialized ...
    Matrices multiplied ...
    The first three values of A x B are 1000, 1000, 1000
    Time elpased is 0.650000 seconds
    real    0m0.708s
    user    0m0.663s
    sys     0m0.047s
    $ time ./t1p
    Array A allocated and filled with random numbers ...
    Array B allocated and filled with random numbers ...
    Array C initialized ...
    Matrices multiplied ...
    The first three values of A x B are 1000, 1000, 1000
    Time elpased is 17.510000 seconds
    real    0m0.499s
    user    0m17.466s
    sys     0m0.093s
    $
    

    (gcc 4.8.5, pgcc 20.5-0, Xeon E5-2690 v2, 40 cores total)

    There are a few advantages:

    1. We can use the c99 restrict keyword to communicate our intent to the compiler, without use of an extra pragma
    2. This will be a contiguous allocation for a, for b, and for c, resulting in a simpler behavior if you decide to switch from multicore to an accelerator target
    3. The OpenACC compiler can now handle the first two loop nests without extra help.
    4. The gnu compiler likes this level of communication also. On my machine, the "ordinary" gnu compiler (gcc) emits code that runs almost as fast as the OpenACC code. (~0.7s vs. ~0.5s)