I'm trying to implement Conway's Game of Life in C and show that it scales with OpenMP. The results in the picture below come from running it on one socket of a machine with 64 cores, equally divided into 4 NUMA regions. Performance clearly stalls at around 42 cores and then drops sharply as more cores are added. The script is run with OMP_PLACES=cores and OMP_PROC_BIND=close.
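To double-check the binding, a minimal snippet like the following can print where each thread ends up (a sketch; it assumes an OpenMP 4.5+ runtime for omp_get_place_num(), and report_binding is just an illustrative name):

#include <omp.h>
#include <stdio.h>

/* Debug helper: each thread reports the place (core) it is bound to,
 * so the effect of OMP_PLACES / OMP_PROC_BIND can be verified. */
void report_binding(void) {
#pragma omp parallel
  {
    printf("thread %d/%d bound to place %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads(),
           omp_get_place_num(), omp_get_num_places());
  }
}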
For this forum I made a simplified and lighter version of the code, which shows the same behaviour:
#include <stdio.h>
#include <stdlib.h>

/* is_alive(), UPDATE_CELL(), CPU_TIME and calculate_cache_padding()
 * are defined elsewhere in the full code and omitted here. */

#define NROWS 3000
#define NCOLS 3000
#define NUM_STEPS 50
#define PROB 0.5

double tend;
double tstart;
double ex_time;
/* Counts the live neighbours of the cell at `index` in a row-major map. */
int count_alive_neighbours_multi(const unsigned char *restrict map,
                                 const int ncols, const int index) {
int prev_row = index - ncols;
int next_row = index + ncols;
int count0 = is_alive(&map[index - 1]) + is_alive(&map[index + 1]);
int count1 = is_alive(&map[prev_row]) + is_alive(&map[prev_row - 1]) +
is_alive(&map[prev_row + 1]);
int count2 = is_alive(&map[next_row]) + is_alive(&map[next_row - 1]) +
is_alive(&map[next_row + 1]);
return count0 + count1 + count2;
}
void smaller_static_evolution(int num_steps, unsigned char *restrict current,
int ncols, int nrows) {
  /* Performs num_steps update steps of the map. */
  /* Create a copy of the MPI-local map (current), but initialize it with
   * OpenMP so its pages are first-touched by the threads that will use them. */
printf("Running smaller_static_evolution with \nnum_steps: %d\n",
num_steps);
printf("ncols: %d\n", ncols);
printf("nrows: %d\n", nrows);
unsigned char *sub_map = NULL;
#pragma omp parallel
{
#pragma omp single
{
sub_map =
(unsigned char *)malloc(ncols * nrows * sizeof(unsigned char));
if (sub_map == NULL) {
printf("Could not allocate memory for local maps\n");
exit(1);
}
}
int i = 0;
/* Initialize local sub map with OpenMP, to warm up the data properly*/
#pragma omp for schedule(static, 4)
for (int i = 0; i < nrows * ncols; i++) {
sub_map[i] = current[i];
}
for (int iter = 0; iter < num_steps; iter++) {
#pragma omp for schedule(static, 4) private(i)
for (int row = 1; row < nrows - 1;
row++) { /* Split work with OpenMP over rows */
for (int col = 1; col < ncols - 1; col++) {
i = row * ncols + col;
int alive_counter =
count_alive_neighbours_multi(sub_map, ncols, i);
sub_map[i] += UPDATE_CELL(alive_counter);
}
}
}
/* Copy sub_map into current and free sub_map */
#pragma omp for schedule(static, 4)
for (int i = 0; i < nrows * ncols; i++) {
current[i] = sub_map[i];
}
}
free(sub_map);
}
int main(int argc, char **argv) {
int nrows = NROWS;
int ncols = NCOLS + calculate_cache_padding(NCOLS);
unsigned char *current =
(unsigned char *)malloc(nrows * ncols * sizeof(unsigned char));
/* Randomly initialize matrix */
for (int i = 0; i < nrows * ncols; i++) {
float random = (float)rand() / RAND_MAX;
if (random < PROB)
current[i] = 1;
else
current[i] = 0;
}
tstart = CPU_TIME;
smaller_static_evolution(NUM_STEPS, current, ncols, nrows);
free(current);
tend = CPU_TIME;
ex_time = tend - tstart;
printf("\n\n%f\n\n", ex_time);
}
The function calculate_cache_padding() returns the number of extra columns needed to make the total number of columns a multiple of the cache line size (64 bytes). Since the loop over rows is split between the OpenMP threads, I'm not expecting this to produce false sharing.
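In case it matters, the padding computation is essentially this (a sketch of the idea, assuming one byte per cell and a 64-byte cache line, not the exact code from the full program):

/* Extra columns needed so that one row of 1-byte cells fills
 * a whole number of 64-byte cache lines (illustrative sketch). */
int calculate_cache_padding(int ncols) {
  const int cache_line = 64;
  int remainder = ncols % cache_line;
  return remainder == 0 ? 0 : cache_line - remainder;
}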
Is there anything that I'm clearly doing wrong? Does the plot suggest anything obvious or do you have any ideas on how I can debug this?
First: you should allocate sub_map only once per program run. Done under #pragma omp single, every other thread waits at the implicit barrier while one thread calls malloc, so it's a sequential bottleneck.
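A sketch of that change (the first-touch copy loop stays inside the parallel region, so pages still end up near the threads that use them):

/* Allocate once, before the parallel region; no thread has to wait
 * behind the single construct while malloc runs. */
unsigned char *sub_map = malloc((size_t)nrows * ncols);
if (sub_map == NULL) {
  printf("Could not allocate memory for local maps\n");
  exit(1);
}

#pragma omp parallel
{
#pragma omp for
  for (int i = 0; i < nrows * ncols; i++)
    sub_map[i] = current[i]; /* first touch by the owning thread */

  /* ... time-step loops as before ... */
}

free(sub_map);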
Then: do not use a chunk size in the static schedule. It makes the cores interleave the data they access, which may conflict with the close binding. Also, your chunk size is much smaller than a cache line, so you probably get false sharing and other conflicts.
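For example, for the update loop (a sketch; only the schedule clause changes, the body is as in the question):

/* Let the runtime hand each thread one large contiguous block of rows;
 * with close binding, neighbouring rows then stay on the same NUMA node. */
#pragma omp for schedule(static)
for (int row = 1; row < nrows - 1; row++) {
  for (int col = 1; col < ncols - 1; col++) {
    int i = row * ncols + col;
    int alive_counter = count_alive_neighbours_multi(sub_map, ncols, i);
    sub_map[i] += UPDATE_CELL(alive_counter);
  }
}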