c++, multithreading, openmp, false-sharing

OpenMP poor performance with arrays


I have the following issue: I am trying to parallelize a very simple PDE solver in C++ with OpenMP, but the performance does not improve when I increase the number of threads. The equation is a simple 1D heat equation with convection. Since I need the solution at every time step, I have decided to work with a 2D array

double solution[iterationsTime][numPoints];

Each row contains the discretized function at a specific time step. The update is done via a for loop

#pragma omp parallel default(shared) private(t, i, iBefore, iAfter)
{
  for(t=0; t<iterationsTime-1; t++) {
    #pragma omp for schedule(auto)
    for(i=0; i<numPoints; i++) {
      iBefore = (i==0)?numPoints-2:i-1;
      iAfter = (i==numPoints-1)?1:i+1;
      solution[t+1][i] = solution[t][iAfter] - solution[t][iBefore];
    }
  }
}

The values iBefore and iAfter are used because I am essentially treating the array as a ring buffer: the PDE has periodic boundary conditions, so the domain is treated as a ring. Anyway, each update of solution[t+1] requires some computations on solution[t], like the one shown in the code above. I understand that the most likely cause of the poor scalability is false sharing, so I have transformed the 2D matrix into a 3D matrix

double solution[iterationsTime][numPoints][PAD];

This allows me to ensure that no write operation is performed on a shared cache line as I can vary the size of PAD. The code changes a bit since now every value will be stored in

solution[t][i][0];

And the following one in

solution[t][i+1][0];

Note that the required memory is allocated on the heap, outside the parallel region (in the full code below I use std::vector rather than raw new). The code works very well, but it does not scale. I have tried different schedules such as static, dynamic, and auto. I compile it with

g++ code.cpp -fopenmp -march=native -O3 -o out

I have tried removing and adding the -march and -O3 flags, different sizes of PAD, and environment variables such as OMP_PROC_BIND, but I see no improvement. At this point I have no idea what is causing the loss in performance. This is the code

#include <iostream>
#include <vector>
#include <cmath>
#include <cstdlib>
#include <omp.h>

const int NX = 500; //DOMAIN DISCRETIZATION
const int PAD = 8; //PADDING TO AVOID FALSE SHARING
const double DX = 1.0/(NX-1.0); //STEP IN SPACE
const double DT = 0.01*DX; //STEP IN TIME
const int NT = 1000; //MAX TIME ITERATIONS
const double C = 10.0; //CONVECTION VELOCITY
const double K = 0.01; //DIFFUSION COEFFICIENT

int main(int argc, char **argv) {
  omp_set_num_threads(std::atoi(argv[1]));  //SET THE REQUIRED NUMBER OF THREADS

  //INITIALIZING MEMORY --> USING STD::VECTOR INSTEAD OF DOUBLE***
  std::vector<std::vector<std::vector<double>>> solution(NT, std::vector<std::vector<double>>(NX, std::vector<double>(PAD,0)));
  for (int i=0; i<NX; i++){
    solution[0][i][0] = std::sin(i*DX*2*M_PI); //INITIAL CONDITION
  }

  int numThreads, i, t, iBefore, iAfter;
  double energy[NT]{0.0}; //ENERGY of the solution --> e(t)= integral from 0 to 1 of ||u(x,t)||^2 dx

  //SOLVE THE PDE ON A RING
  double start = omp_get_wtime();
  #pragma omp parallel default(none) shared(solution, energy, numThreads, std::cout) private(i, t, iBefore, iAfter)
  {
    #pragma omp master
    numThreads = omp_get_num_threads();

    for(t=0; t<NT-1; t++){

      #pragma omp for schedule(static, 8) nowait
      for(i=0; i<NX; i++){
        iBefore = (i==0)?NX-2:i-1;
        iAfter = (i==NX-1)?1:i+1;
        solution[t+1][i][0]=solution[t][i][0] 
          + DT*(  -C*((solution[t][iAfter][0]-solution[t][iBefore][0])/(2*DX))
            + K*(solution[t][iAfter][0]-2*solution[t][i][0]+ solution[t][iBefore][0])/(DX*DX)  );
      }

      // COMPUTE THE ENERGY OF PREVIOUS TIME ITERATION
      #pragma omp for schedule(auto) reduction(+:energy[t])
      for(i=0; i<NX; i++) {
        energy[t] += DX*solution[t][i][0]*solution[t][i][0];
      }
    }
  }
  std::cout << "numThreads: " <<numThreads << ". Elapsed Time: "<<(omp_get_wtime()-start)*1000 << std::endl;
  return 0;
}

And the measured performance (elapsed time in ms)

numThreads: 1. Elapsed Time: 9.65456
numThreads: 2. Elapsed Time: 9.1855
numThreads: 3. Elapsed Time: 9.85965
numThreads: 4. Elapsed Time: 8.9077
numThreads: 5. Elapsed Time: 15.5986
numThreads: 6. Elapsed Time: 15.5627
numThreads: 7. Elapsed Time: 16.204
numThreads: 8. Elapsed Time: 17.5612

Solution

  • Analysis

    First of all, the granularity you are working at is too small for multi-threading to be efficient. Your sequential time is 9.6 ms and there are 999 time steps, so each time step takes roughly 9.6 µs, which is rather small.

    Additionally, the memory accesses are not efficient: a std::vector<std::vector<std::vector<double>>> stores every innermost vector in a separate heap allocation, so each read of solution[t][i][0] goes through several dependent pointer loads and the data of one time step is not contiguous in memory. On top of that, with PAD doubles per grid point only one value out of PAD is actually used, so most of every cache line that is fetched carries no useful data.
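
    To make the layout difference concrete, here is a minimal sketch (my own illustration, not part of the answer's code) contrasting the nested-vector access with the flattened layout used in the improved version below:

    #include <vector>

    int main() {
      const int NT = 1000, NX = 500, PAD = 8;  // same constants as in the question

      // Nested layout: solution3d[t][i][0] walks through several dependent
      // pointers, and NT*NX small allocations are scattered across the heap.
      std::vector<std::vector<std::vector<double>>> solution3d(
          NT, std::vector<std::vector<double>>(NX, std::vector<double>(PAD, 0.0)));

      // Flattened layout: one contiguous block, a single address computation
      // per access, and every fetched cache line is full of useful data.
      std::vector<double> solution(NT * NX, 0.0);

      solution3d[1][42][0] = solution3d[0][42][0];  // padded, indirect access
      solution[1*NX + 42]  = solution[0*NX + 42];   // contiguous access
      return 0;
    }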

    Finally, a chunk size of 8 in the schedule seems too small. Just specifying schedule(static) is probably better here, for both the parallel for and the reduction loop (I think the schedule should be the same, and static, for both loops when nowait is used, so that every thread gets the same iterations in both loops and the results stay correct; a short sketch of this pattern follows at the end of this analysis).

    Consequently, you are probably measuring latency and memory overheads.
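
    As a side note, the nowait + identical schedule(static) pattern relies on an OpenMP guarantee: two loops with the same static schedule and the same iteration count distribute iterations to threads in the same way, so the thread that wrote element i in the first loop is the one that reads it in the second. A minimal, self-contained sketch of that pattern (my own example, separate from the solver):

    #include <cstdio>
    #include <vector>
    #include <omp.h>

    int main() {
      const int n = 1000;
      std::vector<double> a(n);
      double sum = 0.0;

      #pragma omp parallel
      {
        // First loop: writes a[i]; nowait removes the barrier at its end.
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n; i++)
          a[i] = 2.0 * i;

        // Second loop: same static schedule, so iteration i runs on the same
        // thread as above and reading a[i] does not race with the writes.
        #pragma omp for schedule(static) reduction(+:sum)
        for (int i = 0; i < n; i++)
          sum += a[i];
      }

      std::printf("sum = %f\n", sum); // expected: 2*(0+1+...+999) = 999000
      return 0;
    }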

    Improved version

    Here is the corrected code with the most important fixes (false-sharing effects are ignored):

    #include <iostream>
    #include <vector>
    #include <cmath>
    #include <cstdlib>
    #include <omp.h>
    
    const int NX = 500; //DOMAIN DISCRETIZATION
    const int PAD = 8; //PADDING TO AVOID FALSE SHARING
    const double DX = 1.0/(NX-1.0); //STEP IN SPACE
    const double DT = 0.01*DX; //STEP IN TIME
    const int NT = 1000; //MAX TIME ITERATIONS
    const double C = 10.0; //CONVECTION VELOCITY
    const double K = 0.01; //DIFFUSION COEFFICIENT
    
    int main(int argc, char **argv) {
      omp_set_num_threads(std::atoi(argv[1]));  //SET THE REQUIRED NUMBER OF THREADS
    
      //INITIALIZING MEMORY --> USING A FLATTENED DOUBLE ARRAY
      std::vector<double> solution(NT * NX);
      for (int i=0; i<NX; i++){
        solution[0*NX+i] = std::sin(i*DX*2*M_PI); //INITIAL CONDITION
      }
    
      int numThreads, i, t, iBefore, iAfter;
      double energy[NT]{0.0}; //ENERGY of the solution --> e(t)= integral from 0 to 1 of ||u(x,t)||^2 dx
    
      //SOLVE THE PDE ON A RING
      double start = omp_get_wtime();
      #pragma omp parallel default(none) shared(solution, energy, numThreads, std::cout) private(i, t, iBefore, iAfter)
      {
        #pragma omp master
        numThreads = omp_get_num_threads();
    
        for(t=0; t<NT-1; t++){
    
          #pragma omp for schedule(static) nowait
          for(i=0; i<NX; i++){
            iBefore = (i==0)?NX-2:i-1;
            iAfter = (i==NX-1)?1:i+1;
            solution[(t+1)*NX+i]=solution[t*NX+i]
              + DT*(  -C*((solution[t*NX+iAfter]-solution[t*NX+iBefore])/(2*DX))
                + K*(solution[t*NX+iAfter]-2*solution[t*NX+i]+ solution[t*NX+iBefore])/(DX*DX)  );
          }
    
          // COMPUTE THE ENERGY OF PREVIOUS TIME ITERATION
          #pragma omp for schedule(static) reduction(+:energy[t])
          for(i=0; i<NX; i++) {
            energy[t] += DX*solution[t*NX+i]*solution[t*NX+i];
          }
        }
      }
      std::cout << "numThreads: " <<numThreads << ". Elapsed Time: "<<(omp_get_wtime()-start)*1000 << std::endl;
      return 0;
    }
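
    It is compiled and run the same way as the original code (the number of threads is passed as the first argument), for example:

    g++ code.cpp -fopenmp -march=native -O3 -o out
    ./out 6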
    

    Results

    On my machine with 6 cores (an Intel i5-9600KF), I get the following results.

    Before:

    1 thread : 3.35 ms
    2 threads: 2.90 ms
    3 threads: 2.89 ms
    4 threads: 2.83 ms
    5 threads: 3.07 ms
    6 threads: 2.90 ms
    

    After:

    1 thread : 1.62 ms
    2 threads: 1.03 ms
    3 threads: 0.87 ms
    4 threads: 0.95 ms
    5 threads: 1.00 ms
    6 threads: 1.16 ms
    

    With the new version, the sequential time is much faster and the code scales well up to 3 cores. Beyond that, the synchronization overheads become significant and slow the overall execution down (note that each time step lasts less than 1 µs here, which is very small).
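
    If you want to check how much of that per-step cost is pure synchronization, a small micro-benchmark of an empty per-step barrier (my own sketch, not part of the measurements above) can be compared with the ~1 µs step time:

    #include <cstdio>
    #include <cstdlib>
    #include <omp.h>

    int main(int argc, char **argv) {
      omp_set_num_threads(std::atoi(argv[1]));  // thread count as first argument
      const int steps = 100000;

      double start = omp_get_wtime();
      #pragma omp parallel
      {
        for (int t = 0; t < steps; t++) {
          #pragma omp barrier   // one synchronization per "time step"
        }
      }
      double perStep = (omp_get_wtime() - start) / steps;

      std::printf("barrier cost per step: %.3f us\n", perStep * 1e6);
      return 0;
    }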