Tags: c++, caching, openmp

False-sharing and OpenMP


I have a question about my attempt to use different chunk sizes in OpenMP to avoid false sharing. I created two large input vectors and an output vector, and measured the time of the element-wise sum with two different chunk settings:

  1. Default: a plain #pragma omp parallel for with no schedule clause.
  2. Commented option: a chunk size chosen so that each thread should work on its own cache line for the write operation. std::hardware_destructive_interference_size is 64 bytes in my case.
#include <omp.h>
#include <vector>
#include <iostream>
#include <chrono>
#include <new>

int main(int argc, char const *argv[])
{
    std::size_t n = 10000000 * 64;
    std::vector<int32_t> a(n, 1);
    std::vector<int32_t> b(n, 1);
    std::vector<int32_t> c(n, 0);

    auto start = std::chrono::system_clock::now();

    // Option 2: chunk of one cache line (64 bytes / sizeof(int32_t) = 16 iterations)
    // #pragma omp parallel for schedule(static, std::hardware_destructive_interference_size / sizeof(int32_t))
    // Option 1: default schedule
#pragma omp parallel for
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        c[i] = a[i] + b[i];
    }
    auto end = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << elapsed.count() << std::endl;

    return 0;
}

However, the plain "#pragma omp parallel for" consistently runs about twice as fast as the second option. Why does that happen? Am I actually just breaking the read caching while trying to optimize the write operation? I saw the Use of OpenMP chunk to break cache answer, but it doesn't actually help me.
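To make concrete what the commented clause asks for, here is a minimal sketch (separate from the benchmark above; the tiny n and the printing are only for illustration) that records which thread executes each 16-iteration chunk under schedule(static, 16):

#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    const int n = 256; // 16 chunks of 16 iterations, purely for illustration
    std::vector<int> owner(n, -1);

    // schedule(static, 16): chunks of 16 consecutive iterations are handed
    // out to the threads round-robin, in order of thread number.
#pragma omp parallel for schedule(static, 16)
    for (int i = 0; i < n; ++i)
        owner[i] = omp_get_thread_num();

    // Print which thread executed each 16-iteration chunk.
    for (int i = 0; i < n; i += 16)
        std::printf("iterations [%d, %d) -> thread %d\n", i, i + 16, owner[i]);

    return 0;
}

With more chunks than threads, each thread ends up with many small blocks scattered across the arrays rather than one contiguous range, which is the distribution the commented line in my benchmark requests.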


Solution

  • Direct answer: I don't know why you see the performance difference.

    Also: false sharing is irrelevant in your code. False sharing is only a problem if there are many repeated accesses to the same cache line from different threads; your loop writes each element of c exactly once.

    Indirect answer: modern processors and compilers almost completely do away with false sharing. Let me give an example. Suppose I have an array a of length p, where p is the number of threads, and each thread t continually writes into a[t]. That's a textbook example of false sharing, is it not?

    The following code (computing pi by Riemann integration) runs the same computation twice: once with threads writing at distance 1, meaning false sharing, and once at distance 8, meaning a cache line apart:

      long int steps = steps_per_thread*nthreads;
      double inc=2./steps;
    
      // dilate=1: all per-thread accumulators packed into one cacheline (false sharing);
      // dilate=8: accumulators spaced one cacheline (8 doubles) apart.
      for ( int dilate : {static_cast<int>(1),8} ) {
        double global_area = 0., tstart,tend;
        tstart = omp_get_wtime();
        // one accumulator slot per thread, spaced dilate doubles apart
        vector<double> local_area( dilate*nthreads );
        for (int i=0; i<NEXPERIMENT; i++) {
    #pragma omp parallel
          {
            int mythread = omp_get_thread_num(),
              my_nsteps=steps/nthreads,
              my_first=mythread*my_nsteps,
              my_last=(mythread+1)*my_nsteps-1;
            for (int s=my_first; s<my_last; s++) {
              double x=-1.+s*inc;
              double a=inc*sqrt(1-x*x);
              local_area[dilate*mythread] += 2*a;
            }
          }
        }
        for (int i=0; i<nthreads; i++)
          global_area += local_area[dilate*i];
        global_area /= NEXPERIMENT;
        tend = omp_get_wtime();
        printf("Computed area %e in %e on %d threads\n",
               global_area,tend-tstart,nthreads);
      }
    

    Output on an Intel Skylake:

    Computed area 3.141592e+00 in 1.087189e-03 on 1 threads
    Computed area 3.141592e+00 in 1.096964e-03 on 1 threads
    Execute program false cores 2
    Computed area 3.141573e+00 in 1.116037e-03 on 2 threads
    Computed area 3.141573e+00 in 1.096964e-03 on 2 threads
    Execute program false cores 3
    Computed area 3.141567e+00 in 1.132011e-03 on 3 threads
    Computed area 3.141567e+00 in 1.098871e-03 on 3 threads
    Execute program false cores 4
    Computed area 3.141565e+00 in 1.426935e-03 on 4 threads
    Computed area 3.141565e+00 in 1.100063e-03 on 4 threads
    Execute program false cores 5
    Computed area 3.141564e+00 in 1.129150e-03 on 5 threads
    Computed area 3.141564e+00 in 1.103163e-03 on 5 threads
    Execute program false cores 6
    Computed area 3.141563e+00 in 1.462936e-03 on 6 threads
    Computed area 3.141563e+00 in 1.102924e-03 on 6 threads
    Execute program false cores 7
    Computed area 3.141563e+00 in 1.140118e-03 on 7 threads
    Computed area 3.141563e+00 in 1.104116e-03 on 7 threads
    Execute program false cores 8
    Computed area 3.141563e+00 in 1.369953e-03 on 8 threads
    Computed area 3.141563e+00 in 1.106024e-03 on 8 threads
    

    See? The second variant, which "solves" the false sharing problem, is barely faster, so either the compiler or the processor keeps the accumulators in temporaries and only resolves them to memory at the end.
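    For completeness, here is a minimal sketch (not the timed code above; the step count and names such as my_area are only illustrative) of making that temporary explicit: each thread accumulates into a private scalar and writes to the shared vector exactly once, so false sharing cannot arise no matter how the accumulator slots are laid out.

      #include <omp.h>
      #include <cmath>
      #include <cstdio>
      #include <vector>

      int main()
      {
        const long steps = 10000000;
        const double inc = 2.0/steps;
        const int nthreads = omp_get_max_threads();
        // one slot per thread, deliberately adjacent: the layout no longer matters
        std::vector<double> local_area(nthreads, 0.0);

      #pragma omp parallel
        {
          const int t = omp_get_thread_num();
          double my_area = 0.0;                  // private, register-resident accumulator
      #pragma omp for
          for (long s=0; s<steps; s++) {
            double x = -1.0 + s*inc;
            my_area += 2.0*inc*std::sqrt(1.0 - x*x);
          }
          local_area[t] = my_area;               // a single write per thread
        }

        double area = 0.0;
        for (int t=0; t<nthreads; t++)
          area += local_area[t];
        std::printf("Computed area %e on %d threads\n", area, nthreads);
        return 0;
      }

    In practice you would simply put reduction(+:area) on the loop, but spelling out the private accumulator shows what the compiler or processor is effectively doing for you in the timed loop above.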