I have a question about my attempt to use different chunk sizes in OpenMP to avoid false sharing. I created two large vectors and measured the time of an element-wise sum with two different chunk sizes.
#include <omp.h>
#include <vector>
#include <iostream>
#include <chrono>
#include <cstdint>
#include <new>

int main(int argc, char const *argv[])
{
    std::size_t n = 10000000 * 64;
    std::vector<int32_t> a(n, 1);
    std::vector<int32_t> b(n, 1);
    std::vector<int32_t> c(n, 0);

    auto start = std::chrono::system_clock::now();
    // Second option: chunk size of one cache line's worth of elements
    // #pragma omp parallel for schedule(static, std::hardware_destructive_interference_size / sizeof(int32_t))
    #pragma omp parallel for
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        c[i] = a[i] + b[i];
    }
    auto end = std::chrono::system_clock::now();

    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << elapsed.count() << std::endl;
    return 0;
}
However, the plain "#pragma omp parallel for" is always about twice as fast as the second option. Why does that happen? Am I just hurting the reads from a and b while trying to optimize the writes to c? I saw the "Use of OpenMP chunk to break cache" answer, but it doesn't actually help me.
Direct answer: I don't know why you see the performance difference.
Also: false sharing is irrelevant in your code. False sharing is only a problem if there are many repeated accesses to the same cache line from different threads.
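To make "irrelevant" concrete, here is a quick back-of-the-envelope sketch (the 8-thread count and the 64-byte line size are my assumptions): with the default schedule(static) each thread writes one contiguous block of c exactly once, so only the cache line straddling each block boundary is ever touched by two threads at all.
#include <cstdio>
#include <cstdint>
#include <cstddef>
int main()
{
    const std::size_t n = 10000000ull * 64;             // size from the question
    const int T = 8;                                     // assumed thread count
    const std::size_t per_line = 64 / sizeof(int32_t);   // int32_t elements per 64-byte cache line
    const std::size_t total_lines = n / per_line;        // cache lines spanned by c
    // at most T-1 boundary lines are written by two threads, and each only once per thread
    std::printf("lines touched by two threads: at most %d of %zu (%.7f%%)\n",
                T - 1, total_lines, 100.0 * (T - 1) / total_lines);
    return 0;
}
Each of those boundary lines is written once by each of the two threads involved, so there is no repeated ping-pong to speak of.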
Indirect answer: modern processors and compilers almost completely do away with false sharing. Let me give an example. Suppose I have an array a of length p, where p is the number of threads, and each thread t continually writes to a[t]. That's a textbook example of false sharing, no?
The following code (computing pi by Riemann integration) runs twice: once with threads writing at distance 1, meaning false sharing, and once at distance 8, meaning a cache line apart:
// steps_per_thread, nthreads and NEXPERIMENT are set by the surrounding test
// harness, which is omitted here.
long int steps = steps_per_thread*nthreads;
double inc = 2./steps;
for ( int dilate : {static_cast<int>(1),8} ) {
  double global_area = 0., tstart, tend;
  tstart = omp_get_wtime();
  // one accumulator per thread, spaced `dilate` doubles apart:
  // dilate=1 puts them in the same cache line, dilate=8 a full line apart
  vector<double> local_area( dilate*nthreads );
  for (int i=0; i<NEXPERIMENT; i++) {
#pragma omp parallel
    {
      int mythread = omp_get_thread_num(),
        my_nsteps = steps/nthreads,
        my_first = mythread*my_nsteps,
        my_last = (mythread+1)*my_nsteps-1;
      for (int s=my_first; s<my_last; s++) {
        double x = -1.+s*inc;
        double a = inc*sqrt(1-x*x);
        // every iteration updates this thread's slot of the shared vector
        local_area[dilate*mythread] += 2*a;
      }
    }
  }
  for (int i=0; i<nthreads; i++)
    global_area += local_area[dilate*i];
  global_area /= NEXPERIMENT;
  tend = omp_get_wtime();
  printf("Computed area %e in %e on %d threads\n",
         global_area,tend-tstart,nthreads);
}
Output on an Intel Skylake:
Computed area 3.141592e+00 in 1.087189e-03 on 1 threads
Computed area 3.141592e+00 in 1.096964e-03 on 1 threads
Execute program false cores 2
Computed area 3.141573e+00 in 1.116037e-03 on 2 threads
Computed area 3.141573e+00 in 1.096964e-03 on 2 threads
Execute program false cores 3
Computed area 3.141567e+00 in 1.132011e-03 on 3 threads
Computed area 3.141567e+00 in 1.098871e-03 on 3 threads
Execute program false cores 4
Computed area 3.141565e+00 in 1.426935e-03 on 4 threads
Computed area 3.141565e+00 in 1.100063e-03 on 4 threads
Execute program false cores 5
Computed area 3.141564e+00 in 1.129150e-03 on 5 threads
Computed area 3.141564e+00 in 1.103163e-03 on 5 threads
Execute program false cores 6
Computed area 3.141563e+00 in 1.462936e-03 on 6 threads
Computed area 3.141563e+00 in 1.102924e-03 on 6 threads
Execute program false cores 7
Computed area 3.141563e+00 in 1.140118e-03 on 7 threads
Computed area 3.141563e+00 in 1.104116e-03 on 7 threads
Execute program false cores 8
Computed area 3.141563e+00 in 1.369953e-03 on 8 threads
Computed area 3.141563e+00 in 1.106024e-03 on 8 threads
See? Either the compiler or the processor keeps the accumulators in temporaries and only writes them back at the end: the second variant, which "solves" the false sharing problem, is barely faster than the first.
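If you want that behaviour guaranteed rather than left to the optimizer, you can make the temporary explicit. Here is a self-contained sketch of the same pi computation (the step and repetition counts are placeholders I picked, not the numbers behind the timings above) where each thread accumulates into a private scalar and writes the shared vector only once, so false sharing cannot occur no matter what the compiler does:
#include <omp.h>
#include <cmath>
#include <cstdio>
#include <vector>
int main()
{
    const int NEXPERIMENT = 100;               // assumed repetition count
    const long steps_per_thread = 1000000;     // assumed work per thread
    const int nthreads = omp_get_max_threads();
    const long steps = steps_per_thread * nthreads;
    const double inc = 2. / steps;

    std::vector<double> local_area(nthreads, 0.);
    double tstart = omp_get_wtime();
    for (int i = 0; i < NEXPERIMENT; i++) {
#pragma omp parallel
        {
            int mythread = omp_get_thread_num();
            long my_first = mythread * steps_per_thread,
                 my_last = (mythread + 1) * steps_per_thread;
            double my_area = 0.;               // private accumulator: no shared cache line inside the loop
            for (long s = my_first; s < my_last; s++) {
                double x = -1. + s * inc;
                my_area += 2 * inc * std::sqrt(1 - x * x);
            }
            local_area[mythread] += my_area;   // one write to shared memory per thread per experiment
        }
    }
    double global_area = 0.;
    for (int t = 0; t < nthreads; t++)
        global_area += local_area[t];
    global_area /= NEXPERIMENT;
    double tend = omp_get_wtime();
    std::printf("Computed area %e in %e on %d threads\n",
                global_area, tend - tstart, nthreads);
    return 0;
}
This is just the explicit version of what the output above suggests the compiler and processor are already doing for you.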