I want to use the inclusive scan operation in OpenMP to implement an algorithm. What follows is a description of my attempt at doing so, which fails to get more than a tepid speedup.
The inclusive scan operation is defined as follows: for an input vector, say [x1, x2, x3, x4],
it outputs the sequence of partial sums [x1, x1+x2, x1+x2+x3, x1+x2+x3+x4].
This is a highly parallelizable operation, and ostensibly it has been implemented well in OpenMP.
I looked at the following reference: https://theartofhpc.com/pcse/omp-reduction.html#Scanprefixoperations (The manual reference https://www.openmp.org/spec-html/5.0/openmpsu45.html#x68-1940002.9.6 seems too cryptic to me right now)
The artofhpc website says that the reduction clause gets an inscan modifier:
#pragma omp parallel for reduction(inscan,+:sumvar)
In the body of the parallel loop there is a scan directive that allows you to store the partial results. For inclusive scans the reduction variable is updated before the scan pragma:
sumvar // update
#pragma omp scan inclusive(sumvar)
partials[i] = sumvar
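Put together, the recipe is a loop of this shape (a minimal sketch, where N, x and partials stand in for whatever arrays you use; it is the same pattern as in my full code below):

    double sumvar = 0.0;
    #pragma omp parallel for reduction(inscan,+:sumvar)
    for (int i = 0; i < N; ++i) {
        sumvar += x[i];                    // update the reduction variable first
        #pragma omp scan inclusive(sumvar) // then expose it to the scan
        partials[i] = sumvar;              // store the inclusive partial sum
    }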
I tried to follow the same syntax in order to measure the performance against a standard serial scan, and the results were quite disappointing. My code is at the bottom of the post.
In the code, I am doing a simple test: I take a very large vector of 90 million random values in the interval [-1,1], perform a scan on it with an increasing number of threads, and measure the speedup. Here are my results (I get the same numbers on repeated runs). My laptop CPU has 16 hardware threads, but the overall speedup is a disappointing 1.36. (I would have expected something substantially more!)
Is there something wrong in the way I am using the OpenMP syntax for reduction?
$ gcc -fopenmp scantest.c && ./a.out
NumThreads Speedup
1 0.458
2 1.173
3 1.424
4 1.686
5 1.635
6 1.501
7 1.522
8 1.499
9 1.455
10 1.416
11 1.395
12 1.393
13 1.352
14 1.336
15 1.353
16 1.357
#include <stdio.h>
#include <omp.h>
#include <math.h>
#include <stdlib.h>
#include <assert.h>

int main(int argc, char** argv){
    int N = 9e7;          // vector size
    double* x;            // vector to reduce
    double* partials_s;   // vector to scan into
    double* partials_p;   // vector to scan into
    double end, start;    // timing variables
    double sumvar;
    int tmax = argc>1 ? atoi(argv[1]) : 35;
    int threadcount;

    // Allocate space for all vectors
    x = (double*) malloc(sizeof(double)*N);
    partials_s = (double*) malloc(sizeof(double)*N);
    partials_p = (double*) malloc(sizeof(double)*N);

    // Populate the input vectors
    for(int i=0 ; i<N ; ++i){
        x[i] = -1+2.0*rand()/(double)RAND_MAX;
        partials_s[i] = 0.0;
        partials_p[i] = 0.0;
    }

    //-----------------------------------------
    // Serial inclusive scan
    //-----------------------------------------
    start = omp_get_wtime();
    sumvar = 0.0;
    for(int i=0 ; i<N ; ++i){
        sumvar += x[i];
        partials_s[i] = sumvar;
    }
    end = omp_get_wtime();
    double stime = end-start; // Time to run the serial code

    //-----------------------------------------------------------------------------
    // Performance of parallel inclusive scan. Print ratio of serial/parallel time
    //-----------------------------------------------------------------------------
    printf("\nNumThreads Speedup \n");
    for(threadcount=1 ; threadcount<=tmax ; ++threadcount){
        start = omp_get_wtime();
        sumvar = 0.0;
        #pragma omp parallel for num_threads(threadcount) reduction(inscan,+:sumvar)
        for(int i=0 ; i<N ; ++i){
            sumvar = sumvar + x[i]; // updating the value of sumvar
            #pragma omp scan inclusive(sumvar)
            partials_p[i] = sumvar;
        }
        end = omp_get_wtime();
        double ptime = end-start;
        printf("%d \t %.3f\n",threadcount,stime/ptime);
    }

    //for(int i=0 ; i<N ; ++i){
    //    printf("%.4f %.4f\n", partials_s[i], partials_p[i]);
    //}

    // Deallocate
    free(x);
    free(partials_s);
    free(partials_p);
    return 0;
}
TL;DR: This does not scale because such a scan operation is simply memory bound. More specifically, it is bound by the bandwidth of the DRAM.
On mainstream PCs, only a few cores are enough to saturate the DRAM bandwidth. Sometimes a single core can do it (typically on a PC with a fast CPU and single-channel DRAM). The sequential scan is already a memory-bound operation, and on my machine (i5-9600KF CPU with 2 x 3200 MHz DDR4 DRAM) it is not far from saturating the DRAM: it reaches 24 GiB/s of throughput. My DRAM can theoretically reach 48 GiB/s, but only for pure reads; in practice it is more like 40-42 GiB/s for reads and 36-38 GiB/s for mixed reads/writes. This means the sequential code already saturates ~65% of my DRAM bandwidth!
With the parallel code, I reach 20 GiB/s with 1 thread, 30 GiB/s with 2 threads, 35 GiB/s with 3 threads, 36 GiB/s with 4 threads, and 35 GiB/s with 6 threads (the maximum useful, since I only have 6 cores). As you can see, only 3 threads are enough to nearly saturate the bandwidth on my machine, and 2 threads already get ~80% of the way there.
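For reference, here is roughly how such GiB/s figures can be derived from the timings your program already measures (a sketch; print_bandwidth is a name I made up, and the 16 bytes/element is only the minimum traffic, since write allocation can add a hidden 8-byte read per element):

    #include <stdio.h>

    // Turn a measured scan time into an effective bandwidth figure.
    // Each scanned element moves at least 16 bytes: 8 read from x[]
    // and 8 written to the output array.
    static void print_bandwidth(long n, double seconds) {
        double bytes = (double)n * 16.0;   // minimum DRAM traffic
        double gib_s = bytes / seconds / (1024.0 * 1024.0 * 1024.0);
        printf("effective bandwidth: %.1f GiB/s\n", gib_s);
    }

    // Theoretical peak of my DRAM for comparison:
    //   2 channels * 3200 MT/s * 8 bytes = 51.2 GB/s ~= 47.7 GiB/s
    // Usage (inside your timing loop): print_bandwidth(N, ptime);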
In fact, this is rather consistent with the speedup printed by your program:
NumThreads Speedup
1 0.456
2 0.686
3 0.765 <---- optimal speed-up for the parallel code
4 0.749
5 0.700
6 0.666
With more than 3 threads, performance actually decreases. This is probably because the DRAM accesses (and the L3 cache accesses) look more random from the CPU's perspective, and random accesses are a bit less efficient (harder to prefetch). Moreover, cache thrashing can cause more data to be loaded from DRAM, so efficiency decreases while the throughput stays roughly stable.
There is another interesting thing: the parallel implementation is actually slower than the sequential one on my machine, even with 6 threads! This is because the parallel implementation needs to perform more work. Indeed, AFAIK, a parallel scan requires multiple passes over memory: for example, one to compute partial reductions and one to compute the actual scan block by block, using the partial sums computed before. With such an implementation, more data must be read from memory (twice as much), making the operation slower if memory is already the bottleneck. There are other possible implementations, but all the ones I can think of require reading/writing significantly more data from/to DRAM in this loop (with the default static scheduling).
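To make the extra work concrete, here is a minimal sketch of the kind of two-pass scheme I am describing (illustrative only, not what GOMP actually generates; two_pass_scan, block_sums, etc. are names I made up):

    #include <omp.h>

    // Pass 1: each thread reduces its own contiguous block  -> reads x[] once.
    // Pass 2: each thread scans its block, offset by the sum of the previous
    //         blocks                                         -> reads x[] again, writes the output once.
    void two_pass_scan(const double* x, double* partials, int n, int nthreads) {
        double block_sums[256] = {0};   // assumes nthreads <= 256 (ignoring false sharing for clarity)
        #pragma omp parallel num_threads(nthreads)
        {
            int t = omp_get_thread_num();
            int chunk = (n + nthreads - 1) / nthreads;
            int lo = t * chunk;
            int hi = lo + chunk < n ? lo + chunk : n;

            // Pass 1: per-block reduction
            double s = 0.0;
            for (int i = lo; i < hi; ++i) s += x[i];
            block_sums[t] = s;
            #pragma omp barrier

            // Offset = sum of all previous blocks (cheap, recomputed by every thread)
            double run = 0.0;
            for (int b = 0; b < t; ++b) run += block_sums[b];

            // Pass 2: local scan, shifted by the offset
            for (int i = lo; i < hi; ++i) {
                run += x[i];
                partials[i] = run;
            }
        }
    }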
Note that I used GCC with the flags -fopenmp -O3
to compile the profiled program (-mavx2
did not impact the profiling results). In practice, GOMP (the OpenMP implementation of GCC used to profile this on my machine) apparently performs 2 read/write passes, which is a bit less efficient than the scheme mentioned above. I saw this by profiling the generated assembly code: there are two hot memory-bound loops in the parallel code, and both seem to perform a scan-like operation (most likely a local scan followed by a scan update). Such a 2 read/write-pass implementation should theoretically be about twice as slow. This is consistent with the measured speedups: the OpenMP version is about twice as slow with 1 thread (both on your machine and mine).
There are 2 main optimizations we can perform to make this faster. However, they are not trivial to apply and have significant downsides.
First of all, x86-64 CPUs actually read a cache line from DRAM before writing to it, due to cache-line write allocation. You can use non-temporal streaming stores to avoid this. This is only worth it if you are sure the written data cannot fit in cache. In general, developers cannot guarantee such a thing unless they know the target CPU architecture or make (sometimes reasonable) assumptions about it. For example, they can assume the LLC is no bigger than a few hundred MiB (which is not always true: some CPUs, like the Xeon Phi, have a lot of LLC, and future ones might have such big LLC caches too). OpenMP can theoretically generate non-temporal streaming store instructions with the nontemporal
clause. That being said, AFAIK it is not yet implemented in mainstream OpenMP implementations at the time of writing. This optimization can make the sequential/parallel codes about 50% faster.
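Since the nontemporal clause is not available yet, here is a sketch of what streaming stores look like with x86 intrinsics on the sequential scan (assumptions: AVX support and -mavx, an output buffer allocated 32-byte aligned, e.g. with aligned_alloc(32, ...), and n a multiple of 4; scan_stream is a name I made up):

    #include <immintrin.h>

    void scan_stream(const double* x, double* partials, long n) {
        double sum = 0.0;
        for (long i = 0; i < n; i += 4) {
            // Compute 4 consecutive partial sums in scalar registers.
            double s0 = sum + x[i];
            double s1 = s0 + x[i + 1];
            double s2 = s1 + x[i + 2];
            double s3 = s2 + x[i + 3];
            sum = s3;
            // Streaming store: bypasses the cache, so the CPU does not need
            // to read the destination cache line before writing it.
            _mm256_stream_pd(&partials[i], _mm256_set_pd(s3, s2, s1, s0));
        }
        _mm_sfence(); // make the streaming stores globally visible
    }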
Another way to make this faster is to write your own more efficient, cache-friendly parallel implementation. You can read the data chunk by chunk (with chunks small enough to fit in the LLC). Once a chunk has been read from memory, you can compute the partial sums of its sub-chunks in parallel and then compute the scan more efficiently (since the data should already be in the LLC and does not need to be reloaded). This should strongly improve the efficiency of the parallel implementation, making it much more competitive with the sequential one (though still a bit slower with 1 thread). A sketch of this chunked approach is given below. Note that the additional synchronizations can decrease the performance of this implementation. Besides, NUMA effects can make things more complicated on machines with multiple NUMA nodes. Ideally, you should not have to do such an optimization yourself; I think it should be the job of the OpenMP implementation...
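Here is a sketch of that cache-blocked approach (illustrative only; chunked_scan is a name I made up, and CHUNK is a guess that should fit comfortably in the LLC of the target CPU):

    #include <omp.h>

    #define CHUNK (1L << 20)   // 1M doubles = 8 MiB per chunk (assumed to fit in the LLC)

    void chunked_scan(const double* x, double* partials, long n) {
        double carry = 0.0;    // running sum carried across chunks
        for (long base = 0; base < n; base += CHUNK) {
            long len = (n - base < CHUNK) ? (n - base) : CHUNK;
            double sub_sums[256] = {0};   // assumes <= 256 threads

            #pragma omp parallel
            {
                int t  = omp_get_thread_num();
                int nt = omp_get_num_threads();
                long sub = (len + nt - 1) / nt;
                long lo = t * sub;
                long hi = lo + sub < len ? lo + sub : len;

                // Pass 1: per-thread partial sum of its sub-chunk; this also
                // pulls the chunk into the shared LLC as a side effect.
                double s = 0.0;
                for (long i = lo; i < hi; ++i) s += x[base + i];
                sub_sums[t] = s;
                #pragma omp barrier

                // Pass 2: local scan of the sub-chunk; x[] should now be
                // served from the LLC rather than from DRAM.
                double run = carry;
                for (int b = 0; b < t; ++b) run += sub_sums[b];
                for (long i = lo; i < hi; ++i) {
                    run += x[base + i];
                    partials[base + i] = run;
                }
            }

            // Advance the carry by the whole chunk's sum (serial, cheap).
            for (int b = 0; b < 256; ++b) carry += sub_sums[b];
        }
    }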
The two optimizations can be combined: the streaming stores should reduce cache thrashing and slightly improve the performance of the second optimization. On my machine, I expect the two optimizations together to make the parallel implementation about twice as fast as the provided sequential one. It still will not scale, but there is not much more you can do about that: the operation is memory-bound in the first place, and memory-bound codes generally do not scale.