c++, memory, parallel-processing, autocorrelation

Overhead in parallel tasks in C++


I have the following function:

#include <iostream>
#include <vector>
#include <cmath>

const int numRows = 10000;
const int numCols = 1000;

void myFunc(int tau, const std::vector<std::vector<double>>& X, const std::vector<std::vector<double>>& P, std::vector<double>& PAutocorrelation) {

    for (int t = 0; t < numRows - tau; ++t) {
        for (int i = 0; i < numCols; ++i) {
            for (int j = 0; j < numCols; ++j) {
                int diff = static_cast<int>(std::abs(X[t][i] - X[t + tau][j]));
                PAutocorrelation[diff] += P[t][i] * P[t + tau][j];
            }
        }
    }
}

The function performs three nested loops over the dimensions 10000, 1000 and 1000. If I call this function once, it works fine and the runtime is acceptable. But I would like to call it 64 times in parallel, since I have 64 CPU cores on my node. So I wrote the following code to do this (a continuation of the above):

int main() {

    std::vector<std::vector<double>> X(numRows, std::vector<double>(numCols, 1.0));
    std::vector<std::vector<double>> P(numRows, std::vector<double>(numCols, 1.0));
    std::vector<std::vector<double>> PAutocorrelations(63, std::vector<double>(11, 0.0));

    #pragma omp parallel for
    for (int tau = 1; tau < 64; ++tau) {
        myFunc(tau, X, P, PAutocorrelations[tau - 1]);
    }

    return 0;
}

The problem is that when I look at the CPU load in htop during execution, I can see that at the beginning all 64 CPUs are loaded to 100% (green), but after that (and for most of the time) only one CPU is loaded.

I think this is a memory problem, but I have a hard time understanding this, as the code does not involve copying any variables. So once X and P are allocated, I do not see the need for more memory. Could you indicate where the problem might be?

I have opened another question about basically the same problem in Python here: https://stackoverflow.com/questions/77002553/overhead-in-embarrassingly-parallel-workload

Below I show the result of perf stat -d ./main, which I am still trying to interpret. I see that backend cycles idle accounts for the largest part (why?). When I compare perf record -e cache-misses ./main for a single run of the function and for the 64 parallel runs, I see 400 cache-miss events for the single run and 3 million cache-miss events for the parallel runs.

[screenshot: output of perf stat -d ./main]

And below is the output of perf report after perf record -e cache-misses ./main. Here I can see 24% of the cache misses attributed to std::allocator<double>, which I am also not sure how to interpret.

[screenshot: output of perf report]


Solution

  • I can see that at the beginning all 64 CPUs are loaded to 100% (green), but after that (and for most of the time) only one CPU is loaded.

    This is because of several combined effects.

    The first is a slight workload imbalance. Indeed, the first thread has more work than the last because of the outer loop of myFunc: the first thread iterates from t=0 to numRows-tau = 10000-1 = 9999, while the last iterates from t=0 to numRows-tau = 10000-63 = 9937. This is not a big difference, but it is enough for some threads to finish about a second before others after a few minutes of computation.

    The second is false sharing. Each thread writes to its own 11-item vector, but such a vector takes only 88 bytes, so it spans two 64-byte cache lines, and another thread's vector can be allocated on the same cache line (heap allocations are generally only 16-byte aligned). In practice, the effect may or may not happen depending on the target platform (it depends on the standard library); a padding sketch is shown a bit further below.

    The third is the OS, the memory and the processor itself. The performance of the memory subsystem is quite unstable, and this cannot be avoided as long as you access memory. Here, the computation is mostly compute-bound (at least clearly on my machine), so the impact of memory is not significant [1]. However, the OS can interrupt your threads for a small fraction of time and re-schedule them (causing some cache misses due to the context switches). This is another source of performance instability between threads. Last but not least, there is a data-dependent read/store (on PAutocorrelation[diff]) which can make the computation time quite different between threads. This is, I think, the major issue of this code.

    All this timing instability can cause some threads to finish before others. When a thread has completed its work, it waits for the others. This (required) final synchronization is typically the red part you can see in htop. Since there are 63 iterations to compute, only 63 threads can run (out of 64), and threads cannot steal work from other threads in the current code.
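    Regarding the false-sharing point above, here is a minimal mitigation sketch (an illustration, not the code of the question): each per-tau accumulator is padded and aligned so that two accumulators can never share a cache line. The PaddedResult name and the 16-double padding are assumptions for illustration, and the over-alignment of vector elements relies on C++17.

    #include <array>
    #include <vector>

    struct alignas(64) PaddedResult {       // each accumulator starts on its own cache line
        std::array<double, 16> bins{};      // 16 doubles = 128 bytes = two full cache lines
    };

    // One padded accumulator per tau instead of std::vector<double>(11, 0.0):
    std::vector<PaddedResult> PAutocorrelations(63);

    myFunc would then take a PaddedResult& (or a plain double*) instead of a std::vector<double>&.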

    One could try to load-balance the work using tasks, but I expect the overhead of tasks to be higher than the workload imbalance.
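    For reference, here is a hedged sketch of what a finer-grained split could look like, using schedule(dynamic) over blocks rather than explicit tasks. myFuncBlock is a hypothetical variant of myFunc restricted to a sub-range [tBegin, tEnd) of the outer loop, and the number of blocks per tau is an arbitrary assumption; each block accumulates into its own partial array, which is reduced sequentially afterwards.

    const int blocks = 8;  // work items per tau (assumption): 63*8 chunks to balance over 64 threads
    std::vector<std::vector<double>> partial(63 * blocks, std::vector<double>(11, 0.0));

    #pragma omp parallel for collapse(2) schedule(dynamic)
    for (int tau = 1; tau < 64; ++tau) {
        for (int b = 0; b < blocks; ++b) {
            const int len    = numRows - tau;
            const int tBegin =  b      * len / blocks;
            const int tEnd   = (b + 1) * len / blocks;
            // Hypothetical variant of myFunc that only processes t in [tBegin, tEnd)
            myFuncBlock(tau, tBegin, tEnd, X, P, partial[(tau - 1) * blocks + b]);
        }
    }

    // Sequential reduction of the partial results into the final arrays
    for (int tau = 1; tau < 64; ++tau)
        for (int b = 0; b < blocks; ++b)
            for (int d = 0; d < 11; ++d)
                PAutocorrelations[tau - 1][d] += partial[(tau - 1) * blocks + b][d];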

    I think this is a memory problem, but I have a hard time understanding this, as the code does not involve copying any variables. So once X and P are allocated, I do not see the need for more memory. Could you indicate where the problem might be?

    I see no memory issue in the code, and I also see none when I run your program on my (Linux) machine. Variables are not expected to be copied, since they are shared by default in OpenMP parallel for directives.
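    If you want to make this explicit, a default(none) clause forces every variable referenced in the loop to be listed, which documents that the big arrays are shared (not copied); this is just an illustrative variant of the loop in main:

    #pragma omp parallel for default(none) shared(X, P, PAutocorrelations)
    for (int tau = 1; tau < 64; ++tau) {
        myFunc(tau, X, P, PAutocorrelations[tau - 1]);  // tau is private (loop index), the arrays are shared
    }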


    Notes about memory

    [1] Since your CPU has 64 cores, it might be subject to NUMA effects, especially on multi-socket nodes or AMD CPUs. On my PC, I do not have (many) NUMA effects, which would explain why I cannot observe a big work imbalance. In such cases, the memory needs to be initialized in parallel so as to be spread over the NUMA nodes. AFAIK, this is not (easily) possible with std::vector. An alternative solution is to use numactl to automatically interleave pages over the memory banks. Such an automatic configuration is sub-optimal, but often better than not paying attention to the distribution of pages across NUMA nodes. This is, I think, the most probable factor impacting the load balancing of this code on server CPUs.
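    For completeness, here is a minimal first-touch sketch (assuming Linux's default first-touch page placement). The idea is to allocate a flat, uninitialized buffer and to write it for the first time from within a parallel loop, so that its pages get spread over the NUMA nodes instead of all landing on the node running main. A std::vector cannot be used directly here because its constructor already touches every element sequentially; the flat Xflat buffer and its indexing are an illustrative change, not the code of the question.

    #include <memory>

    // Uninitialized allocation: no page is touched yet.
    std::unique_ptr<double[]> Xflat(new double[static_cast<size_t>(numRows) * numCols]);

    // First touch in parallel: each page is placed on the NUMA node of the thread touching it.
    #pragma omp parallel for schedule(static)
    for (int t = 0; t < numRows; ++t)
        for (int i = 0; i < numCols; ++i)
            Xflat[static_cast<size_t>(t) * numCols + i] = 1.0;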


    How to make the program faster

    The comments provide several good guidelines for making the program better. However, they miss the biggest performance issue in this specific code. Indeed:

    The biggest issue is the data-dependent read/store. Indeed, writing to a cache line takes a few cycles, and so does reading it. If you write to an array and then immediately read it again, you pay a significant L1-cache latency. Modern (x86-64) processors are cleverly designed and can optimize this using a store-forwarding strategy: the processor keeps the stored value in a kind of temporary register, so a load that needs the value does not have to wait for it to be written and read back. However, this processor optimization is far from perfect: there is generally still an additional latency to pay due to store forwarding. This is a big issue in this code because the store forwarding lies on the critical path of a very long chain of dependent memory accesses. To make the program faster, we need to break this dependency chain. One way to do that is loop unrolling with independent temporary arrays.

    Here is the optimized version:

    // Computes one (i, j) contribution and accumulates it into arr.
    static inline void computeCell(const std::vector<std::vector<double>>& X, const std::vector<std::vector<double>>& P, int tau, int t, int i, int j, std::vector<double>& arr)
    {
        const int diff = static_cast<int>(std::abs(X[t][i] - X[t + tau][j]));
        arr[diff] += P[t][i] * P[t + tau][j];
    }
    
    void myFunc(int tau, const std::vector<std::vector<double>>& X, const std::vector<std::vector<double>>& P, std::vector<double>& PAutocorrelation)
    {
        // Four independent accumulators so consecutive iterations do not depend
        // on each other through the same read-modify-write location.
        std::vector<double> tmp1(PAutocorrelation.size());
        std::vector<double> tmp2(PAutocorrelation.size());
        std::vector<double> tmp3(PAutocorrelation.size());
        std::vector<double> tmp4(PAutocorrelation.size());
    
        for (int t = 0; t < numRows - tau; ++t) {
            for (int i = 0; i < numCols; ++i) {
                // Unrolled by 4: each unrolled lane writes to its own array.
                for (int j = 0; j < numCols/4*4; j+=4) {
                    computeCell(X, P, tau, t, i, j+0, tmp1);
                    computeCell(X, P, tau, t, i, j+1, tmp2);
                    computeCell(X, P, tau, t, i, j+2, tmp3);
                    computeCell(X, P, tau, t, i, j+3, tmp4);
                }
                // Remainder loop when numCols is not a multiple of 4.
                for (int j = numCols/4*4; j < numCols; ++j) {
                    computeCell(X, P, tau, t, i, j, tmp1);
                }
            }
        }
        // Final reduction of the four partial accumulators.
        for (std::size_t i = 0; i < PAutocorrelation.size(); ++i)
            PAutocorrelation[i] += tmp1[i] + tmp2[i] + tmp3[i] + tmp4[i];
    }
    

    This version is about 2.5 times faster with both GCC and Clang on my machine, both sequentially and in parallel. This is because different cache lines are read and written, and each array can be read/stored independently. The program can be made about 20% faster by using fixed-size, stack-allocated temporary arrays instead of the tmp vectors (a sketch is shown below). This also has another benefit: stack-allocated arrays are guaranteed not to cause false sharing in this case, so it can improve scalability a bit. On my machine (i5-9600KF with 6 cores), the program is relatively fast and scales very well.
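    As a rough sketch of that fixed-size variant (assuming the 11 bins allocated in main are a hard upper bound for diff, which this sketch does not check), the tmp vectors become local stack arrays and computeCell takes a plain pointer:

    static inline void computeCell(const std::vector<std::vector<double>>& X, const std::vector<std::vector<double>>& P, int tau, int t, int i, int j, double* arr)
    {
        const int diff = static_cast<int>(std::abs(X[t][i] - X[t + tau][j]));
        arr[diff] += P[t][i] * P[t + tau][j];
    }

    // Inside myFunc: fixed-size, zero-initialized accumulators on the stack.
    constexpr int numBins = 11;
    double tmp1[numBins] = {}, tmp2[numBins] = {}, tmp3[numBins] = {}, tmp4[numBins] = {};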

    The key to better performance would be to remove this data-dependent read/store, but that does not seem possible here (it could be if diff were smaller).

    PS: I used the following compilation flags: -O3 -fopenmp -mavx2 -mfma.