Tags: c++, multithreading, vector, openmp

Why does creating a C++ vector in an OpenMP parallel for loop cause it to slow down significantly?


I am trying to initialize a std::vector inside a parallel for loop using OpenMP, but it takes significantly more time than even a serial loop would.

Here is an example of what I am trying:

#include <iostream>
#include <vector>
#include <omp.h>
#include <chrono>

int main()
{
    auto start = std::chrono::high_resolution_clock::now();

    //processor supports 24 threads
//#pragma omp parallel for
    for (int i = 0; i < 24; ++i) {
        
        for (int j = 0; j < 100; ++j) {
            for (int k = 0; k < 100; ++k) {
                for (int l = 0; l < 100; ++l) {
                    int result = i + j + k + l;

                    std::vector<int> resultsOver200 = {};
                    if (result > 200) {
                        resultsOver200.push_back(result);
                    }
                    //do something else with result
                }
            }
        }
    }

    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> duration = end - start;

    std::cout << "Execution time: " << duration.count() << " seconds\n";
}

Without utilizing OpenMP I get this result:

Execution time: 4.60552 seconds

If I uncomment the '#pragma omp parallel for' and run it, the results are:

Execution time: 38.7252 seconds

This is a very simplified version of what I am actually trying to accomplish, but it gets the point across. Does anyone know why this is happening, or have a way for me to use a dynamic array (where I don't know the element count at compile time) while still taking advantage of multithreading?

Also, as a note: I definitely want to create the new vector within the innermost for loop.


Solution

  • Allocations generally do not scale. In fact, contention on the default allocator can make the code significantly slower. Allocators typically use locks and atomics, so when many threads fight over the same shared resources, contention happens, and this contention is very expensive: it not only serializes the code but also makes each operation slower due to higher latency (e.g. cache-line bouncing, NUMA effects, etc.).

    To reduce this overhead, you can use an allocator better suited to parallel code, one that uses thread-local pools (e.g. jemalloc, which on Linux can typically be swapped in via LD_PRELOAD or by linking with -ljemalloc, without code changes). No allocator scales perfectly, though, and in this case most will not scale at all, so this alone is certainly not enough.

    One efficient solution is to recycle the vector in order to reduce allocations. Allocations are expensive even in sequential code, so you should avoid them as much as possible in hot loops. You could use a custom allocator for that (see the answer of @AhmedAEK), or simply move the declaration of resultsOver200 outside the entire loop nest and call resultsOver200.clear() (or resize(0)) before using it in the innermost loop: the buffer will not be reallocated unless its capacity is insufficient. Be careful though: the vector's capacity can stay huge, so it may be good to periodically force the capacity back down so as not to use too much RAM (e.g. by calling resultsOver200.shrink_to_fit() in the i-based or j-based loop). A sketch of this approach is shown below.
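
    Here is a minimal sketch of that recycling approach applied to the example from the question. Declaring resultsOver200 inside the parallel region (but outside the loop nest) makes it private to each thread, so each thread allocates once and then reuses the same buffer:

    #include <vector>
    #include <omp.h>

    int main()
    {
        #pragma omp parallel
        {
            // One vector per thread: allocated once, recycled across iterations
            std::vector<int> resultsOver200;

            #pragma omp for
            for (int i = 0; i < 24; ++i) {
                for (int j = 0; j < 100; ++j) {
                    for (int k = 0; k < 100; ++k) {
                        for (int l = 0; l < 100; ++l) {
                            int result = i + j + k + l;

                            resultsOver200.clear(); // keeps capacity: no reallocation
                            if (result > 200) {
                                resultsOver200.push_back(result);
                            }
                            //do something else with result
                        }
                    }
                    // Optional: periodically cap the capacity so it does not
                    // stay huge (trades an occasional reallocation for a
                    // bounded memory footprint)
                    resultsOver200.shrink_to_fit();
                }
            }
        }
    }

    With this structure, the push_back calls almost never hit the allocator, so the threads no longer contend on it.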

    By the way, you should use std::chrono::steady_clock instead of high_resolution_clock for measuring wall-clock time: steady_clock is guaranteed to be monotonic, whereas high_resolution_clock is often just an alias for system_clock, which can jump when the system time is adjusted.
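
    Swapping the clock in the question's timing code is a one-line change at each call site:

    auto start = std::chrono::steady_clock::now();
    // ... region to measure ...
    auto end = std::chrono::steady_clock::now();
    std::chrono::duration<double> duration = end - start;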