I've been experimenting with different parallel sorting techniques in C++ and conducted a performance comparison between:
std::sort
std::sort
utilizing std::execution::par
I ran the tests on a dataset of 20 million integers. Here are the elapsed times for each sorting method:
$ g++ -std=c++20 -fopenmp -pthread sort.cpp -o sort && ./sort
Building a vector of random integers...
Sorting with std::sort...
Elapsed time: 6.1555 s
Sorting with std::sort and std::execution::par...
Elapsed time: 6.45905 s
Sorting with OpenMP...
Elapsed time: 1.05043 s
$ g++ -std=c++20 -fopenmp -pthread sort.cpp -O3 -o sort
Building a vector of random integers...
Sorting with std::sort...
Elapsed time: 1.1355 s
Sorting with std::sort and std::execution::par...
Elapsed time: 1.14028 s
Sorting with OpenMP...
Elapsed time: 0.224919 s
g++ -v 2>&1 | tail -n1
gcc version 11.4.0 (Ubuntu 11.4.0-1ubuntu1~22.04)
The OpenMP implementation significantly outperformed the others, which was surprising, especially considering that the parallel std::sort
with std::execution::par
took even longer than the regular std::sort
. I'm curious to understand the underlying reasons for these performance results.
Here's the code I used for the benchmark:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <execution>
#include <vector>
#include <cassert>
#include <omp.h> // Include the OpenMP header
#define N 20'000'000
bool is_sorted(std::vector<int> &v) {
for (int i = 0; i < v.size() - 1; i++)
if (v[i] > v[i + 1]) return false;
return true;
}
// Fixed solution from https://stackoverflow.com/a/56079287/2612235
void omp_sort(std::vector<int> &v) {
const int data_count = v.size();
auto get_block_edge = [data_count](int i, int n) {
return data_count * i / n;
};
int blocks = 0;
#pragma omp parallel
{
blocks = omp_get_num_threads();
int block = omp_get_thread_num();
int start = get_block_edge(block, blocks);
int finish = get_block_edge(block + 1, blocks);
std::sort(std::begin(v) + start, std::begin(v) + finish);
}
for (int merge_step = 1; merge_step < blocks; merge_step *= 2) {
#pragma omp parallel for
for (int i = 0; i < blocks; i += 2 * merge_step) {
int start = get_block_edge(i, blocks);
int mid = std::min(get_block_edge(i + merge_step, blocks), data_count);
int finish = std::min(get_block_edge(i + 2 * merge_step, blocks), data_count);
if (mid < finish)
std::inplace_merge(std::begin(v) + start, std::begin(v) + mid, std::begin(v) + finish);
}
}
}
int main()
{
std::cout << "Building a vector of random integers..." << std::endl;
std::vector<int> data(N);
std::vector<int> v(N);
std::generate(data.begin(), data.end(), std::rand);
v = data;
std::cout << "Sorting with std::sort..." << std::endl;
auto start = std::chrono::high_resolution_clock::now();
std::sort(v.begin(), v.end());
auto end = std::chrono::high_resolution_clock::now();
std::chrono::duration<double> elapsed = end - start;
std::cout << "Elapsed time: " << elapsed.count() << " s\n";
assert(is_sorted(v));
v = data;
std::cout << "Sorting with std::sort and std::execution::par..." << std::endl;
start = std::chrono::high_resolution_clock::now();
std::sort(std::execution::par, v.begin(), v.end());
end = std::chrono::high_resolution_clock::now();
elapsed = end - start;
std::cout << "Elapsed time: " << elapsed.count() << " s\n";
assert(is_sorted(v));
v = data;
std::cout << "Sorting with OpenMP..." << std::endl;
start = std::chrono::high_resolution_clock::now();
omp_sort(v);
end = std::chrono::high_resolution_clock::now();
elapsed = end - start;
std::cout << "Elapsed time: " << elapsed.count() << " s\n";
assert(is_sorted(v));
}
I ensured that the sorted results are indeed sorted correctly using the is_sorted
function. I'm particularly interested in why the std::execution::par
version didn't perform as well as I expected, and why the OpenMP version is so much faster. Is there any room for improvement in the code, or is this performance discrepancy attributable to the nature of the algorithms and the overhead of managing parallel tasks?
My setup:
Processor Intel(R) Core(TM) i9-10900KF CPU @ 3.70GHz, 3696 Mhz, 10 Core(s), 20 Logical Processor(s) Installed Physical Memory (RAM) 128 GB
When I executed the (original) code on Linux, there existed an error of error: ‘std::execution’ has not been declared
. There is a post on intel community: https://community.intel.com/t5/Intel-oneAPI-Threading-Building/std-execution-par-error/m-p/1388112?utm_source=chatgpt.com
Then I added #include <tbb/tbb.h>
to the header of benchmark code and linked it to tbb lib (I used cmake on Linux), and here are the results:
Building a vector of random integers...
Sorting with std::sort...
Elapsed time: 1.74099 s
Sorting with std::sort and std::execution::par...
Elapsed time: 0.107865 s
Sorting with OpenMP...
Elapsed time: 0.278657 s
I guess it's the reason why you didn't obtain superior results of parallel std::sort
utilizing std::execution::par
.
Another question occurred at Why I don’t see speed improvement using std::execution with GCC?, which states that ... when you include <execution> it checks if it can find the tbb headers. If they available, it includes them and uses tbb as a backend. If it can't find a backend to use, it will fallback to std::execution::seq...
, quoted from Ted Lyngmo.
I hope this can answer your question.