I'm encountering an issue where my program is faster than std::reduce with std::execution::par enabled. I highly doubt that my code is more optimized, especially considering that I am using std::accumulate in my multi-threaded code, so I suspect there is something wrong with my usage of std::reduce.
I have also used a very long vector in the code, so the overhead of std::reduce should not be a significant factor.
#include <chrono>
#include <execution>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

void accum(std::vector<int>& arr, size_t start, size_t end,
           std::vector<long long>& sum, size_t index) {
    sum[index] = std::accumulate(arr.begin() + start, arr.begin() + end, 0LL);
}

int main() {
    // Initialization
    std::vector<int> arr{};
    size_t arr_size = 1'000'000'000;
    for (size_t i{}; i < arr_size; i++) {
        arr.push_back(i % 1000);
    }

    // First Method, Multithread
    std::cout << "----------Multithread----------" << std::endl;
    auto start = std::chrono::high_resolution_clock::now();
    size_t num_thread = std::thread::hardware_concurrency();
    if (num_thread == 0) {
        num_thread = 1;
    }
    std::vector<long long> sum_threads{};
    for (size_t i{}; i < num_thread; i++) {
        sum_threads.push_back(0);
    }
    std::vector<std::thread> threads{};
    size_t interval = arr_size / num_thread;
    for (size_t i{}; i < num_thread; i++) {
        size_t start{i * interval};
        size_t end{(i + 1) * interval};
        if (i + 1 == num_thread) {
            end = arr_size;
        }
        threads.emplace_back(accum, std::ref(arr), start, end,
                             std::ref(sum_threads), i);
    }
    for (size_t i{}; i < num_thread; i++) {
        threads[i].join();
    }
    long long sum = std::accumulate(sum_threads.begin(), sum_threads.end(), 0LL);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << sum << std::endl;
    std::cout << duration.count() << std::endl;

    // Second Method, std::accumulate
    std::cout << "----------std::accumulate----------" << std::endl;
    start = std::chrono::high_resolution_clock::now();
    sum = std::accumulate(arr.begin(), arr.end(), 0LL);
    end = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << sum << std::endl;
    std::cout << duration.count() << std::endl;

    // Third Method, std::reduce with parallel execution
    std::cout << "----------std::reduce----------" << std::endl;
    start = std::chrono::high_resolution_clock::now();
    sum = std::reduce(std::execution::par, arr.begin(), arr.end(), 0LL);
    end = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << sum << std::endl;
    std::cout << duration.count() << std::endl;
    return 0;
}
This is my output:
----------Multithread----------
499500000000
42
----------std::accumulate----------
499500000000
250
----------std::reduce----------
499500000000
98
And this is my CMakeLists.txt:
# Specify the minimum required version of CMake for this project.
cmake_minimum_required(VERSION 3.20)
project(Thread)
# Set the C++ standard to C++23 and make it a required standard.
# If the compiler does not support C++23, CMake will produce an error.
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
include_directories(include/)
add_executable(main
src/main.cpp
)
# Set compiler flags for C++.
# -Wall, -Wextra, -Werror, and -Wpedantic are used for stricter warnings and error handling.
set(CMAKE_CXX_FLAGS "-Wall -Wextra -Werror -Wpedantic -O3")
I was expecting std::reduce to result in faster execution. I am running this on a MacBook Pro with an M3 Pro chip, using a GCC 15.1 Docker container.
You need to use std::execution::par_unseq instead of just par to get the most performance out of std::reduce. par doesn't allow overlapping execution within a thread (vectorization), whereas par_unseq does; see execution policies. Specifically, libstdc++ seems to use a vectorized reduction with par_unseq:
sum = std::reduce(std::execution::par_unseq, arr.begin(), arr.end(), 0LL);
If you reduce the vector size to below about 1 million elements, so that the thread-spawning overhead becomes noticeable, then std::reduce(std::execution::par_unseq, ...) could be a lot faster than your code, as it typically uses a thread pool.
On a side note: I had to link in TBB to enable libstdc++ to use the parallel backend on Ubuntu, so make sure std::reduce is actually using all your cores:
find_package(TBB REQUIRED)
target_link_libraries(main PRIVATE TBB::tbb)