c++ multithreading

std::reduce with std::execution::par seems to do nothing


I'm encountering an issue where my hand-rolled multi-threaded code is faster than std::reduce with std::execution::par. I highly doubt that my code is better optimized, especially considering that I am using std::accumulate inside my multi-threaded code. I suspect there is something wrong with my usage of std::reduce.

I have also used a very long vector in the code, so the overhead of std::reduce should not be a significant factor.

#include <chrono>
#include <execution>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

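// Sums arr[start, end) into sum[index]; each thread writes only to its own slot.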
void accum(std::vector<int>& arr, size_t start, size_t end,
           std::vector<long long>& sum, size_t index) {
    sum[index] = std::accumulate(arr.begin() + start, arr.begin() + end, 0LL);
}

int main() {
    // Initialization
    std::vector<int> arr{};
    size_t arr_size = 1'000'000'000;
    for (size_t i{}; i < arr_size; i++) {
        arr.push_back(i % 1000);
    }

    // First Method, Multithread
    std::cout << "----------Multithread----------" << std::endl;
    auto start = std::chrono::high_resolution_clock::now();
    size_t num_thread = std::thread::hardware_concurrency();
    if (num_thread == 0) {
        num_thread = 1;
    }

    std::vector<long long> sum_threads{};
    for (size_t i{}; i < num_thread; i++) {
        sum_threads.push_back(0);
    }

    std::vector<std::thread> threads{};
    size_t interval = arr_size / num_thread;
    for (size_t i{}; i < num_thread; i++) {
        size_t start{i * interval};
        size_t end{(i + 1) * interval};
        if (i + 1 == num_thread) {
            end = arr_size;
        }
        threads.emplace_back(accum, std::ref(arr), start, end,
                             std::ref(sum_threads), i);
    }

    for (size_t i{}; i < num_thread; i++) {
        threads[i].join();
    }

    long long sum = std::accumulate(sum_threads.begin(), sum_threads.end(), 0LL);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << sum << std::endl;
    std::cout << duration.count() << std::endl;
    
    // Second Method std::accumulate
    std::cout << "----------std::accumulate----------" << std::endl;
    start = std::chrono::high_resolution_clock::now();
    sum = std::accumulate(arr.begin(), arr.end(), 0LL);
    end = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << sum << std::endl;
    std::cout << duration.count() << std::endl;

    std::cout << "----------std::reduce----------" << std::endl;
    // Third Method std::reduce with parallel execution
    start = std::chrono::high_resolution_clock::now();
    sum = std::reduce(std::execution::par, arr.begin(), arr.end(), 0LL);
    end = std::chrono::high_resolution_clock::now();
    duration = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    std::cout << sum << std::endl;
    std::cout << duration.count() << std::endl;

    return 0;
}

This is my output:

----------Multithread----------
499500000000
42
----------std::accumulate----------
499500000000
250
----------std::reduce----------
499500000000
98

And this is my CMakeLists.txt:

# Specify the minimum required version of CMake for this project. 
cmake_minimum_required(VERSION 3.20)

project(Thread)

# Set the C++ standard to C++23 and make it a required standard.
# If the compiler does not support C++23, CMake will produce an error.
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

include_directories(include/)

add_executable(main
        src/main.cpp
)

# Set compiler flags for C++.
# -Wall, -Wextra, -Werror, and -Wpedantic are used for stricter warnings and error handling.
set(CMAKE_CXX_FLAGS "-Wall -Wextra -Werror -Wpedantic -O3")

I was expecting that std::reduce would result in faster execution.

I am running this on a MacBook Pro with an M3 Pro chip, using a GCC 15.1 Docker container.


Solution

  • You need to use std::execution::par_unseq instead of plain par to get the most performance out of std::reduce.

    par does not allow calls to be interleaved on the same thread (i.e. vectorized), whereas par_unseq does; see execution policies. Specifically, libstdc++ seems to use a vectorized reduction with par_unseq:

    sum = std::reduce(std::execution::par_unseq, arr.begin(), arr.end(), 0LL);
    

    If you reduce the vector size to below 1 million elements, so that the thread-spawning overhead becomes noticeable, then std::reduce(std::execution::par_unseq,... could be a lot faster than your code, since it typically uses a thread pool.
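
    A minimal, self-contained sketch of such a small-input benchmark (the 500'000-element size and the timing harness are illustrative assumptions, not taken from your post):

    #include <chrono>
    #include <execution>
    #include <iostream>
    #include <numeric>
    #include <vector>

    int main() {
        // Small enough that spawning fresh std::threads per run would be
        // a noticeable fraction of the total time.
        std::vector<int> small(500'000);
        std::iota(small.begin(), small.end(), 0);

        auto t0 = std::chrono::steady_clock::now();
        long long s = std::reduce(std::execution::par_unseq,
                                  small.begin(), small.end(), 0LL);
        auto t1 = std::chrono::steady_clock::now();

        std::cout << s << " computed in "
                  << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                  << " us" << std::endl;
    }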


    On a side note: I had to link in TBB to get libstdc++ to use its parallel backend on Ubuntu, so make sure std::reduce is actually using all your cores:

    find_package(TBB REQUIRED)
    target_link_libraries(main PRIVATE TBB::tbb)
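
    If you build directly with g++ inside the container instead of through CMake, the equivalent is simply to add -ltbb to the link line (the file and target names below follow the layout of your CMakeLists and are assumptions):

    g++ -std=c++23 -O3 -Wall -Wextra -Wpedantic src/main.cpp -ltbb -o main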