I'm trying to parallelize part of a larger program using the C++ standard library and its execution policies. The original program uses std::accumulate to calculate sums over columns of 2D vectors (vectors of vectors), but since std::accumulate doesn't accept execution policies, I'm trying to find a parallelizable alternative.
I tried switching from std::accumulate to std::reduce. I'm quite new to C++, but from what I gathered from the C++ reference, they should work quite similarly. However, my code does not compile after making this change. Why doesn't the modified code work, and how can I fix it? Is there a better way to parallelize the sum over a column of a 2D vector (vector of vectors) using the C++ standard library? The implementation should work for both CPU and GPU parallelization.
Minimal reproducible example:
#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>
int main(int argc, char *argv[])
{
    std::vector<std::vector<double>> vec = {{1,2,3}, {4,5,6}, {7,8,9}};
    // working sequential version of sum over second column of vec
    double res = std::accumulate(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });
    std::cout << res << std::endl; // prints 15 as expected
    // same but with reduce, does not compile
    res = std::reduce(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });
    std::cout << res << std::endl;
}
Trying to compile this, I get the errors
g++-10 -std=c++20 program.cpp
program.cpp:16:90: error: subscripted value is neither array nor pointer
16 | res = std::reduce(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });
|
program.cpp:16:87: error: no match for ‘operator+’ (operand types are ‘std::vector<double>’ and ‘__gnu_cxx::__alloc_traits<std::allocator<double>, double>::value_type’ {aka ‘double’})
16 | res = std::reduce(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });
| ~~~~^~~~
Re: how to compactly achieve your goal. You are looking for std::transform_reduce: it separates the action of transforming the input (in your case, selecting one element from each vector) from that of reducing the resulting transformed range. Like this:
res = std::transform_reduce(
    vec.begin(), vec.end(), 0.0,
    std::plus{},
    [](const auto& v) { return v[1]; });
Add an execution policy as the first argument to taste, e.g. std::execution::par from <execution>.
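For completeness, here is a minimal self-contained sketch of the same call wrapped in a helper. The name column_sum and its parameters are hypothetical, not from the question, and the sketch assumes every row has more than col elements. The execution policy is deliberately left out so that it builds without extra dependencies; with GCC's libstdc++, the parallel policies are typically backed by TBB and may need linking against it.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: sums column `col` over all rows.
// Assumes every row has more than `col` elements (no bounds checking).
double column_sum(const std::vector<std::vector<double>>& rows, std::size_t col)
{
    // To parallelize, pass a policy such as std::execution::par
    // (from <execution>) as the first argument of transform_reduce.
    return std::transform_reduce(
        rows.begin(), rows.end(), 0.0,
        std::plus<>{},                                        // reduce step
        [col](const std::vector<double>& row) { return row[col]; });  // transform step
}
```

With the vec from the question, column_sum(vec, 1) gives the same 15 as the std::accumulate version.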
Re: why your attempt using std::reduce doesn't compile. When the type of the initial value (say, V) doesn't match the type of the range elements (say, E), std::reduce expects a binary operation (call it pred) that accepts any combination of the two as its arguments, four combinations in all, and returns something convertible to V. This is necessary to implement parallel execution.
E.g. in the simplest case, std::reduce could split the range in two, reduce each half, and combine the results. But then it wouldn't have the initial value for the second half, so it needs to be able to call pred(E, E) to start working on that half. And in the end, it would have two partial results of type V in hand, and would need pred(V, V) to combine them.
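The splitting argument can be made concrete with a toy sketch. The function two_way_reduce below is hypothetical and runs sequentially; it is not how any real standard library implements std::reduce. It only mirrors the "split in two, reduce each half, combine" scheme to show where each call shape comes from.

```cpp
#include <iterator>

// Toy illustration only. V is the type of the initial value,
// E is the element type of the range.
template <class It, class V, class Op>
V two_way_reduce(It first, It last, V init, Op op)
{
    const auto n = std::distance(first, last);
    const It mid = std::next(first, n / 2);

    // First half: starts from the initial value, so it calls op(V, E).
    V left = init;
    for (It it = first; it != mid; ++it)
        left = op(left, *it);

    // Second half has fewer than two elements: nothing to pair up,
    // so just keep folding into the first partial result.
    if (std::distance(mid, last) < 2) {
        for (It it = mid; it != last; ++it)
            left = op(left, *it);
        return left;
    }

    // No initial value is available here, so the first step is op(E, E) ...
    const It second = std::next(mid);
    V right = op(*mid, *second);
    // ... followed by op(V, E) for the rest of this half.
    for (It it = std::next(second); it != last; ++it)
        right = op(right, *it);

    // Combining the two partial results needs op(V, V).
    return op(left, right);
}
```

Calling this with the lambda from the question fails in the same two ways as the compiler output above: op(*mid, *second) passes a std::vector<double> as sum, so sum + b[1] has no matching operator+, and op(left, right) passes a double as b, so b[1] subscripts a value that is neither array nor pointer.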
This could be achieved by writing a named class with four overloads of operator(). Or, much more verbosely, by writing a lambda with a chain of if constexpr, something like
[](const auto& a, const auto& b) {
    if constexpr(std::is_same_v<double, std::decay_t<decltype(a)>> &&
                 std::is_same_v<double, std::decay_t<decltype(b)>>) {
        return a + b;
    }
    // ...
}
Or, again, use std::transform_reduce as a better fit for the original problem.