Tags: c++, std, c++20, stdvector

How to parallelize a sum over an arbitrary column of a vector of vectors using C++ standard library execution policies?


I'm trying to parallelize part of a larger program using the C++ standard library and its execution policies. The original program uses std::accumulate to calculate sums over columns of 2D vectors (vectors of vectors), but since std::accumulate doesn't accept execution policies, I'm trying to find a parallelizable alternative.

I tried switching to using std::reduce instead of std::accumulate. I'm quite new to C++, but from what I gathered from the C++ reference they should work quite similarly. However, my code does not compile after making this change. Why doesn't the modified code work and how can I fix it? Is there a better way to parallelize the sum over a column of a 2d vector (vector of vectors) using the C++ standard library? The implementation should work for both CPU and GPU parallelization.

Minimal reproducible example:

#include <iostream>
#include <vector>
#include <algorithm>
#include <numeric>

int main(int argc, char *argv[])
{
    std::vector<std::vector<double>> vec = {{1,2,3}, {4,5,6}, {7,8,9}};

    // working sequential version of sum over second column of vec
    double res = std::accumulate(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });

    std::cout << res << std::endl; // prints 15 as expected

    // same but with reduce, does not compile
    res = std::reduce(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });

    std::cout << res << std::endl;
}

Trying to compile this, I get the following errors:

g++-10 -std=c++20 program.cpp

program.cpp:16:90: error: subscripted value is neither array nor pointer
   16 |     res = std::reduce(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });
      |                   

program.cpp:16:87: error: no match for ‘operator+’ (operand types are ‘std::vector<double>’ and ‘__gnu_cxx::__alloc_traits<std::allocator<double>, double>::value_type’ {aka ‘double’})
   16 |     res = std::reduce(vec.begin(), vec.end(), 0.0, [&](auto sum, auto b) { return sum + b[1]; });
      |                                                                                   ~~~~^~~~

Solution

  • Re: how to compactly achieve your goal. You are looking for std::transform_reduce - it separates the action of transforming the input (in your case, selecting one element from each vector) from that of reducing the resulting transformed range. Like this:

    res = std::transform_reduce(
        vec.begin(), vec.end(), 0.0,
        std::plus{},
        [](const auto& v) { return v[1]; });
    

    Add an execution policy (for example std::execution::par from <execution>) as the first argument, to taste.


    Re: why your attempt using std::reduce doesn't compile. When the type of the initial value (say, V) differs from the type of the range elements (say, E), std::reduce expects a binary operation that can be called with any combination of the two, four combinations in all, and whose result is convertible to V. This is necessary to implement parallel execution.

    E.g. in the simplest case, std::reduce could split the range in two, reduce each half, and combine the results. But then it wouldn't have the initial value for the second half, so it needs to be able to call op(E, E) to start working on that half. And in the end, it would have two partial results in hand, and would need op(V, V) to combine them. Your two errors correspond to exactly these calls: with op(V, V), b is a double and b[1] cannot be subscripted; with op(E, E), sum is a std::vector<double> and there is no operator+ for a vector and a double.

    This could be achieved by writing a named class with four overloads of operator(). Or, much more verbosely, by writing a lambda with a chain of if constexpr - something like

    [](const auto& a, const auto& b) {
      if constexpr(std::is_same_v<double, std::decay_t<decltype(a)>> &&
                   std::is_same_v<double, std::decay_t<decltype(b)>>) {
        return a + b;
      }
      // ...
    }
    

    Or, again, use std::transform_reduce as a better fit for the original problem.