I am trying to parallelise the addition of (large) vectors using RcppParallel. That's what I've come up with.
// [[Rcpp::depends(RcppParallel)]]
#include <RcppParallel.h>
#include <Rcpp.h>
#include <assert.h>
using namespace RcppParallel;
using namespace Rcpp;
// [[Rcpp::export]]
NumericVector directVectorAddition(NumericVector first, NumericVector second) {
assert (first.length() == second.length());
NumericVector results(first.length());
results = first + second;
return results;
}
// [[Rcpp::export]]
NumericVector loopVectorAddition(NumericVector first, NumericVector second) {
assert (first.length() == second.length());
NumericVector results(first.length());
for(unsigned i = 0; i != first.length(); i++)
results[i] = first[i] + second[i];
return results;
}
struct VectorAddition : public Worker
{
const RVector<double> first, second;
RVector<double> results;
VectorAddition(const NumericVector one, const NumericVector two, NumericVector three) : first(one), second(two), results(three) {}
void operator()(std::size_t a1, std::size_t a2) {
std::transform(first.begin() + a1, first.begin() + a2,
second.begin() + a1,
results.begin() + a1,
[](double i, double j) {return i + j;});
}
};
// [[Rcpp::export]]
NumericVector parallelVectorAddition(NumericVector first, NumericVector second) {
assert (first.length() == second.length());
NumericVector results(first.length());
VectorAddition myVectorAddition(first, second, results);
parallelFor(0, first.length(), myVectorAddition);
return results;
}
It seems to work, but doesn't speed up things (at least not on a 4-core machine).
> v1 <- 1:1000000
> v2 <- 1000000:1
> all(directVectorAddition(v1, v2) == loopVectorAddition(v1, v2))
[1] TRUE
> all(directVectorAddition(v1, v2) == parallelVectorAddition(v1, v2))
[1] TRUE
> result <- benchmark(v1 + v2, directVectorAddition(v1, v2), loopVectorAddition(v1, v2), parallelVectorAddition(v1, v2), order="relative")
> result[,1:4]
test replications elapsed relative
1 v1 + v2 100 0.206 1.000
4 parallelVectorAddition(v1, v2) 100 0.993 4.820
2 directVectorAddition(v1, v2) 100 1.015 4.927
3 loopVectorAddition(v1, v2) 100 1.056 5.126
Can this be implemented more efficiently?
Thanks a lot in advance,
mce
Rookie mistake :) You define this as Rcpp::NumericVector
but create data that is created via the sequence operator. And that creates integer values so you are forcing a copy onto all your functions!
Make it
v1 <- as.double(1:1000000)
v2 <- as.double(1000000:1)
instead, and on a machine with lots of cores (at work) I then see
R> result[,1:4]
test replications elapsed relative
4 parallelVectorAddition(v1, v2) 100 0.301 1.000
2 directVectorAddition(v1, v2) 100 0.424 1.409
1 v1 + v2 100 0.436 1.449
3 loopVectorAddition(v1, v2) 100 0.736 2.445
The example is still not that impressive because the relevant operation is "cheap" whereas the parallel approach needs to allocate memory, copy data to workers, collect again etc pp.
But the good news is that you wrote your parallel code correctly. Not a small task.