I have a list of same-length vectors, and I need to run different statistical functions per vector element across the list. I know I can do this with apply by creating a data frame first and then running the function row-wise:
apply(X = do.call(what = "data.frame", args = foo), MARGIN = 1, FUN = "sum", na.rm = TRUE)
This process is very slow, though, when run a couple hundred thousand times.
With Reduce I managed to get a partial solution, e.g. when using + instead of sum, which is a lot faster:
Reduce(f = "+", x = foo)
However, I did not manage to get the pure Reduce version working with functions like mean, sd, and others, nor with the argument na.rm = TRUE.
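For illustration, the closest I got to a Reduce-based mean is the sketch below; the NA-aware variant is my own workaround (it tracks per-element non-NA counts separately), and I have not found a comparable trick for sd:
# Element-wise mean via Reduce; only valid if foo contains no NAs
Reduce(f = "+", x = foo) / length(foo)

# NA-aware sketch: accumulate sums and non-NA counts separately
sums   <- Reduce(f = "+", x = lapply(foo, function(v) ifelse(is.na(v), 0, v)))
counts <- Reduce(f = "+", x = lapply(foo, function(v) !is.na(v)))
sums / counts  # equivalent to the na.rm = TRUE mean per element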
A sub-optimal solution that speeds things up a little while still relying on apply is to speed up just the matrix creation with Reduce:
apply(X = Reduce(function(...) cbind(...), foo), MARGIN = 1, FUN = "sum", na.rm = TRUE)
Here is a reproducible example and a comparison of the solutions I came up with so far:
foo <- lapply(X = 1:1e2, FUN = function(x) 1:10)
microbenchmark::microbenchmark("apply_do.call" = apply(X = do.call(what = "data.frame", args = foo), MARGIN = 1, FUN = "sum", na.rm = TRUE),
"apply_Reduce" = apply(X = Reduce(f = function(...) cbind(...), x = foo), MARGIN = 1, FUN = "sum", na.rm = TRUE),
"Reduce" = Reduce(f = "+", x = foo))
This results in the following output on my machine (Linux):
Unit: microseconds
expr min lq mean median uq max neval cld
apply_do.call 4256.337 4406.775 5211.12032 4603.8670 5180.1275 13508.14 100 c
apply_Reduce 292.775 326.124 480.88223 346.5525 436.1935 7559.97 100 b
Reduce 37.505 43.197 51.53856 46.7935 55.2790 131.72 100 a
As you can see, the potential for computational improvement is drastic, with the Reduce version needing only about 1% of the computation time of the first example.
Is there a way to get an efficient solution to my problem (compatible with min, max, sum, mean, sd, ...) without relying on external libraries? Bonus points if the answer manages to correctly compute statistics like mean and sd, as well as preserving object types like POSIXct.
The matrixStats package is your friend. apply is designed for matrices, and applying it to data.frames is very inefficient. Since you appear to have only numbers, there's no need for a data.frame; use cbind instead to get a matrix. In addition, use the compiled functions like matrixStats::rowSums2:
> microbenchmark::microbenchmark(
+ apply_df=apply(do.call("data.frame", foo), 1, sum, na.rm=TRUE),
+ apply_mat=apply(do.call("cbind", foo), 1, sum, na.rm=TRUE),
+ ms=matrixStats::rowSums2(do.call(what="cbind", args=foo)),
+ "Reduce"=Reduce(f="+", x=foo),
+ check='equal')
Unit: microseconds
expr min lq mean median uq max neval cld
apply_df 9210.000 9446.8110 10276.74015 9648.5480 10090.6055 18426.747 100 a
apply_mat 93.513 97.9100 106.15389 103.0785 107.2585 147.374 100 b
ms 39.493 42.3560 49.29303 47.4105 53.4270 99.889 100 b
Reduce 72.032 75.5015 84.13138 80.2595 86.5755 128.951 100 b
There are also matrixStats::rowMins, matrixStats::rowMaxs, matrixStats::rowSds, and many more that could come in handy.
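As a quick sketch of how those cover the mean and sd cases from the question (same cbind matrix as above; all of these accept na.rm):
m <- do.call(what = "cbind", args = foo)
matrixStats::rowMeans2(m, na.rm = TRUE)  # element-wise mean
matrixStats::rowSds(m, na.rm = TRUE)     # element-wise standard deviation
matrixStats::rowMins(m, na.rm = TRUE)    # element-wise minimum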