Tags: r, performance, apply, reduce, microbenchmark

Computationally efficient alternative to row-wise apply on a list of same-length vectors


I have a list of same-length vectors, and I need to compute different statistical functions element-wise across the list, i.e. across all vectors at each position. I know I can do this with apply by first creating a data frame and then running the function row-wise:

apply(X = do.call(what = "data.frame", args = foo), MARGIN = 1, FUN = "sum", na.rm = TRUE)

This process is very slow, though, when run a couple hundred thousand times. With Reduce I managed to get a partial solution, e.g. using + instead of sum, which is a lot faster:

Reduce(f = "+", x = foo)

However, I did not manage to get a pure Reduce version working with functions like mean, sd, and others, nor to pass the argument na.rm = TRUE.
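For illustration, a naive call like the following fails, because Reduce() combines elements pairwise, so mean() ends up treating the second vector as its trim argument (and a mean of pairwise means would not be the element-wise mean anyway):

Reduce(f = "mean", x = foo)
# errors with: 'trim' must be numeric of length one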

A sub-optimal way to speed things up a little while still relying on apply is to only speed up the matrix creation with Reduce:

apply(X = Reduce(function(...) cbind(...), foo), MARGIN = 1, FUN = "sum", na.rm = TRUE)

Here is a reproducible example and a comparison of the solutions I have come up with so far.

foo <- lapply(X = 1:1e2, FUN = function(x) 1:10)
microbenchmark::microbenchmark("apply_do.call" = apply(X = do.call(what = "data.frame", args = foo), MARGIN = 1, FUN = "sum", na.rm = TRUE),
                               "apply_Reduce" = apply(X = Reduce(f = function(...) cbind(...), x = foo), MARGIN = 1, FUN = "sum", na.rm = TRUE),
                               "Reduce" = Reduce(f = "+", x = foo))

This results in the following output on my machine (Linux):

Unit: microseconds
          expr      min       lq       mean    median        uq      max neval cld
 apply_do.call 4256.337 4406.775 5211.12032 4603.8670 5180.1275 13508.14   100   c
  apply_Reduce  292.775  326.124  480.88223  346.5525  436.1935  7559.97   100  b 
        Reduce   37.505   43.197   51.53856   46.7935   55.2790   131.72   100 a 

As you can see, the potential for computational improvement is drastic, with the Reduce version needing only about 1 % of the computation time of the first example.

Is there a way to get an efficient solution to my problem (compatible with min, max, sum, mean, sd, ...) without relying on external libraries? Bonus points if the answer manages to correctly compute statistics like mean and sd as well as preserve object types like POSIXct.


Solution

  • The matrixStats package is your friend. apply is designed for matrices, and applying it to data.frames is very inefficient. Since you appear to have only numbers, there is no need for a data.frame; use cbind instead to get a matrix. In addition, use the compiled functions such as matrixStats::rowSums2.

    > microbenchmark::microbenchmark(
    +   apply_df=apply(do.call("data.frame", foo), 1, sum, na.rm=TRUE),
    +   apply_mat=apply(do.call("cbind", foo), 1, sum, na.rm=TRUE),
    +   ms=matrixStats::rowSums2(do.call(what="cbind", args=foo)), 
    +   "Reduce"=Reduce(f="+", x=foo),
    +   check='equal')
    Unit: microseconds
          expr      min        lq        mean    median         uq       max neval cld
      apply_df 9210.000 9446.8110 10276.74015 9648.5480 10090.6055 18426.747   100  a 
     apply_mat   93.513   97.9100   106.15389  103.0785   107.2585   147.374   100   b
            ms   39.493   42.3560    49.29303   47.4105    53.4270    99.889   100   b
        Reduce   72.032   75.5015    84.13138   80.2595    86.5755   128.951   100   b
    

    There are also matrixStats::rowMins, matrixStats::rowMaxs, matrixStats::rowSds, and many more that could become your friends.
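
    Regarding the bonus question, here is a minimal sketch of how the same pattern might cover mean, sd, NA handling, and POSIXct; the example list bar as well as the origin and timezone below are assumptions for illustration:

    mat <- do.call(what = "cbind", args = foo)
    matrixStats::rowMeans2(mat, na.rm = TRUE)  # row-wise mean, ignoring NAs
    matrixStats::rowSds(mat, na.rm = TRUE)     # row-wise standard deviation
    matrixStats::rowMins(mat, na.rm = TRUE)    # row-wise minimum
    matrixStats::rowMaxs(mat, na.rm = TRUE)    # row-wise maximum

    # POSIXct is stored as numeric seconds, so one way to preserve the class is
    # to convert to numeric, aggregate, and convert back:
    bar <- lapply(X = 1:3, FUN = function(x) Sys.time() + 1:10)  # hypothetical list of POSIXct vectors
    res <- matrixStats::rowMeans2(do.call(what = "cbind", args = lapply(bar, as.numeric)))
    as.POSIXct(res, origin = "1970-01-01", tz = "UTC")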