Tags: r, performance, data.table, query-optimization, lapply

Improve processing time of applying a function over a vector and grouping by columns


I am trying to apply a function over data.table columns while grouping by the values of other columns.

I am using lapply, but my script is quite slow.

To give some context, I am working with probability values.

Here is a reproducible example with dummy values:

###########
# Dummy data
library(data.table)
set.seed(99)
n_col <- 4
size <- 3e6
num_group2 <- 10
vec_1 <- paste0("PD_1_N", (0:n_col))
vec_2 <- paste0("PD_2_N", (0:n_col))
vec_3 <- paste0("PD_3_N", (0:n_col))
id <- rep(seq(1, size, 1), num_group2)
group_1 <- rep(sample(seq(1, size, 1), size=size, replace=TRUE), num_group2)
group_2 <- sort(rep(seq(1, num_group2, 1), size))
factor <- runif(size*num_group2, 0.5, 4)
data <- data.table(id, group_1, group_2, factor)
data[, vec_1] <- data.table(rep(runif(size, 0, 0.5), num_group2), 
                            rep(runif(size, 0, 0.5), num_group2), 
                            rep(runif(size, 0, 0.5), num_group2), 
                            rep(runif(size, 0, 0.5), num_group2), 
                            rep(runif(size, 0, 0.5), num_group2))
###############
# lapply step 1
t <- Sys.time()
data[, (vec_2) := lapply(.SD, function(x) pmin(1, factor*x)), .SDcols=vec_1]
Sys.time() - t

###############
# lapply step 2
t <- Sys.time()
data[, (vec_3) := lapply(.SD, function(x) 1 - prod((1 - x))), 
     by=c("group_1", "group_2"), .SDcols=vec_2]
Sys.time() - t

######################
# test: 2 steps in one
t <- Sys.time()
data[, (vec_3) := lapply(.SD, function(x) 1 - prod((1 - pmin(1, factor*x)))), 
     by=c("group_1", "group_2"), .SDcols=vec_1]
Sys.time() - t
# end test

Is there a way to improve the processing time of step 2? I am also surprised that combining the two steps into a single expression is actually much slower, around 10 minutes (see "test: 2 steps in one" in the code above).


Solution

  • data.table has an optimized version of prod. The data.table authors call this “GForce” optimization. See the "What is this sorcery" section here.

    The catch is that GForce only kicks in when the grouped expression is a bare call to an optimized function; it is disabled as soon as other operations are wrapped around or inside the prod.

    So we have to build the expression step by step:

    # First, what is inside the prod. No need for grouping
    data[, (vec_3) := lapply(.SD, function(x) 1 - x), .SDcols = vec_2]
    
    # Now, the prod alone, by groups. Now we take vec_3 in .SDcols
    
    data[, (vec_3) := lapply(.SD, prod), 
         by = c("group_1", "group_2"), .SDcols = vec_3]
    
    # Now, the outer operation, no need for grouping
    data[, (vec_3) := lapply(.SD, function(x) 1 - x), .SDcols = vec_3]
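
    A quick sanity check on a toy table (not the benchmark data) that the three-step rewrite is algebraically equivalent to the original one-liner:

    ```r
    library(data.table)

    dt <- data.table(g = c(1, 1, 2, 2, 2), p = c(0.1, 0.2, 0.3, 0.4, 0.5))

    # Original: single composite expression per group (no GForce)
    dt[, ref := 1 - prod(1 - p), by = g]

    # Step-by-step, GForce-friendly version
    dt[, q := 1 - p]             # inner part, no grouping needed
    dt[, q := prod(q), by = g]   # bare prod per group -> GForce gprod
    dt[, q := 1 - q]             # outer part, no grouping needed

    dt[, all.equal(ref, q)]      # TRUE
    ```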
    
    
    

    My results comparing both approaches:

    > data.table::setkey(data, group_1, group_2)
    > # lapply step 2
    > t <- Sys.time()
    > data[, (vec_3) := lapply(.SD, function(x) 1 - prod((1 - x))), 
    +      by=c("group_1", "group_2"), .SDcols=vec_2]
    > Sys.time() - t
    Time difference of 1.847846 mins
    
    > # Optimized
    > t <- Sys.time()
    > # First, what is inside the prod. No need for grouping
    > data[, (vec_3) := lapply(.SD, function(x) 1 - x), .SDcols = vec_2]
    > 
    > # Now, the prod alone, by groups. Now we take vec_3 in .SDcols
    > 
    > data[, (vec_3) := lapply(.SD, prod), 
    +      by = c("group_1", "group_2"), .SDcols = vec_3]
    > 
    > # Now, the outer operation, no need for grouping
    > data[, (vec_3) := lapply(.SD, function(x) 1 - x), .SDcols = vec_3]
    > Sys.time() - t
    Time difference of 4.312201 secs
    

    So roughly 1.85 minutes vs 4.31 seconds.

    To see whether GForce is being applied to a query, enable verbose mode (note that it adds some overhead):

    options(datatable.verbose = TRUE)

    Finding groups using uniqlist on key ... 0.540s elapsed (0.390s cpu) 
    Finding group sizes from the positions (can be avoided to save RAM) ... 0.110s elapsed (0.080s cpu) 
    lapply optimization changed j from 'lapply(.SD, prod)' to 'list(prod(PD_3_N0), prod(PD_3_N1), prod(PD_3_N2), prod(PD_3_N3), prod(PD_3_N4))'
    GForce optimized j to 'list(gprod(PD_3_N0), gprod(PD_3_N1), gprod(PD_3_N2), gprod(PD_3_N3), gprod(PD_3_N4))' (see ?GForce)
    Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.028
    gforce assign high and low took 0.025
    gforce eval took 1.122
    1.310s elapsed (0.970s cpu) 
    Assigning to 30000000 row subset of 30000000 rows
    RHS_list_of_columns == true
    

    The unary + in function(x) +prod(x) is enough to deactivate GForce in this example, so the comparison below shows that most of the speedup comes from GForce:

    > data[, (vec_3) := lapply(.SD, prod), .SDcols = vec_3, by = c("group_1", "group_2")]
    Finding groups using uniqlist on key ... 0.510s elapsed (0.420s cpu) 
    Finding group sizes from the positions (can be avoided to save RAM) ... 0.110s elapsed (0.090s cpu) 
    lapply optimization changed j from 'lapply(.SD, prod)' to 'list(prod(PD_3_N0), prod(PD_3_N1), prod(PD_3_N2), prod(PD_3_N3), prod(PD_3_N4))'
    GForce optimized j to 'list(gprod(PD_3_N0), gprod(PD_3_N1), gprod(PD_3_N2), gprod(PD_3_N3), gprod(PD_3_N4))' (see ?GForce)
    Making each group and running j (GForce TRUE) ... gforce initial population of grp took 0.028
    gforce assign high and low took 0.030
    gforce eval took 1.212
    1.390s elapsed (0.750s cpu) 
    Assigning to 30000000 row subset of 30000000 rows
    RHS_list_of_columns == true
    > Sys.time() - t
    Time difference of 3.031504 secs
    > t <- Sys.time()
    > data[, (vec_3) := lapply(.SD, function(x) +prod(x)), .SDcols = vec_3, by = c("group_1", "group_2")]
    Finding groups using uniqlist on key ... 0.530s elapsed (0.480s cpu) 
    Finding group sizes from the positions (can be avoided to save RAM) ... 0.110s elapsed (0.090s cpu) 
    lapply optimization changed j from 'lapply(.SD, function(x) +prod(x))' to 'list(..FUN1(PD_3_N0), ..FUN1(PD_3_N1), ..FUN1(PD_3_N2), ..FUN1(PD_3_N3), ..FUN1(PD_3_N4))'
    GForce is on, but not activated for this query; left j unchanged (see ?GForce)
    Old mean optimization is on, left j unchanged.
    Making each group and running j (GForce FALSE) ... 
      memcpy contiguous groups took 7.584s for 18964290 groups
      eval(j) took 63.648s for 18964290 calls
    00:01:30 elapsed (00:01:22 cpu) 
    > Sys.time() - t
    Time difference of 1.514286 mins
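
    Instead of setting the global option, the same diagnostics can be printed for a single query via the verbose argument of [.data.table; a minimal sketch on a toy table:

    ```r
    library(data.table)

    dt <- data.table(g = rep(1:2, each = 3), x = runif(6))

    # Bare prod per group: the verbose output includes
    # "GForce optimized j to 'list(gprod(x))'"
    dt[, prod(x), by = g, verbose = TRUE]

    # Composite expression per group: the verbose output says
    # "GForce is on, but not activated for this query"
    dt[, 1 - prod(1 - x), by = g, verbose = TRUE]
    ```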