Tags: r, data.table, robust, rowwise

data.table: Perform efficient row-wise operation on large data.table with columns as input


I have an extremely large data.table with 1.6x10^8 rows, and I want to perform a row-wise operation between the columns exposure and exposure.before.index, as seen in my example below.

I created the column TI (i.e. Treatment Intensification), an indicator of whether or not an ID is currently on a drug or drugs (exposure) different from any drugs they were on in each ID's respective first row (exposure.before.index). You can review my code below and observe that the final output is as described.

library(data.table)
DT <- data.table::data.table(ID = c("a","a","a","b","b","c","c"),
                             drugA = c(1,1,1,0,0,0,0),
                             drugB = c(0,1,1,1,0,0,0),
                             drugC = c(0,0,1,0,1,0,0))

# Build exposure: paste the names of the drug columns that are 1 in each row,
# separated by "+". (NA^!x is 1 when x == 1 and NA when x == 0, so inactive
# drugs index names(.SD) with NA and the "NA" pieces are stripped by gsub.)
DT[, exposure := gsub("NA\\+|\\+NA", "", do.call(paste,
                      c(Map(function(x, y) names(.SD)[(NA^!x) * y], .SD,
                            seq_along(.SD)), sep = "+"))), .SDcols = drugA:drugC]
DT[exposure == "NA", exposure := "NONE"]
DT[, exposure.before.index := c("drugA","drugA","drugA","drugB","drugB","NONE","NONE")]

# CNT is a row id so that by = "CNT" applies the check one row at a time
DT[, CNT := 1:.N]
DT[!(exposure.before.index != "NONE" & exposure == "NONE"),
   TI := any(!unlist(strsplit(exposure, "[+]")) %in%
               unlist(strsplit(exposure.before.index, "[+]"))),
   by = "CNT"]
DT[is.na(TI), TI := FALSE]
DT

   ID drugA drugB drugC          exposure exposure.before.index CNT    TI
1:  a     1     0     0             drugA                 drugA   1 FALSE
2:  a     1     1     0       drugA+drugB                 drugA   2  TRUE
3:  a     1     1     1 drugA+drugB+drugC                 drugA   3  TRUE
4:  b     0     1     0             drugB                 drugB   4 FALSE
5:  b     0     0     1             drugC                 drugB   5  TRUE
6:  c     0     0     0              NONE                  NONE   6 FALSE
7:  c     0     0     0              NONE                  NONE   7 FALSE

I created CNT in order to apply my function any(!unlist(strsplit(exposure, "[+]")) %in% unlist(strsplit(exposure.before.index, "[+]"))) between exposure and exposure.before.index. With the 1.6x10^8 rows that I have, this method is taking quite some time. I usually use this data.table[..., by = "CNT"] technique when I want to apply an operation/function row-wise, but I'm finding it does not scale to extremely large data.tables. Are there other methods that some of y'all have that are more robust than mine?

I've found other questions similar to mine, but the answers weren't generalized to applying a user-defined function row-wise in a robust manner.

Any help and/or advice is appreciated.


Solution

  • This is difficult. strsplit will not be very memory efficient for a dataset of 1.6x10^8 rows: each row requires two lists to be made by strsplit. My suggestion is to use a function and skip the per-row grouping step (by = "CNT", i.e. by = 1:.N).

    exposed = function(before, after) {
      out = vector(length = length(before))
      for (i in seq_along(before)) {
        bef = before[i]
        aft = after[i]
        if (bef == "NONE" || aft == "NONE") {
          out[i] = FALSE
        } else {
          # with fixed = TRUE the split pattern is the literal "+",
          # not the regex character class "[+]"
          out[i] = any(!unlist(strsplit(aft, "+", fixed = TRUE), use.names = FALSE) %chin%
                         unlist(strsplit(bef, "+", fixed = TRUE), use.names = FALSE))
        }
      }
      return(out)
    }
    
    DT[, TI3 := exposed(exposure.before.index, exposure)]
    
    > DT[, .(exposure.before.index, exposure, TI, TI3)]
       exposure.before.index          exposure    TI   TI3
    1:                 drugA             drugA FALSE FALSE
    2:                 drugA       drugA+drugB  TRUE  TRUE
    3:                 drugA drugA+drugB+drugC  TRUE  TRUE
    4:                 drugB             drugB FALSE FALSE
    5:                 drugB             drugC  TRUE  TRUE
    6:                  NONE              NONE FALSE FALSE
    7:                  NONE              NONE FALSE FALSE
    

    Note there are a few optimizations here:

    1. Using data.table's %chin% instead of %in%; it is a utility function that is faster than %in% on character vectors.
    2. Using strsplit(..., fixed = TRUE), since we are not splitting on a regular expression. Note that the pattern then has to be the literal "+" rather than the regex "[+]". This is likely the biggest performance boost (see the short demo after this list).
    3. Using unlist(..., use.names = FALSE), which skips building a names attribute on the result.
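
    To see why the split pattern changes alongside fixed = TRUE, here is a quick base-R demonstration (the strings are just illustrative):

    strsplit("drugA+drugB", "[+]")                # regex class: "drugA" "drugB"
    strsplit("drugA+drugB", "[+]", fixed = TRUE)  # literal "[+]": no split occurs
    strsplit("drugA+drugB", "+", fixed = TRUE)    # literal "+": "drugA" "drugB"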

    The next step would be to rewrite the function in Rcpp, which is not done here. Strings are more complicated than numbers in Rcpp (at least for me); a sketch of what it could look like follows.
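
    As a rough starting point, here is a minimal, hypothetical sketch of such an Rcpp version via Rcpp::cppFunction (the raw string needs R >= 4.0). The function name exposed_cpp and the std::getline tokenizer are illustrative choices, not code from the original answer, and it has not been benchmarked at the 1.6x10^8-row scale:

    library(Rcpp)

    cppFunction(includes = c("#include <set>", "#include <sstream>", "#include <string>"),
                code = r"---(
    LogicalVector exposed_cpp(CharacterVector before, CharacterVector after) {
      R_xlen_t n = before.size();
      LogicalVector out(n);
      std::string tok;
      for (R_xlen_t i = 0; i < n; ++i) {
        std::string bef = as<std::string>(before[i]);
        std::string aft = as<std::string>(after[i]);
        if (bef == "NONE" || aft == "NONE") { out[i] = false; continue; }
        // collect the baseline drugs into a set
        std::set<std::string> base;
        std::stringstream sb(bef);
        while (std::getline(sb, tok, '+')) base.insert(tok);
        // TRUE if any current drug is absent from the baseline
        bool any_new = false;
        std::stringstream sa(aft);
        while (std::getline(sa, tok, '+')) {
          if (base.find(tok) == base.end()) { any_new = true; break; }
        }
        out[i] = any_new;
      }
      return out;
    }
    )---")

    DT[, TI4 := exposed_cpp(exposure.before.index, exposure)]

    Whether this actually outperforms the optimized R loop at scale would need benchmarking.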

    Here's the performance of the exposed() function. For the 7-row example it is about 4 times faster than the by = "CNT" approach; as the rows increase, the speedup settles at roughly 3x:

    ## 7 rows
    Unit: microseconds
       expr      min       lq     mean   median       uq       max
     use_fx  375.801  395.251  662.582  409.751  431.351 21345.701
         OP 1889.901 2021.601 2211.858 2096.101 2285.201  4042.801
    
    ## 700,000 rows
    Unit: seconds
       expr       min        lq      mean    median        uq       max
     use_fx  4.409595  4.409595  4.409595  4.409595  4.409595  4.409595
         OP 12.592520 12.592520 12.592520 12.592520 12.592520 12.592520
    
    ## 7,000,000 rows
    Unit: seconds
       expr       min        lq      mean    median        uq       max
     use_fx  43.90979  43.90979  43.90979  43.90979  43.90979  43.90979
         OP 130.16418 130.16418 130.16418 130.16418 130.16418 130.16418
    
    ## code used (1e5 reps of the 7-row table gives the 700,000-row case):
    library(microbenchmark)
    DT_big = DT[rep(seq_len(.N), 1e5)]
    microbenchmark(
      use_fx = DT_big[, TI3 := exposed(exposure.before.index, exposure)],
      OP = {
        DT_big[, CNT := 1:.N]
        DT_big[!(exposure.before.index != "NONE" & exposure == "NONE"),
               TI := any(!unlist(strsplit(exposure, "[+]")) %in%
                           unlist(strsplit(exposure.before.index, "[+]"))),
               by = "CNT"]
        DT_big[is.na(TI), TI := FALSE]
      },
      times = 1L
    )
    
    

    If you are interested in Rcpp, this may be helpful:

    https://wckdouglas.github.io/2015/05/string-manipulation