I have an extremely large data.table with 1.6x10^8 rows, and I want to perform a row-wise operation between the columns exposure and exposure.before.index, as seen in my example below.
I created the column TI (i.e. Treatment Intensification), which is an indicator of whether or not an ID is currently on a drug/drugs (exposure) that is different from any drugs they were on at that ID's respective first row (exposure.before.index). You can review my code and observe that the final output is as described.
library(data.table)
DT <- data.table::data.table(ID=c("a","a","a","b","b","c","c"),
drugA=c(1,1,1,0,0,0,0),
drugB=c(0,1,1,1,0,0,0),
drugC=c(0,0,1,0,1,0,0))
DT[, exposure := gsub("NA\\+|\\+NA", "", do.call(paste,
c(Map(function(x, y) names(.SD)[(NA^!x) * y], .SD,
seq_along(.SD)), sep="+"))), .SDcols = drugA:drugC]
DT[exposure=="NA",exposure:="NONE"]
DT[,exposure.before.index:=c("drugA","drugA","drugA","drugB","drugB","NONE","NONE")]
DT[,CNT:=1:.N]
DT[!(exposure.before.index!="NONE" & exposure=="NONE"),TI:=(any(!unlist(strsplit(exposure, "[+]"))%in%unlist(strsplit(exposure.before.index, "[+]")))),by="CNT"]
DT[is.na(TI),TI:=FALSE]
DT
ID drugA drugB drugC exposure exposure.before.index CNT TI
1: a 1 0 0 drugA drugA 1 FALSE
2: a 1 1 0 drugA+drugB drugA 2 TRUE
3: a 1 1 1 drugA+drugB+drugC drugA 3 TRUE
4: b 0 1 0 drugB drugB 4 FALSE
5: b 0 0 1 drugC drugB 5 TRUE
6: c 0 0 0 NONE NONE 6 FALSE
7: c 0 0 0 NONE NONE 7 FALSE
I created CNT in order to apply my function any(!unlist(strsplit(exposure, "[+]")) %in% unlist(strsplit(exposure.before.index, "[+]"))) between exposure and exposure.before.index. With the 1.6x10^8 rows that I have, this method is taking quite some time. I usually use this data.table[..., by="CNT"] technique when I want to apply a certain operation/function row-wise, but I'm finding it isn't robust for extremely large data.tables. Are there other methods that some of y'all use that are more robust than mine?
I've found other questions similar to this topic, but the answers weren't generalized to applying a row-wise operation with a user-defined function in a robust manner.
Any help and/or advice is appreciated.
This is difficult. strsplit will not be very memory efficient for a dataset of 1.6x10^8 rows - each row requires two lists to be made by strsplit. My suggestion is to wrap the logic in a function and skip the by = 1:.N (i.e. by = "CNT") step.
exposed = function(before, after) {
  out = vector(length = length(before))  # logical result, one entry per row
  for (i in seq_along(before)) {
    bef = before[i]
    aft = after[i]
    if (bef == "NONE" || aft == "NONE")
      out[i] = FALSE
    else
      # with fixed = TRUE the split pattern must be the literal "+", not the regex "[+]"
      out[i] = any(!unlist(strsplit(aft, "+", fixed = TRUE), use.names = FALSE) %chin%
                     unlist(strsplit(bef, "+", fixed = TRUE), use.names = FALSE))
  }
  return(out)
}
DT[, TI3 := exposed(exposure.before.index, exposure)]
> DT[, .(exposure.before.index, exposure, TI, TI3)]
exposure.before.index exposure TI TI3
1: drugA drugA FALSE FALSE
2: drugA drugA+drugB TRUE TRUE
3: drugA drugA+drugB+drugC TRUE TRUE
4: drugB drugB FALSE FALSE
5: drugB drugC TRUE TRUE
6: NONE NONE FALSE FALSE
7: NONE NONE FALSE FALSE
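A quick sanity check (my own addition, not in the original answer) that the function reproduces the by = "CNT" result on every row:
DT[, all(TI == TI3)]   # should return TRUE if the two approaches agree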
Note there are a few optimizations here:
- %chin% instead of %in%: a data.table utility function that is faster than %in% on character vectors.
- strsplit(..., fixed = TRUE): the split pattern is the literal string "+" rather than a regular expression. Likely the biggest performance boost.
- unlist(..., use.names = FALSE): skips constructing names on the unlisted vector, which shaves a bit of time off every call.
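As a rough illustration of the %chin% point (my own hypothetical example, not from the answer; timings will vary by machine):

library(data.table)
library(microbenchmark)

# membership tests on a large character vector
x   <- sample(c("drugA", "drugB", "drugC", "NONE"), 1e6, replace = TRUE)
tbl <- c("drugA", "drugB")

microbenchmark(
  base = x %in% tbl,    # base R membership test
  chin = x %chin% tbl,  # data.table's character-optimized version
  times = 10L
)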
The next step would be to turn the function into an Rcpp function, which is not done in full here - strings are more complicated than numbers in Rcpp (at least for me) - but a rough sketch of the idea follows.
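Below is a minimal, hypothetical sketch (not part of the original answer) of what such an Rcpp version could look like; exposed_cpp and its internals are illustrative names, and the split-on-"+" logic simply mirrors the R function above:

library(Rcpp)

cppFunction(includes = c("#include <set>", "#include <sstream>"), code = "
LogicalVector exposed_cpp(CharacterVector before, CharacterVector after) {
  int n = before.size();
  LogicalVector out(n);
  for (int i = 0; i < n; ++i) {
    std::string bef = Rcpp::as<std::string>(before[i]);
    std::string aft = Rcpp::as<std::string>(after[i]);
    if (bef == \"NONE\" || aft == \"NONE\") { out[i] = false; continue; }
    std::set<std::string> seen;             // drugs present at the index date
    std::stringstream bs(bef);
    std::string tok;
    while (std::getline(bs, tok, '+')) seen.insert(tok);
    bool ti = false;                        // becomes true if any current drug is new
    std::stringstream as_(aft);
    while (std::getline(as_, tok, '+')) {
      if (seen.find(tok) == seen.end()) { ti = true; break; }
    }
    out[i] = ti;
  }
  return out;
}")

DT[, TI4 := exposed_cpp(exposure.before.index, exposure)]   # TI4 is an illustrative column name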
Here's the performance of the R function exposed(). For the 7-row example it is about 4 times faster than the by = "CNT" approach, but as we increase the number of rows the speed difference becomes less pronounced:
## 7 rows
Unit: microseconds
expr min lq mean median uq max
use_fx 375.801 395.251 662.582 409.751 431.351 21345.701
OP 1889.901 2021.601 2211.858 2096.101 2285.201 4042.801
## 700,000 rows
Unit: seconds
expr min lq mean median uq max
use_fx 4.409595 4.409595 4.409595 4.409595 4.409595 4.409595
OP 12.592520 12.592520 12.592520 12.592520 12.592520 12.592520
## 7,000,000 rows
Unit: seconds
expr min lq mean median uq max
use_fx 43.90979 43.90979 43.90979 43.90979 43.90979 43.90979
OP 130.16418 130.16418 130.16418 130.16418 130.16418 130.16418
## code used (700,000-row case):
library(microbenchmark)
DT_big = DT[rep(seq_len(.N), 1e5)]
microbenchmark(
  use_fx = DT_big[, TI3 := exposed(exposure.before.index, exposure)],
  OP = {
    DT_big[, CNT := 1:.N]
    DT_big[!(exposure.before.index != "NONE" & exposure == "NONE"),
           TI := (any(!unlist(strsplit(exposure, "[+]")) %in% unlist(strsplit(exposure.before.index, "[+]")))),
           by = "CNT"]
    DT_big[is.na(TI), TI := FALSE]
  },
  times = 1L
)
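Presumably the larger runs were generated the same way with a bigger replication factor, e.g.:

# assumption (not shown in the answer): the 7,000,000-row case repeats the 7-row table 1e6 times
DT_big = DT[rep(seq_len(.N), 1e6)]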
If you are interested in Rcpp, this may be helpful: