Tags: r, performance, loops, filter, data.table

data.table loop through filtered data without storing it


I am about to receive a dataset with about 100,000 rows and 100 columns, some numeric and some character. I have written some code in advance using base R and dplyr, but it relies heavily on subsetting, so I am trying to make it faster by rewriting it in data.table. I am also concerned about memory use, so I would like to avoid storing the filtered datasets produced by these operations.

My main operations look like this:

for (s in 0:2) {  # where 0 = group1, 1 = group2, 2 = allData
  mean(data[group != s, ][["col1"]])
}

where mean() is just used as an example. In my actual code I run the same statistical analysis on group 1, group 2 and the whole dataset.

How can I do this efficiently with data.table? I am after something like a magrittr pipe structure, but fast, and where the temporarily filtered data does not get stored in memory.

Hope this makes sense (I am an R beginner).

PS: I am happy to accept functional solutions, although a loop-based solution would also be useful so it can slot into my existing code.


Solution

  • I think you can do this all within the world of data.table:

    ## example data:
    library(data.table)
    data <- data.table(group = rep(1:2, each = 3), col1 = 1:6)
    

    Group by each value of group, and apply your function mean() to col1 on the rows where group does not match the current group (group != .BY). A second call is needed to apply the function to all of the data, with no grouping:

    ## mean of col1 excluding each group in turn
    data[, data[group != .BY, mean(col1)], by = group]$V1
    # [1] 5 2
    
    ## mean of col1 over the whole dataset
    data[, mean(col1)]
    # [1] 3.5
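
    Since the PS also asks for a loop-based version, here is a minimal sketch of the same idea written as a loop, assuming the 1/2 group coding from the example data above (the results vector and its ordering are my own illustration, not part of the question's 0/1/2 coding). Computing mean(col1) directly in j means the filtered rows are never assigned to a separate object:

    ## loop-based sketch (assumes the example data above is loaded)
    results <- numeric(3)
    for (s in 1:2) {
      # mean of col1 for the rows *not* in group s, computed in j
      results[s] <- data[group != s, mean(col1)]
    }
    results[3] <- data[, mean(col1)]  # whole dataset, no filter
    results
    # [1] 5.0 2.0 3.5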