I am about to receive a dataset with about 100,000 rows and 100 columns, some numeric, some character. I have written some code in advance using base R and dplyr, but it relies heavily on subsetting, so I am trying to make it faster by rewriting it in data.table. I am also concerned about memory use and would like to avoid storing the filtered datasets my operations produce.
My main operations look like this:
for (s in 0:2) {
  # s is the group to exclude: s = 1 drops group 1, s = 2 drops group 2,
  # and s = 0 matches no group, so the whole dataset is kept
  mean(data[group != s, ][["col1"]])
}
where mean() is just used as an example. In my actual code I run the same statistical analysis for group 1, group 2, and the whole dataset.
How can I do this efficiently in data.table? I am after something like a magrittr pipe structure, but fast, and where the temporarily filtered data does not get stored in memory.
Hope this makes sense (I am an R beginner).
PS: I am happy to accept functional solutions, although a loop-based solution would also be useful so I can slot it into my existing code.
I think you can do this all within the world of data.table:
## example data:
library(data.table)
data <- data.table(group = rep(1:2,each=3), col1=1:6)
Group by each group value, and apply your function mean to col1 where the group does not match the current group (group != .BY). A second line is needed to call the function on all the data, with no grouping applied:
## mean of col1, excluding each group in turn:
data[, data[group != .BY, mean(col1)], by = group]$V1
# [1] 5 2

## mean of col1 over the whole dataset:
data[, mean(col1)]
# [1] 3.5
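
Since your PS asks for a functional/loop form as well, here is a minimal sketch built on the same example data. The wrapper name run_analysis and its stat_fun argument are my own hypothetical names, not part of data.table; the point is that the filtered subset is built and discarded inside the [ call, so no named copy of it stays in memory:

## hypothetical wrapper: apply stat_fun to col1 on the rows
## that remain after excluding group `excl`
run_analysis <- function(DT, excl, stat_fun = mean) {
  DT[group != excl, stat_fun(col1)]
}

## s = 1 drops group 1, s = 2 drops group 2,
## s = 0 matches no group value, so all rows are kept
sapply(0:2, run_analysis, DT = data)
# [1] 3.5 5.0 2.0

Swap any function of a column vector in for stat_fun to run your actual analysis on the whole dataset, group 2 only, and group 1 only, respectively.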