My data (df) contains ~2M rows and ~5K unique names. For each unique name, I want to select all the rows of df whose 'names' column contains that specific name. For example, the data frame df looks as follows:
id names
1 A,B,D
2 A,B
3 A
4 B,D
5 C,E
6 C,D,E
7 A,E
I want to select all the rows whose 'names' column contains 'A' (A is one of the 5K unique names). So the output will be:
id names
1 A,B,D
2 A,B
3 A
7 A,E
I am trying to do this with parallel processing, using mclapply with 20 cores and 80 GB of memory, but I still run into an out-of-memory error.
Here is my code to select the rows containing a specific name:
subset_select <- function(x, df) {
  # grepl over as.matrix(df) flags every cell containing the string x;
  # `dim<-` reshapes the logical vector back to the matrix dimensions,
  # and rowSums(...) > 0 keeps the rows with at least one match.
  indx <- which(
    rowSums(
      `dim<-`(grepl(x, as.matrix(df), fixed = TRUE), dim(df))
    ) > 0
  )
  df[indx, ]
}
df_subset <- subset_select(name, df)
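The per-name calls are wrapped in mclapply roughly like this (a simplified sketch of what I run, with unique_names holding the ~5K names):
library(parallel)
# Simplified sketch: run subset_select once per unique name, in parallel.
all_subsets <- mclapply(unique_names, subset_select, df = df, mc.cores = 20)
names(all_subsets) <- unique_names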
My question is: is there a more efficient way (in terms of runtime and memory consumption) to get the subset of data for each of the 5K unique names? TIA.
Here is a parallelized way with package parallel.
First, the data set has 2M rows. The following code is only meant to show that, not more; see the commented line after scan.
x <- scan(file = "~/tmp/temp.txt")
#Read 2000000 items
df1 <- data.frame(id = seq_along(x), names = x)
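If you do not have such a file, a synthetic data set of roughly the same shape can be simulated instead (this is only an assumption about the data, for illustration):
# Assumed synthetic stand-in: 2M rows, each 'names' entry a comma-separated
# sample of 1-3 labels drawn from 5K unique names (this takes a little while).
set.seed(2021)
lbls <- sprintf("N%04d", seq_len(5000))
n <- 2e6
x <- vapply(
  sample(1:3, n, replace = TRUE),
  function(k) paste(sample(lbls, k), collapse = ","),
  character(1)
)
df1 <- data.frame(id = seq_len(n), names = x)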
Now the code.
The parallelized mclapply loop breaks the data into chunks of N rows each and processes them independently. Then, the return value inx2 must be unlisted.
library(parallel)
ncores <- detectCores() - 1L
pat <- "A"

# Baseline: single-threaded grep over the whole column.
t1 <- system.time({
  inx1 <- grep(pat, df1$names)
})

# Parallel: grep each chunk of N rows on its own core, then collect the
# matching row numbers.
t2 <- system.time({
  N <- 10000L
  iters <- seq_len(ceiling(nrow(df1) / N))
  inx2 <- mclapply(iters, function(k){
    i <- seq_len(N) + (k - 1L)*N
    i <- i[i <= nrow(df1)]            # guard: the last chunk may be shorter
    j <- grep(pat, df1[i, "names"])
    i[j]
  }, mc.cores = ncores)
  inx2 <- unlist(inx2)
})
identical(df1[inx1, ], df1[inx2, ])
#[1] TRUE
rbind(t1, t2)
# user.self sys.self elapsed user.child sys.child
#t1 5.325 0.001 5.371 0.000 0.000
#t2 0.054 0.093 2.446 3.688 0.074
The mclapply version took less than half the time the straightforward grep took.
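If you need the subsets for all 5K names, the same idea can be reused once per name. Here is an untested sketch (note that grep with fixed = TRUE still does substring matching, so names that are substrings of other names would need an exact, strsplit-based match instead):
# Untested sketch: row indices for every unique name, computed in parallel.
unique_names <- unique(unlist(strsplit(as.character(df1$names), ",", fixed = TRUE)))
row_index <- mclapply(unique_names, function(nm) {
  grep(nm, df1$names, fixed = TRUE)   # caveat: substring matches are possible
}, mc.cores = ncores)
names(row_index) <- unique_names
# Subset for one name:
head(df1[row_index[[unique_names[1]]], ])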
R version 4.1.1 (2021-08-10) on Ubuntu 20.04.3 LTS.