My data (df) contains ~2M rows and ~5K unique names. For each unique name, I want to select all the rows of df whose 'names' column contains that specific name. For example, the data frame df looks as follows:
id names
1 A,B,D
2 A,B
3 A
4 B,D
5 C,E
6 C,D,E
7 A,E
I want to select all the rows whose 'names' column contains 'A' (A is one of the 5K unique names). So the output will be:
id names
1 A,B,D
2 A,B
3 A
7 A,E
I am trying to do this with parallel processing, using mclapply with 20 cores and 80 GB of memory, but I still run into an out-of-memory error.
Here is my code to select the rows containing a specific name:
subset_select <- function(x, df) {
  # grepl over as.matrix(df) flags every cell containing the string x;
  # `dim<-` reshapes the logical vector back to the matrix dimensions,
  # and rowSums(...) > 0 keeps the rows with at least one match.
  indx <- which(
    rowSums(
      `dim<-`(grepl(x, as.matrix(df), fixed = TRUE), dim(df))
    ) > 0
  )
  df[indx, ]
}
df_subset <- subset_select(name, df)
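The per-name calls are wrapped in mclapply roughly like this (a simplified sketch of what I run, with unique_names holding the ~5K names):
library(parallel)
# Simplified sketch: run subset_select once per unique name, in parallel.
all_subsets <- mclapply(unique_names, subset_select, df = df, mc.cores = 20)
names(all_subsets) <- unique_names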
My question is: is there a more efficient way (in terms of runtime and memory consumption) to get the subset of data for each of the 5K unique names? TIA.
Here is a parallelized way with package parallel.
First, the data set has 2M rows. The following code is only meant to show that, not more; see the commented line after scan.
x <- scan(file = "~/tmp/temp.txt")
#Read 2000000 items
df1 <- data.frame(id = seq_along(x), names = x)
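If you do not have such a file, a synthetic data set of roughly the same shape can be simulated instead (this is only an assumption about the data, for illustration):
# Assumed synthetic stand-in: 2M rows, each 'names' entry a comma-separated
# sample of 1-3 labels drawn from 5K unique names (this takes a little while).
set.seed(2021)
lbls <- sprintf("N%04d", seq_len(5000))
n <- 2e6
x <- vapply(
  sample(1:3, n, replace = TRUE),
  function(k) paste(sample(lbls, k), collapse = ","),
  character(1)
)
df1 <- data.frame(id = seq_len(n), names = x)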
Now the code.
The parallelized mclapply loop breaks the data into chunks of N rows each and processes them independently. Then, the return value inx2 must be unlisted.
library(parallel)
ncores <- detectCores() - 1L
pat <- "A"

# Baseline: single-threaded grep over the whole column.
t1 <- system.time({
  inx1 <- grep(pat, df1$names)
})

# Parallel: grep each chunk of N rows on its own core, then collect the
# matching row numbers.
t2 <- system.time({
  N <- 10000L
  iters <- seq_len(ceiling(nrow(df1) / N))
  inx2 <- mclapply(iters, function(k){
    i <- seq_len(N) + (k - 1L)*N
    i <- i[i <= nrow(df1)]            # guard: the last chunk may be shorter
    j <- grep(pat, df1[i, "names"])
    i[j]
  }, mc.cores = ncores)
  inx2 <- unlist(inx2)
})
identical(df1[inx1, ], df1[inx2, ])
#[1] TRUE
rbind(t1, t2)
# user.self sys.self elapsed user.child sys.child
#t1 5.325 0.001 5.371 0.000 0.000
#t2 0.054 0.093 2.446 3.688 0.074
The mclapply version took less than half the time the straightforward grep took.
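If you need the subsets for all 5K names, the same idea can be reused once per name. Here is an untested sketch (note that grep with fixed = TRUE still does substring matching, so names that are substrings of other names would need an exact, strsplit-based match instead):
# Untested sketch: row indices for every unique name, computed in parallel.
unique_names <- unique(unlist(strsplit(as.character(df1$names), ",", fixed = TRUE)))
row_index <- mclapply(unique_names, function(nm) {
  grep(nm, df1$names, fixed = TRUE)   # caveat: substring matches are possible
}, mc.cores = ncores)
names(row_index) <- unique_names
# Subset for one name:
head(df1[row_index[[unique_names[1]]], ])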
R version 4.1.1 (2021-08-10) on Ubuntu 20.04.3 LTS.