Tags: r, parallel-processing, out-of-memory, mclapply

In parallel processing, select all rows that contain a specific keyword in R


My data (df) contains ~2,000K (2 million) rows and ~5K unique names. For each unique name, I want to select all the rows from df that contain that name. For example, the data frame df looks as follows:

id  names
1   A,B,D
2   A,B
3   A
4   B,D
5   C,E
6   C,D,E
7   A,E

I want to select all the rows that contain 'A' (A is among the 5K unique names) in the column 'names'. So the output will be:

id  names
1   A,B,D
2   A,B
3   A
7   A,E

I am trying to do this in parallel with mclapply, using 20 cores and 80 GB of memory, but I am still running out of memory.

Here is my code to select the rows containing specific name:

subset_select <- function(x, df) {
  # grepl() on as.matrix(df) searches every column (including 'id');
  # restoring the matrix dims lets rowSums() flag rows with at least one hit
  indx <- which(
    rowSums(
      `dim<-`(grepl(x, as.matrix(df), fixed = TRUE), dim(df))
    ) > 0
  )
  df[indx, ]
}

df_subset = subset_select(name,df)

My question is: is there a more efficient way (in terms of runtime and memory consumption) to get the subset of data for each of the 5K unique names? TIA.


Solution

  • Here is a parallelized way with package parallel.
    First, the data set has 2M rows. The following code merely reads it in, nothing more. See the commented line after scan.

    x <- scan(file = "~/tmp/temp.txt")
    #Read 2000000 items
    df1 <- data.frame(id = seq_along(x), names = x)
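
    If you don't have a file like "~/tmp/temp.txt", a comparable data set can be simulated instead. This is a sketch with made-up single-letter names, not the original data:

```r
# Sketch: simulate a data set comparable to the 2M-row original
# (the name pool A..E is an assumption, not the original data).
set.seed(2021)
pool <- c("A", "B", "C", "D", "E")
# all 1- to 3-name combinations, e.g. "A", "A,B", "C,D,E"
combos <- unlist(lapply(1:3, function(k) combn(pool, k, paste, collapse = ",")))
x <- sample(combos, 2e6, replace = TRUE)
df1 <- data.frame(id = seq_along(x), names = x)
```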
    

    Now the code.
    The parallelized mclapply loop breaks the data into chunks of N rows each and processes them independently. The return value inx2 is a list of per-chunk index vectors, so it must be unlisted.

    library(parallel)
    
    ncores <- detectCores() - 1L
    pat <- "A"
    
    t1 <- system.time({
      inx1 <- grep(pat, df1$names)
    })
    
    t2 <- system.time({
      N <- 10000L
      iters <- seq_len(ceiling(nrow(df1) / N))
      inx2 <- mclapply(iters, function(k){
        i <- seq_len(N) + (k - 1L)*N
        j <- grep(pat, df1[i, "names"])
        i[j]
      }, mc.cores = ncores)
      inx2 <- unlist(inx2)
    })
    
    identical(df1[inx1, ], df1[inx2, ])  
    #[1] TRUE
    
    rbind(t1, t2)
    #   user.self sys.self elapsed user.child sys.child
    #t1     5.325    0.001   5.371      0.000     0.000
    #t2     0.054    0.093   2.446      3.688     0.074
    

    The mclapply version took less than half the elapsed time of the straightforward grep. Note also that each worker returns only integer row indices, not copies of the data, which keeps memory use low.
    R version 4.1.1 (2021-08-10) on Ubuntu 20.04.3 LTS.
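
    Two closing remarks, beyond the timings above. First, grep(pat, ...) does substring matching, so with real multi-character names a pattern like "A" would also match a name "AB". Second, since the goal is a subset for each of the 5K names, running one grep per name repeats work; a single strsplit pass can build an exact name-to-rows lookup once. Here is a sketch on the toy data from the question (the variable names parts, rows, and idx_by_name are illustrative, not from the answer):

```r
# Build an exact name -> row-index lookup in one pass (sketch).
df1 <- data.frame(id = 1:7,
                  names = c("A,B,D", "A,B", "A", "B,D", "C,E", "C,D,E", "A,E"))
parts <- strsplit(df1$names, ",", fixed = TRUE)   # split each row once
rows  <- rep(seq_len(nrow(df1)), lengths(parts))  # row index per name token
idx_by_name <- split(rows, unlist(parts))         # name -> rows containing it
df1[idx_by_name[["A"]], ]  # rows with id 1, 2, 3, 7
```

    After the one-time split, the subset for any of the 5K names is a constant-time list lookup, with no partial-match false positives.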