Tags: r, hpc, ff, ffbase

How to speed up duplicate checking for huge ffdf objects


I have a list of ffdf objects that would take up about 76GB of RAM if loaded into memory rather than kept on disk via the ff package. These are their respective dim() results:

> ffdfs |> sapply(dim)
         [,1]     [,2]     [,3]      [,4]      [,5]      [,6]      [,7]
[1,] 11478746 12854627 10398332 404567958 490530023 540375993 913792256
[2,]        3        3        3         3         3         3         3
         [,8]     [,9]     [,10]     [,11]    [,12]     [,13]     [,14]
[1,] 15296863 11588739 547337574 306972654 11544523 255644408 556900805
[2,]        3        3         3         3        3         3         3
        [,15]     [,16]    [,17]
[1,] 13409223 900436690 15184264
[2,]        3         3        3

I want to count the duplicated rows in each ffdf, so I did the following:

check_duplication <- ffdfs |> sapply(function(df) {
    df[c("chr","pos")] |> duplicated() |> sum()
})

It works but it is extremely slow.

I am on an HPC node with about 110GB of RAM and 18 CPUs.

Is there any other option or setting I could adjust to speed up the process? Thank you.


Solution

  • Parallelization is a natural way to speed this up. It can be done at C level via data.table:

    library("data.table")
    #> data.table 1.14.2 using 4 threads (see ?getDTthreads).  Latest news: r-datatable.com
    
    set.seed(1L)
    x <- as.data.frame(replicate(2L, sample.int(100L, size = 1e+06L, replace = TRUE), simplify = FALSE))
    y <- as.data.table(x)
    microbenchmark::microbenchmark(duplicated(x), duplicated(y), times = 1000L)
    #> Unit: milliseconds
    #>           expr       min         lq       mean     median         uq       max neval
    #>  duplicated(x) 449.27693 596.242890 622.160423 625.610267 644.682319 734.39741  1000
    #>  duplicated(y)   5.75722   6.347518   7.413925   6.874593   7.407695  58.12131  1000
    

    The benchmark here shows that duplicated is much faster when applied to a data.table instead of an equivalent data frame. Of course, how much faster depends on the number of CPUs that you make available to data.table (see ?setDTthreads).
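
    For example, a minimal sketch of raising the thread count before running the duplicated() calls (18L here simply mirrors the 18 CPUs mentioned in the question; a scheduler may cap what is actually usable, so check what getDTthreads() reports):

    setDTthreads(18L)              # ask data.table to use up to 18 threads
    getDTthreads(verbose = TRUE)   # confirm how many threads will actually be used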

    If you go the data.table route, then you would process your 17 data frames like so:

    nduped <- function(ffd) {
      ## read just the two key columns into RAM, then coerce in place
      x <- as.data.frame(ffd[c("chr", "pos")])
      setDT(x)
      n <- sum(duplicated(x))
      ## free x before the next data frame is read into memory
      rm(x)
      gc(FALSE)
      n
    }
    vapply(list_of_ffd, nduped, 0L)
    

    Here, we are using setDT rather than as.data.table to perform an in-place coercion from data frame to data.table, and we are using rm and gc to free the memory occupied by x before reading another data frame into memory.
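
    As a quick sanity check of nduped itself, here it is applied to two tiny ordinary data frames standing in for the ffdf objects (toy_list is purely illustrative; ffd[c("chr", "pos")] indexes a plain data frame the same way it indexes an ffdf):

    toy_list <- list(
      data.frame(chr = c(1L, 1L, 2L), pos = c(100L, 100L, 100L)),
      data.frame(chr = c(1L, 1L, 1L), pos = c(5L, 5L, 5L))
    )
    vapply(toy_list, nduped, 0L)
    #> [1] 1 2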

    If, for whatever reason, data.table is not an option, then you can stick to using the duplicated method for data frames, namely duplicated.data.frame. It is not parallelized at C level, so you would need to parallelize at R level, using, e.g., mclapply to assign your 17 data frames to batches and process those batches in parallel:

    nduped <- function(ffd) {
      x <- as.data.frame(ffd[c("chr", "pos")])
      n <- sum(duplicated(x))
      rm(x)
      gc(FALSE)
      n
    }
    unlist(parallel::mclapply(list_of_ffd, nduped, ...))
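
    For instance, a hedged sketch of the full call on your node (mc.cores = 4L is just a placeholder: each worker materializes a two-column data frame in RAM, so the worker count is bounded by the ~110GB of memory rather than by the 18 CPUs):

    ## placeholder worker count; tune mc.cores to how many data frames fit in memory at once
    counts <- parallel::mclapply(list_of_ffd, nduped, mc.cores = 4L)
    unlist(counts)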
    

    This option is slower and consumes more memory than you might expect. Fortunately, there is room for optimization. The rest of this answer highlights some of the main issues and ways to get around them. Feel free to stop reading if you've already settled on data.table.