Is `Map()` parallel when used inside a `data.table`? - R


The data.table package website states that:

"many common operations are internally parallelized to use multiple CPU threads"

Does this mean that Map() also runs in parallel when it is called inside a data.table? I ask because, on a large dataset, the same operation (cor.test(x, y), with x = .SD and y a single column of the dataset) runs faster with Map() than with furrr::future_map2().
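
For concreteness, the operation described above might look roughly like the following sketch. The data, the column names (`x1`, `x2`, `x3`, `y`) and the choice to keep only the correlation estimate are made up for illustration; the real dataset is much larger.

```r
library(data.table)

set.seed(1)
# Hypothetical data: three columns to test and one single column `y`
dt <- data.table(x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100), y = rnorm(100))
x_cols <- c("x1", "x2", "x3")

# One cor.test() per column of .SD, each against the single column `y`.
# list(y) has length 1, so Map() recycles it across the columns of .SD.
res <- dt[, Map(function(x, y) cor.test(x, y)$estimate, .SD, list(y)),
          .SDcols = x_cols]
res
```

The result is a one-row data.table with one correlation estimate per column named in `.SDcols`.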


Solution

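  • You can take a rather exploratory approach: run the same operation with different thread counts and check whether the elapsed time shrinks when more threads are used. Note that only one thread is usable on my machine (getDTthreads() still returns 1 after setDTthreads(2)), so the run below cannot show a difference either way.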

    library(data.table)
    
    dt <- data.table::data.table(a = 1:3,
                                 b = 4:6)
    dt
    #>    a b
    #> 1: 1 4
    #> 2: 2 5
    #> 3: 3 6
    
    data.table::getDTthreads()
    #> [1] 1
    
    # No parallelisation ---------------------------------
    data.table::setDTthreads(1)
    system.time({
      
      dt[, lapply(.SD,
                  function(x) {
                    Sys.sleep(2)
                    x}
      )
      ]
    })
    #>    user  system elapsed 
    #>   0.009   0.001   4.017
    
    # Parallel -------------------------------------------
    # use multiple threads
    data.table::setDTthreads(2)
    data.table::getDTthreads()
    #> [1] 1
    
    # two columns x 2 s sleep each: if parallel, elapsed should be below 4
    system.time({
      
      dt[, lapply(.SD,
                  function(x) {
                    Sys.sleep(2)
                    x}
      )
      ]
    })
    #>    user  system elapsed 
    #>   0.001   0.000   4.007
    
    # Map -----------------------------------------------
    # if parallel, elapsed should be below 4
    system.time({
      
      dt[, Map(f = function(x, y) {
                 Sys.sleep(2)
                 x
               },
               .SD,
               1:2)]
    })
    #>    user  system elapsed 
    #>   0.002   0.000   4.005
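
To repeat the check above on a machine with more than one usable thread, the timing can be wrapped in a small helper. This is only a sketch: `time_with_threads()` is a hypothetical name, and the sleep is shortened to 0.5 s per column. As far as I understand, data.table's multithreading lives in its own compiled routines, while a function passed to `Map()` or `lapply()` in `j` is ordinary R code that R calls one element at a time, so I would expect the elapsed time to stay at roughly the sequential total whatever the thread setting.

```r
library(data.table)

dt <- data.table(a = 1:3, b = 4:6)

# Hypothetical helper: time the same j-expression under a given thread setting
time_with_threads <- function(n_threads) {
  old <- setDTthreads(n_threads)  # setDTthreads() returns the previous setting
  on.exit(setDTthreads(old))      # restore it when the helper exits
  system.time(
    dt[, Map(function(x, y) {
      Sys.sleep(0.5)              # stand-in for an expensive computation
      x
    }, .SD, 1:2)]
  )[["elapsed"]]
}

# Two columns x 0.5 s each: about 1 s per run if the calls are sequential
elapsed <- vapply(c(1L, 2L), time_with_threads, numeric(1))
elapsed
```

If the second value is not clearly smaller than the first, the `Map()` calls were not run in parallel.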