From the data.table package website, given that "many common operations are internally parallelized to use multiple CPU threads", is this also the case when Map() is used within a data.table?

The reason for asking is that I have noticed that, when comparing the same operation on a large dataset (cor.test(x, y) with x = .SD and y being a single column of the dataset), the version using Map() runs faster than the one using furrr::future_map2().
You can use this rather exploratory approach and check whether the elapsed time shrinks when more threads are used. Note that on my machine the maximum number of usable threads is just one, so no difference is possible:
library(data.table)
dt <- data.table::data.table(a = 1:3, b = 4:6)
dt
#>    a b
#> 1: 1 4
#> 2: 2 5
#> 3: 3 6
data.table::getDTthreads()
#> [1] 1
# No parallelisation ---------------------------------
data.table::setDTthreads(1)
# each of the two columns sleeps for 2 seconds, so sequential execution takes about 4
system.time({
  dt[, lapply(.SD, function(x) {
    Sys.sleep(2)
    x
  })]
})
#>    user  system elapsed
#>   0.009   0.001   4.017
# Parallel -------------------------------------------
# use multiple threads
data.table::setDTthreads(2)
data.table::getDTthreads()
#> [1] 1
# if parallel, elapsed should be below 4
system.time({
  dt[, lapply(.SD, function(x) {
    Sys.sleep(2)
    x
  })]
})
#>    user  system elapsed
#>   0.001   0.000   4.007
# Map -----------------------------------------------
# if parallel, elapsed should be below 4
system.time({
  dt[, Map(f = function(x, y) {
    Sys.sleep(2)
    x
  }, .SD, 1:2)]
})
#>    user  system elapsed
#>   0.002   0.000   4.005
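
If your machine does have more than one usable thread, the same experiment can be repeated with a CPU-bound function that is closer to the cor.test() case from the question. The following is only a sketch: the helper time_with_threads(), the function run_map() and the table big are hypothetical names made up for this illustration, and the data are random. As far as I understand it, data.table's multithreading lives in its compiled code, so I would expect an arbitrary R function called through Map() in j to take roughly the same time for both thread settings, but the timings on your machine are what counts.

```r
library(data.table)

# Hypothetical helper: run `fun` under a given data.table thread count,
# report the thread count actually in use and the elapsed time,
# then restore the previous setting.
time_with_threads <- function(n_threads, fun) {
  old <- data.table::getDTthreads()
  on.exit(data.table::setDTthreads(old), add = TRUE)
  data.table::setDTthreads(n_threads)  # 0 means "use all logical CPUs"
  c(threads_in_use = data.table::getDTthreads(),
    elapsed = unname(system.time(fun())["elapsed"]))
}

# Made-up data: y is the single reference column, a and b play the role of .SD
set.seed(1)
big <- data.table(y = rnorm(1e5), a = rnorm(1e5), b = rnorm(1e5))

# Correlate every column in .SD with y, one cor.test() call per column
run_map <- function() {
  big[, Map(function(x, y) cor.test(x, y)$estimate, .SD, list(y)),
      .SDcols = c("a", "b")]
}

res <- rbind(single = time_with_threads(1, run_map),
             all    = time_with_threads(0, run_map))
print(res)
```

If the two elapsed values are about the same even though threads_in_use differs, the Map() call itself is not being parallelised by data.table.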