I have a tibble that includes a list-column with vectors inside. I want to create a new column that holds the length of each vector. Since this dataset is large (3M rows), I thought I could shave off some processing time using the `furrr` package. However, it seems that `purrr` is faster than `furrr`. How come?
To demonstrate the problem, I first simulate some data. Don't bother to understand the code in the simulation part as it's irrelevant to the question.
data simulation function
library(stringi)
library(rrapply)
library(tibble)
simulate_data <- function(nrows) {
  split_func <- function(x, n) {
    unname(split(x, rep_len(1:n, length(x))))
  }
  randomly_subset_vec <- function(x) {
    sample(x, sample(length(x), 1))
  }
  tibble::tibble(
    col_a = rrapply(
      object = split_func(
        x = setNames(1:(nrows * 5),
                     stringi::stri_rand_strings(nrows * 5, 2)),
        n = nrows
      ),
      f = randomly_subset_vec
    ),
    col_b = runif(nrows)
  )
}
simulate data
set.seed(2021)
my_data <- simulate_data(3e6) # takes about 1 minute to run on my machine
my_data
## # A tibble: 3,000,000 x 2
## col_a col_b
## <list> <dbl>
## 1 <int [3]> 0.786
## 2 <int [5]> 0.0199
## 3 <int [2]> 0.468
## 4 <int [2]> 0.270
## 5 <int [3]> 0.709
## 6 <int [2]> 0.643
## 7 <int [2]> 0.0837
## 8 <int [4]> 0.159
## 9 <int [2]> 0.429
## 10 <int [2]> 0.919
## # ... with 2,999,990 more rows
the actual problem
I want to mutate a new column (`length_col_a`) that holds the length of each vector in `col_a`. I'm going to do this twice: first with `purrr::map_int()` and then with `furrr::future_map_int()`.
library(dplyr, warn.conflicts = T)
library(purrr)
library(furrr)
library(tictoc)
# first with purrr:
##################
tic()
my_data %>%
  mutate(length_col_a = map_int(.x = col_a, .f = ~ length(.x)))
## # A tibble: 3,000,000 x 3
## col_a col_b length_col_a
## <list> <dbl> <int>
## 1 <int [3]> 0.786 3
## 2 <int [5]> 0.0199 5
## 3 <int [2]> 0.468 2
## 4 <int [2]> 0.270 2
## 5 <int [3]> 0.709 3
## 6 <int [2]> 0.643 2
## 7 <int [2]> 0.0837 2
## 8 <int [4]> 0.159 4
## 9 <int [2]> 0.429 2
## 10 <int [2]> 0.919 2
## # ... with 2,999,990 more rows
toc()
## 6.16 sec elapsed
# and now with furrr:
####################
future::plan(future::multisession, workers = 2)
tic()
my_data %>%
  mutate(length_col_a = future_map_int(col_a, length))
## # A tibble: 3,000,000 x 3
## col_a col_b length_col_a
## <list> <dbl> <int>
## 1 <int [3]> 0.786 3
## 2 <int [5]> 0.0199 5
## 3 <int [2]> 0.468 2
## 4 <int [2]> 0.270 2
## 5 <int [3]> 0.709 3
## 6 <int [2]> 0.643 2
## 7 <int [2]> 0.0837 2
## 8 <int [4]> 0.159 4
## 9 <int [2]> 0.429 2
## 10 <int [2]> 0.919 2
## # ... with 2,999,990 more rows
toc()
## 10.95 sec elapsed
I know `tictoc` isn't the most accurate way to benchmark, but still -- `furrr` is supposed to be faster (as the vignette suggests), and here it isn't. I've made sure the data isn't grouped, since the package author explained that `furrr` doesn't work well with grouped data. So what other explanation could there be for `furrr` being slower than (or barely faster than) `purrr`?
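(As an aside, a slightly more controlled comparison could be done with `bench::mark()` instead of `tictoc` -- this is just a sketch, assuming the bench package is installed; it runs each expression a few times and verifies that both return the same tibble.)

library(bench)
future::plan(future::multisession, workers = 2)

bench::mark(
  purrr = my_data %>% mutate(length_col_a = map_int(col_a, length)),
  furrr = my_data %>% mutate(length_col_a = future_map_int(col_a, length)),
  iterations = 3,  # a few runs instead of a single tic()/toc() reading
  memory = FALSE   # skip memory profiling; the intermediate copies are large
)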
EDIT
I found this issue on `furrr`'s GitHub repo that discusses almost the same problem. However, that case is different: in the GitHub issue, the function being mapped is a user-defined function that requires attaching additional packages, so the author explains that each `furrr` worker has to attach those packages before doing the calculation. By contrast, I map `length()` from base R, so practically there should be no overhead from attaching packages.
In addition, the author suggests that problems may arise because `plan(multisession)` wasn't working properly from RStudio, and that updating the `parallelly` package to the development version solves that problem:
remotes::install_github("HenrikBengtsson/parallelly", ref = "develop")
Unfortunately, this update didn't make any difference in my case.
As I have argued in the comments to the original post, my suspicion is that there is overhead caused by distributing the very large dataset to the workers.
To substantiate this suspicion, I used the same code as the OP with a single modification: I added a delay of 0.000001 seconds inside the mapped function (see `f` below). The results were: purrr --> 192.45 sec and furrr --> 44.707 sec (8 workers). The time taken by `furrr` was about 1/4 of the `purrr` time -- still very far from the 1/8 you would expect from 8 workers!
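A rough way to see how large that distribution cost can be is to time how long it takes just to serialize the list-column once -- roughly the price of shipping it out to the workers, paid no matter how trivial the mapped function is. This is only a sketch, using base R's `serialize()` and `object.size()`:

tic()
payload <- serialize(my_data$col_a, connection = NULL)  # what the workers would have to receive
toc()
format(object.size(payload), units = "Mb")  # size of the serialized payload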
My code is below, as requested by the OP:
library(stringi)
library(rrapply)
library(tibble)
simulate_data <- function(nrows) {
  split_func <- function(x, n) {
    unname(split(x, rep_len(1:n, length(x))))
  }
  randomly_subset_vec <- function(x) {
    sample(x, sample(length(x), 1))
  }
  tibble::tibble(
    col_a = rrapply(
      object = split_func(
        x = setNames(1:(nrows * 5),
                     stringi::stri_rand_strings(nrows * 5, 2)),
        n = nrows
      ),
      f = randomly_subset_vec
    ),
    col_b = runif(nrows)
  )
}
set.seed(2021)
my_data <- simulate_data(3e6) # takes about 1 minute to run on my machine
my_data
library(dplyr, warn.conflicts = T)
library(purrr)
library(furrr)
library(tictoc)
# first with purrr:
##################
######## ----> DELAY <---- ########
f <- function(x) {Sys.sleep(0.000001); length(x)}
tic()
my_data %>%
  mutate(length_col_a = map_int(.x = col_a, .f = ~ f(.x)))
toc()
plan(multisession, workers = 8)
tic()
my_data %>%
  mutate(length_col_a = future_map_int(col_a, f))
toc()
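Finally, a side note rather than an answer to the parallelisation question: for this particular operation the mapping can be skipped entirely, because base R's `lengths()` is vectorised over the elements of a list-column. When the per-element work is as cheap as `length()`, something along these lines should beat both mapped versions simply by avoiding the per-element calls (and, in the `furrr` case, the data transfer):

tic()
my_data %>%
  mutate(length_col_a = lengths(col_a))  # vectorised, no map needed
toc()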