Tags: r, purrr, furrr

Why is `furrr::future_map_int()` slower than `purrr::map_int()` when I use `dplyr::mutate()`?


I have a tibble that includes a list-column of vectors. I want to create a new column that holds the length of each vector. Since this dataset is large (3M rows), I hoped to shave off some processing time by using the furrr package. However, it seems that purrr is faster than furrr. How come?

To demonstrate the problem, I first simulate some data. You don't need to understand the simulation code, as it's irrelevant to the question.


data simulation function

library(stringi)
library(rrapply)
library(tibble)

simulate_data <- function(nrows) {
  split_func <- function(x, n) {
    unname(split(x, rep_len(1:n, length(x))))
  }
  
  randomly_subset_vec <- function(x) {
    sample(x, sample(length(x), 1))
  }
  
  tibble::tibble(
    col_a = rrapply(object = split_func(
      x = setNames(1:(nrows * 5),
                   stringi::stri_rand_strings(nrows * 5,
                                              2)),
      n = nrows
    ),
    f      = randomly_subset_vec),
    col_b = runif(nrows)
  )
  
} 

simulate data

set.seed(2021)

my_data <- simulate_data(3e6) # takes about 1 minute to run on my machine

my_data
## # A tibble: 3,000,000 x 2
##    col_a      col_b
##    <list>     <dbl>
##  1 <int [3]> 0.786 
##  2 <int [5]> 0.0199
##  3 <int [2]> 0.468 
##  4 <int [2]> 0.270 
##  5 <int [3]> 0.709 
##  6 <int [2]> 0.643 
##  7 <int [2]> 0.0837
##  8 <int [4]> 0.159 
##  9 <int [2]> 0.429 
## 10 <int [2]> 0.919 
## # ... with 2,999,990 more rows

the actual problem
I want to add a new column (length_col_a) that holds the length of each element of col_a. I'm going to do this twice: first with purrr::map_int() and then with furrr::future_map_int().

library(dplyr, warn.conflicts = TRUE)
library(purrr)
library(furrr)
library(tictoc)

# first with purrr:
##################
tic()
my_data %>%
  mutate(length_col_a = map_int(.x = col_a, .f = ~length(.x)))

## # A tibble: 3,000,000 x 3
##    col_a      col_b length_col_a
##    <list>     <dbl>        <int>
##  1 <int [3]> 0.786             3
##  2 <int [5]> 0.0199            5
##  3 <int [2]> 0.468             2
##  4 <int [2]> 0.270             2
##  5 <int [3]> 0.709             3
##  6 <int [2]> 0.643             2
##  7 <int [2]> 0.0837            2
##  8 <int [4]> 0.159             4
##  9 <int [2]> 0.429             2
## 10 <int [2]> 0.919             2
## # ... with 2,999,990 more rows
toc()
## 6.16 sec elapsed


# and now with furrr:
####################
future::plan(future::multisession, workers = 2)

tic()
my_data %>%
  mutate(length_col_a = future_map_int(col_a, length))
## # A tibble: 3,000,000 x 3
##    col_a      col_b length_col_a
##    <list>     <dbl>        <int>
##  1 <int [3]> 0.786             3
##  2 <int [5]> 0.0199            5
##  3 <int [2]> 0.468             2
##  4 <int [2]> 0.270             2
##  5 <int [3]> 0.709             3
##  6 <int [2]> 0.643             2
##  7 <int [2]> 0.0837            2
##  8 <int [4]> 0.159             4
##  9 <int [2]> 0.429             2
## 10 <int [2]> 0.919             2
## # ... with 2,999,990 more rows
toc()
## 10.95 sec elapsed

I know tictoc isn't the most accurate way to benchmark, but still -- furrr is supposed to be simply faster (as the vignette suggests), and here it isn't. I've made sure that the data isn't grouped, since the author explained that furrr doesn't work well with grouped data. What else could explain furrr being slower (or not much faster) than purrr?


EDIT


I found this issue on furrr's GitHub repo that discusses almost the same problem. However, the case is different. In the GitHub issue, the function being mapped is a user-defined function that requires attaching additional packages, so the author explains that each furrr worker has to attach the required packages before doing the calculation. By contrast, I map the length() function from base R, so in practice there should be no overhead from attaching packages.

In addition, the author suggests that problems may arise because plan(multisession) wasn't working in RStudio, and that updating the parallelly package to the development version solves this:

remotes::install_github("HenrikBengtsson/parallelly", ref="develop")

Unfortunately, this update didn't make any difference in my case.


Solution

  • As I argued in the comments on the original post, my suspicion is that the overhead comes from distributing the very large dataset to the workers.

    To substantiate this, I ran the OP's code with a single modification: I added a delay of 0.000001 seconds (via Sys.sleep()) to the mapped function. The results were purrr: 192.45 sec and furrr: 44.707 sec (8 workers). furrr took only about 1/4 of the time taken by purrr, which is very far from the ideal 1/8!
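
    To put numbers on that gap, here is a small back-of-the-envelope calculation using only the two timings reported above. The "overhead" figure rests on an assumption made purely for illustration: that the mapped work itself would split perfectly across the 8 workers.

    ```r
    # Timings reported above, in seconds; worker count as in plan(multisession, workers = 8).
    t_purrr <- 192.45
    t_furrr <- 44.707
    workers <- 8

    speedup    <- t_purrr / t_furrr   # how many times faster furrr was
    efficiency <- speedup / workers   # share of the ideal 8x speedup actually achieved

    # Assumption (illustrative only): the mapped work parallelizes perfectly,
    # so any time beyond t_purrr / workers is counted as overhead.
    ideal_time <- t_purrr / workers
    overhead   <- t_furrr - ideal_time

    print(round(c(speedup = speedup, efficiency = efficiency,
                  ideal_time = ideal_time, overhead = overhead), 2))
    ```

    This gives a speedup of about 4.3x, i.e. roughly 54% parallel efficiency. Under the stated assumption, about 20.7 of the 44.7 seconds are not spent on the mapped function at all.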

    My code is below, as requested by the OP:

    library(stringi)
    library(rrapply)
    library(tibble)
    
    simulate_data <- function(nrows) {
      split_func <- function(x, n) {
        unname(split(x, rep_len(1:n, length(x))))
      }
      
      randomly_subset_vec <- function(x) {
        sample(x, sample(length(x), 1))
      }
      
      tibble::tibble(
        col_a = rrapply(object = split_func(
          x = setNames(1:(nrows * 5),
                       stringi::stri_rand_strings(nrows * 5,
                                                  2)),
          n = nrows
        ),
        f      = randomly_subset_vec),
        col_b = runif(nrows)
      )
      
    } 
    
    set.seed(2021)
    
    my_data <- simulate_data(3e6) # takes about 1 minute to run on my machine
    
    my_data
    
    library(dplyr, warn.conflicts = TRUE)
    library(purrr)
    library(furrr)
    library(tictoc)
    
    # first with purrr:
    ##################
    
    ######## ---->  DELAY <---- ########
    f <- function(x) {Sys.sleep(0.000001); length(x)}
    
    tic()
    my_data %>%
      mutate(length_col_a = map_int(.x = col_a, .f = ~ f(.x)))
    toc()
    
    plan(multisession, workers = 8)
    
    tic()
    my_data %>%
      mutate(length_col_a = future_map_int(col_a, f))
    toc()
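
A rough way to look at the data-distribution cost in isolation, using only base R, is sketched below. It is not a measurement of furrr itself. It assumes that shipping data to a multisession worker involves at least one serialize/unserialize round trip, and it ignores the socket transfer, so it should be read as a lower bound. The list `x` is a smaller, hypothetical stand-in for col_a (unnamed integer vectors, whereas the vectors in the OP's data also carry names, which makes them heavier to serialize).

```r
# Hypothetical stand-in for col_a: 100,000 short integer vectors of length 1 to 5.
set.seed(1)
n <- 1e5
x <- replicate(n, sample.int(100L, sample.int(5L, 1L)), simplify = FALSE)

# The per-element work itself: one length() call per vector.
t_compute <- system.time(
  len <- vapply(x, length, integer(1))
)[["elapsed"]]

# Approximate the cost of handing the data to another R process:
# serialize to a raw vector, then rebuild the list from it.
t_ship <- system.time({
  payload <- serialize(x, connection = NULL)
  x_copy  <- unserialize(payload)
})[["elapsed"]]

cat(sprintf("compute lengths:      %.3f s\n", t_compute))
cat(sprintf("serialize round trip: %.3f s (payload %.1f MB)\n",
            t_ship, length(payload) / 1e6))

# For this specific task, base R's vectorized lengths() returns the same result.
stopifnot(identical(len, lengths(x)))
```

Comparing the two timings on your own machine shows how the cost of moving the data relates to the cost of the computation. When the mapped function is as cheap as length(), there is very little work for the workers to win back, which is consistent with furrr only pulling ahead once an artificial delay is added.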