[ This is also reported on the multidplyr GitHub page ]
I'm trying to use multidplyr_0.0.0.9000 with dplyr_0.7.4.9000 and pmap_dfr from purrr_0.2.4.9000. The following code (without using multidplyr) works fine:
grid1 = as_tibble(expand.grid(m1 = c(1:10), m2 = c(20:30)))
retstuff = function(m1, m2) { return(tribble(~m3, ~m4, m1+1, m2+2)) }
pmap_dfr(grid1, retstuff)
When I try to partition the grid with multidplyr:
grid2 = partition(grid1, m1)
pmap_dfr(grid2, retstuff)
I get this error from pmap_dfr():
Error: Element 5 is not a vector (environment)
I also get the following warning from partition(), as also reported on GitHub:
group_indices_.grouped_df ignores extra arguments
Not sure if that's related or not.
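A few issues: the packages need to be loaded on each node, the function needs to be copied to each node, and the pmap_dfr call needs to be wrapped in dplyr::do, after which it works: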
library(tidyverse)
library(multidplyr)
grid1 <- as_tibble(expand.grid(m1 = c(1:10), m2 = c(20:30)))
retstuff <- function(m1, m2) {
    tribble( ~m3,    ~m4,
             m1 + 1, m2 + 2)
}
grid2 <- partition(grid1, m1)
#> Initialising 7 core cluster.
#> Warning: group_indices_.grouped_df ignores extra arguments
cluster_library(grid2, 'tidyverse') # load packages on each node
cluster_copy(grid2, retstuff) # copy function to each node
grid2 %>% do(pmap_dfr(., retstuff)) # wrap call in dplyr::do
#> Source: party_df [110 x 3]
#> Groups: m1
#> Shards: 7 [11--22 rows]
#>
#> # S3: party_df
#>       m1    m3    m4
#>    <int> <dbl> <dbl>
#>  1     9    10    22
#>  2     9    10    23
#>  3     9    10    24
#>  4     9    10    25
#>  5     9    10    26
#>  6     9    10    27
#>  7     9    10    28
#>  8     9    10    29
#>  9     9    10    30
#> 10     9    10    31
#> # ... with 100 more rows
...but for this particular case, while multidplyr is faster than plain pmap_dfr, plain dplyr::mutate is much faster still, and a lot easier to write:
grid1 %>% mutate(m3 = m1 + 1, m4 = m2 + 2)
#> # A tibble: 110 x 4
#>       m1    m2    m3    m4
#>    <int> <int> <dbl> <dbl>
#>  1     1    20     2    22
#>  2     2    20     3    22
#>  3     3    20     4    22
#>  4     4    20     5    22
#>  5     5    20     6    22
#>  6     6    20     7    22
#>  7     7    20     8    22
#>  8     8    20     9    22
#>  9     9    20    10    22
#> 10    10    20    11    22
#> # ... with 100 more rows
all.equal(grid2 %>% do(pmap_dfr(., retstuff)) %>% collect(),
          grid1 %>% mutate(m3 = m1 + 1, m4 = m2 + 2) %>% select(-m2))
#> [1] TRUE
microbenchmark::microbenchmark(
    multidplyr_pmap = grid2 %>% do(pmap_dfr(., retstuff)) %>% collect(),
    multidplyr_mutate = grid2 %>% mutate(m3 = m1 + 1, m4 = m2 + 2) %>% collect(),
    pmap = grid1 %>% pmap_dfr(retstuff),
    mutate = grid1 %>% mutate(m3 = m1 + 1, m4 = m2 + 2) %>% select(-m2)
)
#> Unit: milliseconds
#>               expr        min        lq       mean    median         uq       max neval
#>    multidplyr_pmap 113.896646 117.18365 122.656286 119.75652 125.874450 182.53330   100
#>  multidplyr_mutate  12.419918  12.84528  16.271337  13.68441  15.092482 177.77372   100
#>               pmap 372.512544 387.49371 397.844622 394.71971 402.640281 551.78633   100
#>             mutate   7.014426   7.49689   8.499588   7.66554   8.654478  32.22647   100
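The gap between pmap and mutate is probably easiest to understand by counting function calls. Here is a minimal sketch without multidplyr, assuming dplyr, purrr and tibble are installed. The names rowwise_result, vectorised_result, n_calls and same_values are made up for this illustration; transmute is used in place of mutate followed by select so that only the new columns are kept.

```r
library(dplyr)
library(purrr)
library(tibble)

grid1 <- as_tibble(expand.grid(m1 = 1:10, m2 = 20:30))
retstuff <- function(m1, m2) tribble(~m3, ~m4, m1 + 1, m2 + 2)

# Row-wise: retstuff() runs once per row of grid1, and every call
# builds a one-row tibble that then has to be bound to the others.
rowwise_result <- pmap_dfr(grid1, retstuff)

# Vectorised: two arithmetic operations on whole columns.
vectorised_result <- grid1 %>% transmute(m3 = m1 + 1, m4 = m2 + 2)

# Number of retstuff() calls the row-wise version needs.
n_calls <- nrow(grid1)

# Both approaches produce the same values, in the same row order.
same_values <- isTRUE(all.equal(rowwise_result$m3, vectorised_result$m3)) &&
    isTRUE(all.equal(rowwise_result$m4, vectorised_result$m4))
```

Splitting the work across nodes spreads those per-row calls out but does not remove them, which fits the benchmark above, where both mutate variants are well ahead of both pmap variants.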