r dplyr multidplyr

How to split by multiple columns when using multidplyr


tl;dr
How do I make "partition" from multidplyr split on multiple columns?

Motivation:
I was unhappy with using only 1 of 32 cores for a heavy summarise, so I am trying multidplyr. My grouping spans multiple columns.

Example:
The vignette shows partitioning by a single column, but when I do that, my other grouping columns are not considered.

Code:

library(dplyr)
library(multidplyr)
library(nycflights13)

flights1 <- partition(flights, flight)
flights2 <- summarise(flights1, dep_delay = mean(dep_delay, na.rm = TRUE))
flights3 <- collect(flights2)

So how about splitting on year, month, and day?

This doesn't work for me:

flights1 <- partition(flights, list(year, month, day))
flights2 <- summarise(flights1, dep_delay = mean(dep_delay, na.rm = TRUE))
flights3 <- collect(flights2)

I can't seem to make this work. Can you point to a proper or at least effective way to do this?


Solution

  • According to ?partition, the usage for partition is

    partition(.data, ..., cluster = get_default_cluster())

    where ... are variables to partition by. Instead of passing in a list of variables, pass in each variable separately, i.e.

    partition(flights, year, month, day)
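
    For completeness, here is a sketch of the whole pipeline. Note that the signature quoted above comes from an older multidplyr. As far as I know, in multidplyr 0.1.0 and later partition() takes only the data and a cluster, and the grouping is set with group_by() beforehand, so that all rows of a group end up on the same worker. The snippet below assumes that newer API; the worker count of 2 and the name "daily" are arbitrary choices for illustration.

    ```r
    library(dplyr)
    library(multidplyr)
    library(nycflights13)

    # Assumption: multidplyr >= 0.1.0, where new_cluster() starts the workers
    # and partition() takes the cluster instead of grouping variables.
    cluster <- new_cluster(2)  # 2 workers, an arbitrary choice

    daily <- flights %>%
      group_by(year, month, day) %>%   # group on all three columns first
      partition(cluster) %>%           # each (year, month, day) group stays on one worker
      summarise(dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
      collect() %>%                    # bring the per-worker results back
      arrange(year, month, day)        # row order after collect() is not guaranteed
    ```

    On the older API from the question, the equivalent first step would be partition(flights, year, month, day) as shown above, followed by the same summarise() and collect() calls.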