
Selecting and grouping multiple columns in dtplyr vs dplyr


I'd like to group by several variables in dtplyr inside an lapply loop, but the syntax that works in dplyr fails once I call lazy_dt().

library(dplyr)
mycolumns <- c("Wind", "Month", "Ozone", "Solar.R")
columnpairs <- as.data.frame(combn(mycolumns, 2))

#         V1    V2      V3    V4      V5      V6
#    1  Wind  Wind    Wind Month   Month   Ozone
#    2 Month Ozone Solar.R Ozone Solar.R Solar.R

result_dplyr <- lapply(columnpairs, function(x) {
  airquality %>%
    select(all_of(x)) %>%
    group_by(across(all_of(x))) %>%
    filter(n() > 1)
})

$V1
# A tibble: 105 x 2
# Groups:   Wind, Month [40]
    Wind Month
   <dbl> <int>
 1   7.4     5
 2   8       5
 3  11.5     5
 4  14.9     5
 5   8.6     5
 6   8.6     5
 7   9.7     5
 8  11.5     5
 9  12       5
10  11.5     5
# ... with 95 more rows

The same code fails once the data is wrapped with dtplyr's lazy_dt():

library(dtplyr)
airq <- lazy_dt(airquality)

lapply(columnpairs, function(x) {
  airq %>%
    select(all_of(x)) %>%
    group_by(across(all_of(x))) %>%
    filter(n() > 1)
})

Error in `all_of()`:
! object 'x' not found

Any idea why this happens, or how to work around it?

EDIT: issue created at https://github.com/tidyverse/dtplyr/issues/383


Solution

  • The error appears to come from dtplyr's method for group_by, group_by.dtplyr_step.

    > methods('group_by')
    [1] group_by.data.frame*  group_by.data.table*  group_by.dtplyr_step*
    

    I am not sure whether this is a bug. The traceback points at the failing call (marked ###):

    > traceback()
    ...
    6: group_by.dtplyr_step(., across(all_of(.x)))  ###
    5: group_by(., across(all_of(.x)))
    4: filter(., n() > 1)
    3: airq %>% select(all_of(.x)) %>% group_by(across(all_of(.x))) %>% 
           filter(n() > 1)
    2: .f(.x[[i]], ...)
    1: map(columnpairs, ~airq %>% select(all_of(.x)) %>% group_by(across(all_of(.x))) %>% 
           filter(n() > 1))
    

    Here are two approaches that work:

    1. Using the deprecated group_by_at
    2. Converting the column names to symbols with rlang::syms() and splicing them in with !!!

    Using group_by_at
    library(dtplyr)
    library(purrr)
    library(dplyr)
    map(columnpairs, ~ airq %>%
            select(all_of(.x)) %>%
            group_by_at(all_of(.x)) %>%
            filter(n() > 1))
    $V1
    Source: local data table [105 x 2]
    Groups: Wind, Month
    Call:
      _DT2 <- `_DT1`[, .(Wind, Month)]
      `_DT2`[`_DT2`[, .I[.N > 1], by = .(Wind, Month)]$V1]
    
       Wind Month
      <dbl> <int>
    1   7.4     5
    2   7.4     5
    3   8       5
    4   8       5
    5  11.5     5
    6  11.5     5
    # … with 99 more rows
    ...
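
The `Call:` section in the output above is the data.table code that dtplyr generates. As a rough illustration of what that call does, here is a hand-written data.table sketch for one column pair. The names `dt`, `cols`, `sub`, `keep` and `res` are made up for this example, and the code is only meant to mirror the generated call, not to reproduce dtplyr's internals.

```r
library(data.table)

# Assumption: a single column pair, named `cols` here for illustration
dt   <- as.data.table(airquality)
cols <- c("Wind", "Month")

# Step 1: keep only the two columns (the `..` prefix looks up `cols`
# in the calling scope rather than among the columns of `dt`)
sub <- dt[, ..cols]

# Step 2: per group, return the row indices (.I) only when the group
# has more than one row (.N > 1); the unnamed result column is V1
keep <- sub[, .I[.N > 1], by = cols]$V1

# Step 3: subset the table by those row indices
res <- sub[keep]
nrow(res)
```

Because `by` accepts a character vector here, no tidy evaluation is involved, which is why the column names can be passed around as plain strings.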
    
    
    Converting to symbols and splicing with !!!
    map(columnpairs, ~ airq %>% 
          select(all_of(.x)) %>%
          group_by(!!! rlang::syms(.x)) %>% 
          filter(n() > 1))
    $V1
    Source: local data table [105 x 2]
    Groups: Wind, Month
    Call:
      _DT20 <- `_DT1`[, .(Wind, Month)]
      `_DT20`[`_DT20`[, .I[.N > 1], by = .(Wind, Month)]$V1]
    
       Wind Month
      <dbl> <int>
    1   7.4     5
    2   7.4     5
    3   8       5
    4   8       5
    5  11.5     5
    6  11.5     5
    # … with 99 more rows
    
    # Use as.data.table()/as.data.frame()/as_tibble() to access results
    
    $V2
    ...
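
For reuse, the second approach can be wrapped in a small function and the lazy result collected with as_tibble(), as the printed hint suggests. This is a minimal sketch: `duplicated_groups` is a hypothetical helper name, not something dtplyr provides, and `stringsAsFactors = FALSE` is added only so the column names stay character vectors on older R versions.

```r
library(dplyr)
library(dtplyr)

mycolumns   <- c("Wind", "Month", "Ozone", "Solar.R")
columnpairs <- as.data.frame(combn(mycolumns, 2), stringsAsFactors = FALSE)
airq        <- lazy_dt(airquality)

# Hypothetical helper: keep the rows whose combination of values in
# `cols` occurs more than once, then collect the result as a tibble.
duplicated_groups <- function(data, cols) {
  data %>%
    select(all_of(cols)) %>%
    group_by(!!!rlang::syms(cols)) %>%  # splice the names in as symbols
    filter(n() > 1) %>%
    as_tibble()                         # forces evaluation of the lazy query
}

result_dtplyr <- lapply(columnpairs, function(x) duplicated_groups(airq, x))
```

The `!!!` splice is resolved when group_by() captures its arguments, so the translated data.table call contains the actual column names rather than a reference to the function argument.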