r dplyr

dplyr group by external variable


I have legacy code that makes extensive use of dplyr to group by programmatically specified variables in ways that are now either deprecated or superseded. Reproducible examples are given below. I would like to update this code to a stable form so that it continues to work with future versions of dplyr. Several alternative methods can be shown to give the same result as the original code in a simple case, but I would like to know whether they are truly equivalent in edge cases.

Skipping over the early years, which necessitated the use of quo, enquo, sym, !!, !!! etc. to get around the challenges of programming with NSE, the first example is group_by_(), as in:

library(dplyr)
Var1 <- "gear"
Var2 <- "cyl"

test1 <- mtcars %>%
  group_by_(Var1, Var2) %>%
  summarise(Mean_mpg = mean(mpg))

This worked fine, and still appears to do so, but it raises a warning that group_by_() was deprecated in dplyr 0.7.0.

The next option used in some legacy code is:

test2 <- mtcars %>%
  group_by_at(c(Var1, Var2)) %>%
  summarise(Mean_mpg = mean(mpg))

This also runs OK, but the documentation lists this as superseded, and suggests the use of across(). Following that advice:

test3 <- mtcars %>%
  group_by(across(c(Var1, Var2))) %>%
  summarise(Mean_mpg = mean(mpg))

This works, but gives the warning: "Using an external vector in selections was deprecated in tidyselect 1.1.0. ℹ Please use all_of() or any_of() instead."

Taking this advice (maybe should be worded "as well" not "instead"?):

test4 <- mtcars %>%
  group_by(across(all_of(c(Var1, Var2)))) %>%
  summarise(Mean_mpg = mean(mpg))
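One edge case where the two helpers named in that warning differ is a name that is not a column in the data. The sketch below is only an illustration: `group_vars` and the name "not_a_column" are hypothetical and were chosen to trigger the difference. As far as I understand tidyselect, all_of() is strict and errors on a missing name, while any_of() silently drops it.

```r
library(dplyr)

# Hypothetical input: the second name does not exist in mtcars
group_vars <- c("gear", "not_a_column")

# all_of() is strict: a missing column raises an error, caught here
strict <- tryCatch(
  mtcars %>%
    group_by(across(all_of(group_vars))) %>%
    summarise(Mean_mpg = mean(mpg), .groups = "drop"),
  error = function(e) "error"
)

# any_of() is lenient: the missing name is ignored, so only gear is used
lenient <- mtcars %>%
  group_by(across(any_of(group_vars))) %>%
  summarise(Mean_mpg = mean(mpg), .groups = "drop")

strict          # the string "error"
names(lenient)  # grouping falls back to gear alone
```

For production code, the strict behaviour of all_of() is probably the safer default, since a misspelled grouping variable fails loudly rather than changing the grouping.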

The "Programming with dplyr" vignette introduces yet another way of doing this:

test5 <- mtcars %>%
  group_by(across(c({{Var1}}, {{Var2}}))) %>%
  summarise(Mean_mpg = mean(mpg))

All five of these give identical results in dplyr version 1.1.4 for this simple case:

sapply(list(test2, test3, test4, test5), identical, test1)

I appreciate that across() etc. have widespread other uses, but just for the purpose of passing a small number of variables to a grouping function, are there specific under-the-hood reasons (performance, error-trapping etc.) that mean that working production code of the form in test1 and test2 should be updated, and if so what is the latest preferred form? In other words, is:

group_by_(Var1, Var2)

the same as:

group_by(across(all_of(c(Var1, Var2))))?

Also, I know this is an impossible question to answer definitively, but does anyone have an inside track on how long group_by_() and group_by_at() are likely to be around, i.e. at what point will legacy code containing these start to fail?


Solution

  • In your use case, you don't need group_by() at all: since dplyr 1.1.0 you can pass .by = to summarise() instead. You don't need across() either, because .by accepts tidyselect helpers such as all_of() directly. So this is an option:

    library(dplyr)
    
    Var1 <- "gear"
    Var2 <- "cyl"
    
    mtcars |>
      summarise(Mean_mpg = mean(mpg), .by = all_of(c(Var1, Var2)))
    
    #   gear cyl Mean_mpg
    # 1    4   6   19.750
    # 2    4   4   26.925
    # 3    3   6   19.750
    # 4    3   8   15.050
    # 5    3   4   21.500
    # 6    5   4   28.200
    # 7    5   8   15.400
    # 8    5   6   19.700
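If the grouping columns arrive as a character vector from elsewhere in the code, the same idea can be wrapped in a small function. This is a minimal sketch, not a definitive implementation: `mean_mpg_by()` is a hypothetical helper name, and it assumes dplyr 1.1.0 or later for `.by`. The comparison with the group_by() form shows two differences worth knowing when migrating: `.by` keeps groups in order of first appearance (as in the output above) and always returns ungrouped data, whereas group_by() sorts the groups.

```r
library(dplyr)

# Hypothetical helper (not from the original post): mean mpg per group,
# with the grouping columns supplied as a character vector.
mean_mpg_by <- function(data, group_vars) {
  data |>
    summarise(Mean_mpg = mean(mpg), .by = all_of(group_vars))
}

by_result <- mean_mpg_by(mtcars, c("gear", "cyl"))

# The same summary via group_by(); .groups = "drop" returns ungrouped output
grouped_result <- mtcars |>
  group_by(across(all_of(c("gear", "cyl")))) |>
  summarise(Mean_mpg = mean(mpg), .groups = "drop")

# Row order differs between the two forms, so sort before comparing
by_sorted <- arrange(by_result, gear, cyl)
all.equal(by_sorted$Mean_mpg, grouped_result$Mean_mpg)
```

If downstream code relies on the sorted row order that group_by() produces, adding an explicit arrange() after the `.by` version keeps that behaviour.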