rdplyrtidyversetidyselect

Mutate multiple variable to create multiple new variables


Let's say I have a tibble where I need to take multiple variables and mutate them into new multiple new variables.

As an example, here is a simple tibble:

tb <- tribble(
  ~x, ~y1, ~y2, ~y3, ~z,
  1,2,4,6,2,
  2,1,2,3,3,
  3,6,4,2,1
)

I want to subtract variable z from every variable with a name starting with "y", and mutate the results as new variables of tb. Also, suppose I don't know how many "y" variables I have. I want the solution to fit nicely within tidyverse / dplyr workflow.

In essence, I don't understand how to mutate multiple variables into multiple new variables. I'm not sure if you can use mutate in this instance? I've tried mutate_if, but I don't think I'm using it right (and I get an error):

tb %>% mutate_if(starts_with("y"), funs(.-z))

#Error: No tidyselect variables were registered

Thanks in advance!


Solution

  • Update: dplyr 1.0.0+ has across() function which simplifies this task even further

    Basic usage

    across() has two primary arguments:

    • The first argument, .cols, selects the columns you want to operate on. It uses tidy selection (like select()) so you can pick variables by position, name, and type.
    • The second argument, .fns, is a function or list of functions to apply to each column. This can also be a purrr style formula (or list of formulas) like ~ .x / 2. (This argument is optional, and you can omit it if you just want to get the underlying data; you'll see that technique used in vignette("rowwise").)
    # Control how the names are created with the `.names` argument which 
    # takes a [glue](http://glue.tidyverse.org/) spec:
    tb %>% 
      mutate(
        across(starts_with("y"), ~ .x - z, .names = "mod_{col}")
      )
    #> # A tibble: 3 x 8
    #>       x    y1    y2    y3     z mod_y1 mod_y2 mod_y3
    #>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
    #> 1     1     2     4     6     2      0      2      4
    #> 2     2     1     2     3     3     -2     -1      0
    #> 3     3     6     4     2     1      5      3      1
    
    tb %>% 
      mutate(
        across(num_range(prefix = "y", range = 1:3), ~ .x - z, .names = "mod_{col}")
      )
    #> # A tibble: 3 x 8
    #>       x    y1    y2    y3     z mod_y1 mod_y2 mod_y3
    #>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
    #> 1     1     2     4     6     2      0      2      4
    #> 2     2     1     2     3     3     -2     -1      0
    #> 3     3     6     4     2     1      5      3      1
    
    ### Multiple functions
    tb %>% 
      mutate(
        across(c(matches("x"), contains("z")), ~ max(.x, na.rm = TRUE), .names = "max_{col}"),
        across(c(y1:y3), ~ .x - z, .names = "mod_{col}")
      )
    #> # A tibble: 3 x 10
    #>       x    y1    y2    y3     z max_x max_z mod_y1 mod_y2 mod_y3
    #>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
    #> 1     1     2     4     6     2     3     3      0      2      4
    #> 2     2     1     2     3     3     3     3     -2     -1      0
    #> 3     3     6     4     2     1     3     3      5      3      1
    

    Superseded approach:

    Because you are operating on column names, you need to use mutate_at rather than mutate_if which uses the values within columns

    tb %>% mutate_at(vars(starts_with("y")), funs(. - z))
    #> # A tibble: 3 x 5
    #>       x    y1    y2    y3     z
    #>   <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1     1     0     2     4     2
    #> 2     2    -2    -1     0     3
    #> 3     3     5     3     1     1
    

    To create new columns, instead of overwriting existing ones, we can give name to funs

    # add suffix
    tb %>% mutate_at(vars(starts_with("y")), funs(mod = . - z))
    #> # A tibble: 3 x 8
    #>       x    y1    y2    y3     z y1_mod y2_mod y3_mod
    #>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
    #> 1     1     2     4     6     2      0      2      4
    #> 2     2     1     2     3     3     -2     -1      0
    #> 3     3     6     4     2     1      5      3      1
    
    # remove suffix, add prefix
    tb %>%
      mutate_at(vars(starts_with("y")),  funs(mod = . - z)) %>%
      rename_at(vars(ends_with("_mod")), funs(paste("mod", gsub("_mod", "", .), sep = "_")))
    #> # A tibble: 3 x 8
    #>       x    y1    y2    y3     z mod_y1 mod_y2 mod_y3
    #>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
    #> 1     1     2     4     6     2      0      2      4
    #> 2     2     1     2     3     3     -2     -1      0
    #> 3     3     6     4     2     1      5      3      1
    

    Edit: In dplyr 0.8.0 or higher versions, funs() will be deprecated (source1 & source2), need to use list() instead

    tb %>% mutate_at(vars(starts_with("y")), list(~ . - z))
    #> # A tibble: 3 x 5
    #>       x    y1    y2    y3     z
    #>   <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1     1     0     2     4     2
    #> 2     2    -2    -1     0     3
    #> 3     3     5     3     1     1
    
    tb %>% mutate_at(vars(starts_with("y")), list(mod = ~ . - z))
    #> # A tibble: 3 x 8
    #>       x    y1    y2    y3     z y1_mod y2_mod y3_mod
    #>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
    #> 1     1     2     4     6     2      0      2      4
    #> 2     2     1     2     3     3     -2     -1      0
    #> 3     3     6     4     2     1      5      3      1
    
    tb %>%
      mutate_at(vars(starts_with("y")),  list(mod = ~ . - z)) %>%
      rename_at(vars(ends_with("_mod")), list(~ paste("mod", gsub("_mod", "", .), sep = "_")))
    #> # A tibble: 3 x 8
    #>       x    y1    y2    y3     z mod_y1 mod_y2 mod_y3
    #>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>  <dbl>  <dbl>
    #> 1     1     2     4     6     2      0      2      4
    #> 2     2     1     2     3     3     -2     -1      0
    #> 3     3     6     4     2     1      5      3      1
    

    Created on 2018-10-29 by the reprex package (v0.2.1)