rdplyr

How can I mutate many columns in dplyr without repeating mutate many times?


I am in the process of writing a very not-DRY dplyr chain in R. I need to call dplyr::mutate() and dplyr::percent_rank() function on a lot of columns from my dataframe, and it would be helpful for me to not have a line of code for each call. My dataframe columns that I need to compute percentiles for have the following pattern:

regions <- c("atr2", "sht2", "mid2", "lng2", "all2", "sht3", "lng3", "all3")
suffixes <- c("Made", "Att", "AttFreq", "Pct")
for(i in regions) {
  for(j in suffixes) {
    print(paste0(i, j))
  }
}

In the example above I need 8 * 4 == 32 different percentile columns. All 32 initial columns atr2Made, atr2Att, etc. are already in my dataframe. To compute the percentiles, I have been doing the following:

pctile.lineup.data <- pctile.lineup.data %>%
    dplyr::group_by(season) %>%
    # dplyr::group_by(season, homeConfId) %>%
    dplyr::mutate(atr2MadeRankNcaa = round(100 * dplyr::percent_rank(atr2Made))) %>%
    dplyr::mutate(atrAttRankNcaa = round(100 * dplyr::percent_rank(atr2Att))) %>%
    dplyr::mutate(atr2AttFreqRankNcaa = round(100 * dplyr::percent_rank(atr2AttFreq))) %>%
    dplyr::mutate(atr2PctRankNcaa = round(100 * dplyr::percent_rank(atr2Pct))) %>%
    dplyr::mutate(sht2MadeRankNcaa = round(100 * dplyr::percent_rank(sht2Made))) %>%
    dplyr::mutate(shtAttRankNcaa = round(100 * dplyr::percent_rank(sht2Att))) %>%
    dplyr::mutate(sht2AttFreqRankNcaa = round(100 * dplyr::percent_rank(sht2AttFreq))) %>%
    dplyr::mutate(sht2PctRankNcaa = round(100 * dplyr::percent_rank(sht2Pct))) %>%
    dplyr::mutate(mid2MadeRankNcaa = round(100 * dplyr::percent_rank(mid2Made))) %>%
    dplyr::mutate(midAttRankNcaa = round(100 * dplyr::percent_rank(mid2Att))) %>%
    dplyr::mutate(mid2AttFreqRankNcaa = round(100 * dplyr::percent_rank(mid2AttFreq))) %>%
    dplyr::mutate(mid2PctRankNcaa = round(100 * dplyr::percent_rank(mid2Pct))) %>%
    ... %>%
    dplyr::ungroup()

Not only do I need 32 different mutate() functions, I need to run this code twice for the 2 different group_by()s (see the 2nd one commented out). Is there a better way than 64 lines of code? I have a separate datarame that has 21 regions instead of 8, with the same 4 suffixes, and same 2 group_by()s, therefore it would require 21 * 4 * 2 == 168 lines of code to compute these percentiles. This isn't DRY - please help!

Edit: I am obviously looking into mutate_at, however I am not very familiar / good with the _at version of mutate. There are other columns in my dataframe besides these 32, and so I don't think mutate_all would work.


Solution

  • As of dplyr version 1.0.0, the scoped versions of verbs such as mutate_at which the earlier version of this answer discussed have been deprecated in favor of the dplyr::across function, which is simpler and lets you do this right in dplyr::mutate without using a separate function. Starting with some sample data:

    df <- data.frame(name = LETTERS[1:5],
                     item1 = rnorm(5, mean=2),
                     item2 = rnorm(5, mean=5),
                     item3 = rnorm(5, mean=7))
    

    The dplyr::across function goes inside mutate and accepts 2 main arguments:

    df %>%
        mutate(across(one_of('item1', 'item2'),
                      .fns = list(rounded = ~ round(100 * percent_rank(.x)))))
    
      name     item1    item2    item3 item1_rounded item2_rounded
    1    A 2.0825275 6.445983 7.373511            50            75
    2    B 1.2568069 4.715137 8.282489            25            50
    3    C 3.8895454 6.486809 5.426263           100           100
    4    D 0.6094173 3.645558 6.975673             0             0
    5    E 2.1202091 4.488883 6.168427            75            25
    

    If you want to apply multiple functions to these columns, just add more functions to the list.

    Since the function in .fns is named (rounded = ...), the result of that operation is made into new variables with the with that name as a suffix. If it was unnamed, then the output would be numbered (ie. item1_1 and item2_1)

    You can also use the new .names argument to provide a glue description to specify how the new column names are assembled