I am in the process of writing a very not-DRY dplyr chain in R. I need to call dplyr::mutate() and dplyr::percent_rank()
function on a lot of columns from my dataframe, and it would be helpful for me to not have a line of code for each call. My dataframe columns that I need to compute percentiles for have the following pattern:
regions <- c("atr2", "sht2", "mid2", "lng2", "all2", "sht3", "lng3", "all3")
suffixes <- c("Made", "Att", "AttFreq", "Pct")
for(i in regions) {
for(j in suffixes) {
print(paste0(i, j))
}
}
In the example above I need 8 * 4 == 32
different percentile columns. All 32 initial columns atr2Made
, atr2Att
, etc. are already in my dataframe. To compute the percentiles, I have been doing the following:
pctile.lineup.data <- pctile.lineup.data %>%
dplyr::group_by(season) %>%
# dplyr::group_by(season, homeConfId) %>%
dplyr::mutate(atr2MadeRankNcaa = round(100 * dplyr::percent_rank(atr2Made))) %>%
dplyr::mutate(atrAttRankNcaa = round(100 * dplyr::percent_rank(atr2Att))) %>%
dplyr::mutate(atr2AttFreqRankNcaa = round(100 * dplyr::percent_rank(atr2AttFreq))) %>%
dplyr::mutate(atr2PctRankNcaa = round(100 * dplyr::percent_rank(atr2Pct))) %>%
dplyr::mutate(sht2MadeRankNcaa = round(100 * dplyr::percent_rank(sht2Made))) %>%
dplyr::mutate(shtAttRankNcaa = round(100 * dplyr::percent_rank(sht2Att))) %>%
dplyr::mutate(sht2AttFreqRankNcaa = round(100 * dplyr::percent_rank(sht2AttFreq))) %>%
dplyr::mutate(sht2PctRankNcaa = round(100 * dplyr::percent_rank(sht2Pct))) %>%
dplyr::mutate(mid2MadeRankNcaa = round(100 * dplyr::percent_rank(mid2Made))) %>%
dplyr::mutate(midAttRankNcaa = round(100 * dplyr::percent_rank(mid2Att))) %>%
dplyr::mutate(mid2AttFreqRankNcaa = round(100 * dplyr::percent_rank(mid2AttFreq))) %>%
dplyr::mutate(mid2PctRankNcaa = round(100 * dplyr::percent_rank(mid2Pct))) %>%
... %>%
dplyr::ungroup()
Not only do I need 32 different mutate()
functions, I need to run this code twice for the 2 different group_by()
s (see the 2nd one commented out). Is there a better way than 64 lines of code? I have a separate datarame that has 21 regions instead of 8, with the same 4 suffixes, and same 2 group_by()s, therefore it would require 21 * 4 * 2 == 168 lines of code to compute these percentiles. This isn't DRY - please help!
Edit: I am obviously looking into mutate_at
, however I am not very familiar / good with the _at
version of mutate. There are other columns in my dataframe besides these 32, and so I don't think mutate_all
would work.
As of dplyr version 1.0.0, the scoped versions of verbs such as mutate_at
which the earlier version of this answer discussed have been deprecated in favor of the dplyr::across
function, which is simpler and lets you do this right in dplyr::mutate
without using a separate function. Starting with some sample data:
df <- data.frame(name = LETTERS[1:5],
item1 = rnorm(5, mean=2),
item2 = rnorm(5, mean=5),
item3 = rnorm(5, mean=7))
The dplyr::across
function goes inside mutate
and accepts 2 main arguments:
.cols
argument that accepts the selector functions used by dplyr::select
. In this case we use one_of
to provide a list of variables, but we could simplify this by using contains
or starts_with
if there's a pattern to the variables.fns
argument where we put the function or functions to be applied to each of these columns. This can either be a function object (ie mean
), a function call (ie. ~mean(.x)
or function(x) mean(x)
) or a list of either.df %>%
mutate(across(one_of('item1', 'item2'),
.fns = list(rounded = ~ round(100 * percent_rank(.x)))))
name item1 item2 item3 item1_rounded item2_rounded
1 A 2.0825275 6.445983 7.373511 50 75
2 B 1.2568069 4.715137 8.282489 25 50
3 C 3.8895454 6.486809 5.426263 100 100
4 D 0.6094173 3.645558 6.975673 0 0
5 E 2.1202091 4.488883 6.168427 75 25
If you want to apply multiple functions to these columns, just add more functions to the list.
Since the function in .fns
is named (rounded = ...
), the result of that operation is made into new variables with the with that name as a suffix. If it was unnamed, then the output would be numbered (ie. item1_1
and item2_1
)
You can also use the new .names
argument to provide a glue description to specify how the new column names are assembled