I am trying to use R's dplyr
package to create multiple new columns for each year in my dataset that is the sum of the columns corresponding to each year's end of quarter figures (Mar, Jun, Sep, Dec). The only way I have been able to figure out how to do this "efficiently" is with a for loop. But something tells me that there's an alternative, more efficient, or better way of going about this (maybe I should be using a map function here, but I'm just not sure?). Here's a toy example that can be reproduced:
library(tidyverse)
library(glue)
# Create a toy example and print the resulting tibble
set.seed(100) # make results reproducible by setting seed
vars <- c("AgeGroup", paste0(month.abb[seq(3, 12, 3)], "_", rep(15:17, each = 4)))
(df <- cbind(LETTERS[1:5], matrix(rpois(n = (length(vars) - 1) * 5, 30), nrow = 5)) %>%
data.frame() %>%
setNames(vars) %>%
tibble() %>%
mutate(across(-1, as.integer))
)
which sets up the example/reproducible dataset as:
# A tibble: 5 × 13
AgeGroup Mar_15 Jun_15 Sep_15 Dec_15 Mar_16 Jun_16 Sep_16 Dec_16 Mar_17 Jun_17 Sep_17 Dec_17
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 A 27 26 33 36 34 25 27 37 37 32 37 30
2 B 21 32 24 31 25 39 32 20 30 32 25 26
3 C 34 28 30 23 25 29 35 26 19 30 28 29
4 D 30 32 29 34 31 29 35 37 28 34 31 50
5 E 31 33 27 31 23 26 29 28 28 26 19 37
So what I would like to do is create one new variable for each year ('15, '16, and '17) called sum_15
, sum_16
, and sum_17
which are the sum of all the months' values from variables that end with the corresponding two digit year (e.g. ends_with("15")
, ends_with("16")
, ends_with("17")
).
I have been able to achieve the desired result with the following code, but I'd rather not use a loop, if I can get away with the judicious application of an across
statement or perhaps a map
function (or some other method you all might suggest):
# This works, but I'd rather not use a for loop if I can avoid it:
for (i in 15:17) {
df <- df %>% mutate("sum_{i}" := rowSums(across(ends_with(glue("_{i}")))))
}
#write out the df that displays what I am trying to achieve
df %>% select(AgeGroup, starts_with("sum"))
# A tibble: 5 × 4
AgeGroup sum_15 sum_16 sum_17
<chr> <dbl> <dbl> <dbl>
1 A 122 123 136
2 B 108 116 113
3 C 115 115 106
4 D 125 132 143
5 E 122 106 110
I looked through other examples on SO, but all of the examples I have found are overly simplistic and seem to create only a single variable at a time by manually creating them in the mutate statement—something along the lines of:
df %>% mutate(sum15 = rowSums(across(ends_with("_15"))),
sum16 = rowSums(across(ends_with("_16"))),
sum17 = rowSums(across(ends_with("_17"))),
)
This is obviously not what I'm looking for since this is basically a more manual way of doing what I'm already doing with the for loop.
Could anyone offer any suggestions on how to improve this code and avoid the for loop?
Thank you so much!
Another way would be:
df %>%
pivot_longer(-AgeGroup, names_pattern = "(\\d+)")%>%
pivot_wider(values_fn = sum, names_prefix = 'Sum_')
# A tibble: 5 × 4
AgeGroup Sum_15 Sum_16 Sum_17
<chr> <int> <int> <int>
1 A 122 123 136
2 B 108 116 113
3 C 115 115 106
4 D 125 132 143
5 E 122 106 110
And then you can join to the original df
If you do not have an idea about the names_pattern
you can use names_sep
:
df %>%
pivot_longer(-AgeGroup, names_to = c(NA, 'name'), names_sep = "_")%>%
pivot_wider(values_fn = sum, names_prefix = 'Sum_')
# A tibble: 5 × 4
AgeGroup Sum_15 Sum_16 Sum_17
<chr> <int> <int> <int>
1 A 122 123 136
2 B 108 116 113
3 C 115 115 106
4 D 125 132 143
5 E 122 106 110
in base R you could do:
sapply(split.default(df[-1], sub(".*_", "Sum_", names(df)[-1])), rowSums)
Sum_15 Sum_16 Sum_17
[1,] 122 123 136
[2,] 108 116 113
[3,] 115 115 106
[4,] 125 132 143
[5,] 122 106 110
Which you could cbind to the original dataframe ie
cbind(df, sapply(split.default(df[-1], sub(".*_", "Sum_", names(df)[-1])), rowSums))