rdplyrcalculated-columnsmutaterowsum

Create multiple columns with R dplyr mutate with across instead of with a loop?


I am trying to use R's dplyr package to create multiple new columns for each year in my dataset that is the sum of the columns corresponding to each year's end of quarter figures (Mar, Jun, Sep, Dec). The only way I have been able to figure out how to do this "efficiently" is with a for loop. But something tells me that there's an alternative, more efficient, or better way of going about this (maybe I should be using a map function here, but I'm just not sure?). Here's a toy example that can be reproduced:

library(tidyverse)
library(glue)

# Create a toy example and print the resulting tibble
set.seed(100) # make results reproducible by setting seed
vars <- c("AgeGroup", paste0(month.abb[seq(3, 12, 3)], "_", rep(15:17, each = 4)))

(df <- cbind(LETTERS[1:5], matrix(rpois(n = (length(vars) - 1) * 5, 30), nrow = 5)) %>% 
    data.frame() %>%
    setNames(vars) %>% 
    tibble() %>% 
    mutate(across(-1, as.integer))
  )

which sets up the example/reproducible dataset as:

# A tibble: 5 × 13
  AgeGroup Mar_15 Jun_15 Sep_15 Dec_15 Mar_16 Jun_16 Sep_16 Dec_16 Mar_17 Jun_17 Sep_17 Dec_17
  <chr>     <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>  <int>
1 A            27     26     33     36     34     25     27     37     37     32     37     30
2 B            21     32     24     31     25     39     32     20     30     32     25     26
3 C            34     28     30     23     25     29     35     26     19     30     28     29
4 D            30     32     29     34     31     29     35     37     28     34     31     50
5 E            31     33     27     31     23     26     29     28     28     26     19     37

So what I would like to do is create one new variable for each year ('15, '16, and '17) called sum_15, sum_16, and sum_17 which are the sum of all the months' values from variables that end with the corresponding two digit year (e.g. ends_with("15"), ends_with("16"), ends_with("17")).

I have been able to achieve the desired result with the following code, but I'd rather not use a loop, if I can get away with the judicious application of an across statement or perhaps a map function (or some other method you all might suggest):

# This works, but I'd rather not use a for loop if I can avoid it:
for (i in 15:17) {
  df <- df %>% mutate("sum_{i}" := rowSums(across(ends_with(glue("_{i}")))))
}

#write out the df that displays what I am trying to achieve
df %>% select(AgeGroup, starts_with("sum"))

# A tibble: 5 × 4
  AgeGroup sum_15 sum_16 sum_17
  <chr>     <dbl>  <dbl>  <dbl>
1 A           122    123    136
2 B           108    116    113
3 C           115    115    106
4 D           125    132    143
5 E           122    106    110

I looked through other examples on SO, but all of the examples I have found are overly simplistic and seem to create only a single variable at a time by manually creating them in the mutate statement—something along the lines of:

df %>% mutate(sum15 = rowSums(across(ends_with("_15"))),
              sum16 = rowSums(across(ends_with("_16"))),
              sum17 = rowSums(across(ends_with("_17"))),
              )

This is obviously not what I'm looking for since this is basically a more manual way of doing what I'm already doing with the for loop.

Could anyone offer any suggestions on how to improve this code and avoid the for loop?

Thank you so much!


Solution

  • Another way would be:

    df %>%
       pivot_longer(-AgeGroup, names_pattern = "(\\d+)")%>%
       pivot_wider(values_fn = sum, names_prefix = 'Sum_')
    
    # A tibble: 5 × 4
      AgeGroup Sum_15 Sum_16 Sum_17
      <chr>     <int>  <int>  <int>
    1 A           122    123    136
    2 B           108    116    113
    3 C           115    115    106
    4 D           125    132    143
    5 E           122    106    110
    

    And then you can join to the original df


    If you do not have an idea about the names_pattern you can use names_sep:

    df %>%
       pivot_longer(-AgeGroup, names_to = c(NA, 'name'), names_sep = "_")%>%
       pivot_wider(values_fn = sum, names_prefix = 'Sum_')
    # A tibble: 5 × 4
      AgeGroup Sum_15 Sum_16 Sum_17
      <chr>     <int>  <int>  <int>
    1 A           122    123    136
    2 B           108    116    113
    3 C           115    115    106
    4 D           125    132    143
    5 E           122    106    110
    

    in base R you could do:

    sapply(split.default(df[-1], sub(".*_", "Sum_", names(df)[-1])), rowSums)
         Sum_15 Sum_16 Sum_17
    [1,]    122    123    136
    [2,]    108    116    113
    [3,]    115    115    106
    [4,]    125    132    143
    [5,]    122    106    110
    

    Which you could cbind to the original dataframe ie

    cbind(df, sapply(split.default(df[-1], sub(".*_", "Sum_", names(df)[-1])), rowSums))