rdataframedplyrrowsum

Summing rows with a dynamic column range specified by another value in that row


I am working with a set of binary data for months of the year similar to the sample tibble below but with tens of thousands of rows. I am trying to add a column to the existing table that sums each row between the month specified by the column Month and the final month Dec. For example, I would sum columns Nov:Dec = 0 for the first row, May:Dec = 8 for the second, Jan:Dec = 11 for the last, etc.

library(tidyverse)

df <- tribble(
  ~Jan,~Feb,~Mar,~Apr,~May,~Jun,~Jul,~Aug,~Sep,~Oct,~Nov,~Dec,~Month,~Expected,
  1,0,0,0,0,0,0,0,0,0,0,0,"Nov",0,
  0,1,1,1,1,1,1,1,1,1,1,1,"May",8,
  0,0,1,1,1,1,1,1,1,1,1,1,"Aug",5,
  0,0,0,1,1,1,1,1,1,1,1,1,"Jan",9,
  0,0,0,0,1,1,1,1,1,1,1,1,"Dec",1,
  0,0,0,0,0,1,1,1,1,1,1,1,"May",7,
  0,0,0,0,0,0,1,1,1,1,1,1,"Jun",6,
  0,0,0,0,0,0,0,1,1,1,1,1,"Jul",5,
  0,0,0,0,0,0,0,0,1,1,1,1,"Feb",4,
  0,0,0,0,0,0,0,0,0,1,1,1,"Oct",3,
  0,0,0,0,0,0,0,0,0,0,1,1,"Mar",2,
  1,1,1,1,1,1,1,1,1,1,1,0,"Jan",11,
)

I tried using the mutate() and sum() functions in a row-wise manner with a dynamic column range based on the value of Month at each respective row n(). However, only the first row value of Month (Nov) gets used--so, instead of creating a new column with the "Expected" sums shown in df, I get the sum of the column range Nov:Dec for each row.

test <- df |> 
  rowwise() |> 
  mutate(
    Actual = sum(c_across(df$Month[n()]:Dec))
  )

> test
# A tibble: 12 × 15
# Rowwise: 
     Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec Month Expected Actual
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>    <dbl>  <dbl>
 1     1     0     0     0     0     0     0     0     0     0     0     0 Nov          0      0
 2     0     1     1     1     1     1     1     1     1     1     1     1 May          8      2
 3     0     0     1     1     1     1     1     1     1     1     1     1 Aug          5      2
 4     0     0     0     1     1     1     1     1     1     1     1     1 Jan          9      2
 5     0     0     0     0     1     1     1     1     1     1     1     1 Dec          1      2
 6     0     0     0     0     0     1     1     1     1     1     1     1 May          7      2
 7     0     0     0     0     0     0     1     1     1     1     1     1 Jun          6      2
 8     0     0     0     0     0     0     0     1     1     1     1     1 Jul          5      2
 9     0     0     0     0     0     0     0     0     1     1     1     1 Feb          4      2
10     0     0     0     0     0     0     0     0     0     1     1     1 Oct          3      2
11     0     0     0     0     0     0     0     0     0     0     1     1 Mar          2      2
12     1     1     1     1     1     1     1     1     1     1     1     0 Jan         11      1

I've reviewed other posts with a similar albeit not the same aim (1, 2, 3), but I am still struggling to find a solution. Any help or suggestions would be much appreciated!


Solution

  • You can convert Month to a ordered factor, and row-sum those columns whose names are greater than or equal to Month.

    library(dplyr)
    
    df %>%
      mutate(Month = ordered(Month, levels = month.abb),
             Result = rowSums(across(Jan:Dec, ~ ifelse(cur_column() >= Month, .x, 0))))
    
    # # A tibble: 12 × 15
    #      Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec Month Expected Result
    #    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <ord>    <dbl>  <dbl>
    #  1     1     0     0     0     0     0     0     0     0     0     0     0 Nov          0      0
    #  2     0     1     1     1     1     1     1     1     1     1     1     1 May          8      8
    #  3     0     0     1     1     1     1     1     1     1     1     1     1 Aug          5      5
    #  4     0     0     0     1     1     1     1     1     1     1     1     1 Jan          9      9
    #  5     0     0     0     0     1     1     1     1     1     1     1     1 Dec          1      1
    #  6     0     0     0     0     0     1     1     1     1     1     1     1 May          7      7
    #  7     0     0     0     0     0     0     1     1     1     1     1     1 Jun          6      6
    #  8     0     0     0     0     0     0     0     1     1     1     1     1 Jul          5      5
    #  9     0     0     0     0     0     0     0     0     1     1     1     1 Feb          4      4
    # 10     0     0     0     0     0     0     0     0     0     1     1     1 Oct          3      3
    # 11     0     0     0     0     0     0     0     0     0     0     1     1 Mar          2      2
    # 12     1     1     1     1     1     1     1     1     1     1     1     0 Jan         11     11
    

    Note: You must use ordered() instead of factor() here because >= is not meaningful for ordinary factors.