I am working with a set of binary data for months of the year similar to the sample tibble below but with tens of thousands of rows. I am trying to add a column to the existing table that sums each row between the month specified by the column Month
and the final month Dec
. For example, I would sum columns Nov:Dec
= 0 for the first row, May:Dec
= 8 for the second, Jan:Dec
= 11 for the last, etc.
library(tidyverse)
df <- tribble(
~Jan,~Feb,~Mar,~Apr,~May,~Jun,~Jul,~Aug,~Sep,~Oct,~Nov,~Dec,~Month,~Expected,
1,0,0,0,0,0,0,0,0,0,0,0,"Nov",0,
0,1,1,1,1,1,1,1,1,1,1,1,"May",8,
0,0,1,1,1,1,1,1,1,1,1,1,"Aug",5,
0,0,0,1,1,1,1,1,1,1,1,1,"Jan",9,
0,0,0,0,1,1,1,1,1,1,1,1,"Dec",1,
0,0,0,0,0,1,1,1,1,1,1,1,"May",7,
0,0,0,0,0,0,1,1,1,1,1,1,"Jun",6,
0,0,0,0,0,0,0,1,1,1,1,1,"Jul",5,
0,0,0,0,0,0,0,0,1,1,1,1,"Feb",4,
0,0,0,0,0,0,0,0,0,1,1,1,"Oct",3,
0,0,0,0,0,0,0,0,0,0,1,1,"Mar",2,
1,1,1,1,1,1,1,1,1,1,1,0,"Jan",11,
)
I tried using the mutate()
and sum()
functions in a row-wise manner with a dynamic column range based on the value of Month at each respective row n()
. However, only the first row value of Month (Nov) gets used--so, instead of creating a new column with the "Expected" sums shown in df, I get the sum of the column range Nov:Dec
for each row.
test <- df |>
rowwise() |>
mutate(
Actual = sum(c_across(df$Month[n()]:Dec))
)
> test
# A tibble: 12 × 15
# Rowwise:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Month Expected Actual
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
1 1 0 0 0 0 0 0 0 0 0 0 0 Nov 0 0
2 0 1 1 1 1 1 1 1 1 1 1 1 May 8 2
3 0 0 1 1 1 1 1 1 1 1 1 1 Aug 5 2
4 0 0 0 1 1 1 1 1 1 1 1 1 Jan 9 2
5 0 0 0 0 1 1 1 1 1 1 1 1 Dec 1 2
6 0 0 0 0 0 1 1 1 1 1 1 1 May 7 2
7 0 0 0 0 0 0 1 1 1 1 1 1 Jun 6 2
8 0 0 0 0 0 0 0 1 1 1 1 1 Jul 5 2
9 0 0 0 0 0 0 0 0 1 1 1 1 Feb 4 2
10 0 0 0 0 0 0 0 0 0 1 1 1 Oct 3 2
11 0 0 0 0 0 0 0 0 0 0 1 1 Mar 2 2
12 1 1 1 1 1 1 1 1 1 1 1 0 Jan 11 1
I've reviewed other posts with a similar albeit not the same aim (1, 2, 3), but I am still struggling to find a solution. Any help or suggestions would be much appreciated!
You can convert Month
to a ordered factor, and row-sum those columns whose names are greater than or equal to Month
.
library(dplyr)
df %>%
mutate(Month = ordered(Month, levels = month.abb),
Result = rowSums(across(Jan:Dec, ~ ifelse(cur_column() >= Month, .x, 0))))
# # A tibble: 12 × 15
# Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Month Expected Result
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <ord> <dbl> <dbl>
# 1 1 0 0 0 0 0 0 0 0 0 0 0 Nov 0 0
# 2 0 1 1 1 1 1 1 1 1 1 1 1 May 8 8
# 3 0 0 1 1 1 1 1 1 1 1 1 1 Aug 5 5
# 4 0 0 0 1 1 1 1 1 1 1 1 1 Jan 9 9
# 5 0 0 0 0 1 1 1 1 1 1 1 1 Dec 1 1
# 6 0 0 0 0 0 1 1 1 1 1 1 1 May 7 7
# 7 0 0 0 0 0 0 1 1 1 1 1 1 Jun 6 6
# 8 0 0 0 0 0 0 0 1 1 1 1 1 Jul 5 5
# 9 0 0 0 0 0 0 0 0 1 1 1 1 Feb 4 4
# 10 0 0 0 0 0 0 0 0 0 1 1 1 Oct 3 3
# 11 0 0 0 0 0 0 0 0 0 0 1 1 Mar 2 2
# 12 1 1 1 1 1 1 1 1 1 1 1 0 Jan 11 11
Note: You must use ordered()
instead of factor()
here because >=
is not meaningful for ordinary factors.