I have a dataframe
representing paneldata
with ids and time. For each id and year I like to calculate a cummulative sum. But id and time groups are repeated.
df <- data.frame(id = rep(12345,15),
year = c(rep(2021,5), rep(2022,5), rep(2023,5)),
value = c(rep(5,5), rep(3,5), rep(4,5)))
df
id year value
1 12345 2021 5
2 12345 2021 5
3 12345 2021 5
4 12345 2021 5
5 12345 2021 5
6 12345 2022 3
7 12345 2022 3
8 12345 2022 3
9 12345 2022 3
10 12345 2022 3
11 12345 2023 4
12 12345 2023 4
13 12345 2023 4
14 12345 2023 4
15 12345 2023 4
So instead of this
df %>%
group_by(id, year) %>%
mutate(value_cumsum = cumsum(value))
# A tibble: 15 × 4
# Groups: id, year [3]
id year value value_cumsum
<dbl> <dbl> <dbl> <dbl>
1 12345 2021 5 5
2 12345 2021 5 10
3 12345 2021 5 15
4 12345 2021 5 20
5 12345 2021 5 25
6 12345 2022 3 3
7 12345 2022 3 6
8 12345 2022 3 9
9 12345 2022 3 12
10 12345 2022 3 15
11 12345 2023 4 4
12 12345 2023 4 8
13 12345 2023 4 12
14 12345 2023 4 16
15 12345 2023 4 20
I need value_cumsum
to be 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 12, 12, 12, 12, 12.
you can try
df %>%
mutate(value_cumsum = value * !duplicated(value), .by = c("id", "year")) %>%
mutate(value_cumsum = cumsum(value_cumsum))
or
df %>%
distinct() %>%
mutate(value_cumsum = cumsum(value)) %>%
right_join(df)
which gives
id year value value_cumsum
1 12345 2021 5 5
2 12345 2021 5 5
3 12345 2021 5 5
4 12345 2021 5 5
5 12345 2021 5 5
6 12345 2022 3 8
7 12345 2022 3 8
8 12345 2022 3 8
9 12345 2022 3 8
10 12345 2022 3 8
11 12345 2023 4 12
12 12345 2023 4 12
13 12345 2023 4 12
14 12345 2023 4 12
15 12345 2023 4 12