I have a panel dataset in R, which includes observations per group over time (month). The following dataframe is a snapshot of the complete dataframe:
df <- data.frame(group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),month = c("January", "January", "January", "February", "February", "February", "March", "March", "March", "January", "January", "February", "February", "March", "March"),first_value = c("A","BC","D", NA,NA,NA, "D","G","H", "K","L", NA,NA, "DE","GH"),second_value = c(1,5,7, NA,NA,NA, 2,3,9, 7,1, NA,NA, 4,4))
The dataset is already arranged by group and time. As you can see, observations ("first_value*"* and *"*second_value") can be completely empty for a group in a given month (here February, but can be any month except the first and the last month for every group). What I want to achieve is that the empty months are filled with the last non-empty previous month within a group.
I want to get the following dataframe:
df_filled <- data.frame(group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),month = c("January", "January", "January", "February", "February", "February", "March", "March", "March", "January", "January", "February", "February", "March", "March"),first_value = c("A","BC","D", "A","BC","D", "D","G","H", "K","L", "K","L", "DE","GH"),second_value = c(1,5,7, 1,5,7, 2,3,9, 7,1, 7,1, 4,4))
Please note that, by construction, the last non-empty previous month always has the same number of observations than the following empty months.
I tried different commands with fill() from the dplyr package and na.locf () from the zoo package but all I achieved was filling down the last row of the last non-empty previous month, so that
df_filled <- data.frame(group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), month = c("January", "January", "January", "February", "February", "February", "March", "March", "March", "January", "January", "February", "February", "March", "March"), first_value = c("A","BC","D", "D","D","D", "D","G","H", "K","L", "L","L", "DE","GH"), second_value = c(1,5,7, 7,7,7, 2,3,9, 7,1, 1,1, 4,4))
Looking forward to your suggestions. thanks.
An approach using row_number
, assuming there are no 2 consecutive month with NA
.
library(dplyr)
df %>%
mutate(n_na = sum(is.na(first_value)), .by = c(group, month)) %>%
mutate(across(ends_with("_value"), ~
if_else(is.na(.x), .x[row_number() - n_na], .x)), .by = group,
n_na = NULL)
group month first_value second_value
1 1 January A 1
2 1 January BC 5
3 1 January D 7
4 1 February A 1
5 1 February BC 5
6 1 February D 7
7 1 March D 2
8 1 March G 3
9 1 March H 9
10 2 January K 7
11 2 January L 1
12 2 February K 7
13 2 February L 1
14 2 March DE 4
15 2 March GH 4
If consecutive NA
months are possible with this approach it takes a bit of group juggling
df %>%
mutate(n_na = sum(is.na(first_value)), .by = c(group, month)) %>%
mutate(n_lag = lag(n_na, default=0), .by = group) %>%
mutate(n_lag = n_na + first(n_lag),
n_na = if_else(n_na != 0, n_lag, n_na),
n_lag = NULL, .by = c(group, month)) %>%
mutate(across(ends_with("_value"), ~
if_else(is.na(.x), .x[row_number() - n_na], .x)), .by = group,
n_na = NULL)