I have the following problem: I have a dataframe with several columns. (See below) I am trying to fill in missing values. Concretely, I only want to fill in values when I have a datapoint before and one after the missing value and when they are equal. Note, I also have different IDs, so I want to do this for each ID. In the example below for example I would like to fill in a 1 for row 2 in year 2017 and in the penultimate row for ID 2 I would like to fill in a 2 (in both cases I have the same value "surrounding" the missing value). I do not want to fill in the last rows' value.
ID | Year | value |
---|---|---|
1 | 2016 | 1 |
1 | 2017 | -8 |
1 | 2018 | 1 |
2 | 2016 | -8 |
2 | 2017 | 2 |
2 | 2018 | 2 |
2 | 2019 | - 8 |
2 | 2020 | 2 |
3 | 2017 | 4 |
3 | 2018 | 4 |
3 | 2019 | -9 |
I am somewhat clueless as to how to approach this problem. I have tried using group_by but I don't see a clear way to do it.
Thank you so much for the help!
Using an if_else
and lead
and lag
you could do:
library(dplyr, w = FALSE)
dat |>
group_by(ID) |>
mutate(value = if_else(value < 0 &
(lag(value, default = "-99") == lead(value, default = "-999")),
lag(value), value
)) |>
ungroup()
#> # A tibble: 11 × 3
#> ID Year value
#> <int> <int> <chr>
#> 1 1 2016 1
#> 2 1 2017 1
#> 3 1 2018 1
#> 4 2 2016 -8
#> 5 2 2017 2
#> 6 2 2018 2
#> 7 2 2019 2
#> 8 2 2020 2
#> 9 3 2017 4
#> 10 3 2018 4
#> 11 3 2019 -9
Note 1: Instead of the default NA
I used a -99
and a -999
as the default
for lag
and lead
. Otherwise the if_else
would replace some values with an NA
.
Note 2:: In row 7 of your example data the missing value was a - 8
. I assumed that this was a typo and replaced it by a -8
.
DATA
dat <- data.frame(
stringsAsFactors = FALSE,
ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
Year = c(
2016L, 2017L, 2018L, 2016L,
2017L, 2018L, 2019L, 2020L, 2017L, 2018L, 2019L
),
value = c(
"1", "-8", "1", "-8", "2", "2",
"-8", "2", "4", "4", "-9"
)
)