rdataframegroup-bydata-scrubbing

Fill missing values in dataframe in R


I have the following problem: I have a dataframe with several columns. (See below) I am trying to fill in missing values. Concretely, I only want to fill in values when I have a datapoint before and one after the missing value and when they are equal. Note, I also have different IDs, so I want to do this for each ID. In the example below for example I would like to fill in a 1 for row 2 in year 2017 and in the penultimate row for ID 2 I would like to fill in a 2 (in both cases I have the same value "surrounding" the missing value). I do not want to fill in the last rows' value.

ID Year value
1 2016 1
1 2017 -8
1 2018 1
2 2016 -8
2 2017 2
2 2018 2
2 2019 - 8
2 2020 2
3 2017 4
3 2018 4
3 2019 -9

I am somewhat clueless as to how to approach this problem. I have tried using group_by but I don't see a clear way to do it.

Thank you so much for the help!


Solution

  • Using an if_else and lead and lag you could do:

    library(dplyr, w = FALSE)
    
    dat |>
      group_by(ID) |>
      mutate(value = if_else(value < 0 &
        (lag(value, default = "-99") == lead(value, default = "-999")),
      lag(value), value
      )) |>
      ungroup()
    #> # A tibble: 11 × 3
    #>       ID  Year value
    #>    <int> <int> <chr>
    #>  1     1  2016 1    
    #>  2     1  2017 1    
    #>  3     1  2018 1    
    #>  4     2  2016 -8   
    #>  5     2  2017 2    
    #>  6     2  2018 2    
    #>  7     2  2019 2    
    #>  8     2  2020 2    
    #>  9     3  2017 4    
    #> 10     3  2018 4    
    #> 11     3  2019 -9
    

    Note 1: Instead of the default NA I used a -99 and a -999 as the default for lag and lead. Otherwise the if_else would replace some values with an NA.

    Note 2:: In row 7 of your example data the missing value was a - 8. I assumed that this was a typo and replaced it by a -8.

    DATA

    dat <- data.frame(
      stringsAsFactors = FALSE,
      ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
      Year = c(
        2016L, 2017L, 2018L, 2016L,
        2017L, 2018L, 2019L, 2020L, 2017L, 2018L, 2019L
      ),
      value = c(
        "1", "-8", "1", "-8", "2", "2",
        "-8", "2", "4", "4", "-9"
      )
    )