r  missing-data  imputation  imputets

How to fill missing values in a vector with the mean of the values before and after the missing one


I am trying to impute missing values in a vector in R: each NA should be replaced with the mean of the values directly before and after it. Here are three example inputs:

# example one
input_one = c(1,NA,3,4,NA,6,NA,NA)

# example two
input_two = c(NA,NA,3,4,5,6,NA,NA)

# example three
input_three = c(NA,NA,3,4,NA,6,NA,NA)

I started writing code to detect which values can be imputed, but I got stuck with the following:

# incomplete function to detect the values
sapply(split(!is.na(input[c(rbind(which(is.na(c(input)))-1, which(is.na(c(input)))+1))]), 
             rep(1:(length(!is.na(input[c(which(is.na(c(input)))-1, which(is.na(c(input)))+1)]))/2), each = 2)), all)

However, this only detects the NAs that might be imputable, and it only works with example one. It is incomplete and, unfortunately, very hard to read and understand.

Any help with this would be highly appreciated.


Solution

  • We can use dplyr's lag and lead functions for that:

    input_three = c(NA,NA,3,4,NA,6,NA,NA)
    
    library(dplyr)
    ifelse(is.na(input_three) & lead(input_three) > lag(input_three),
           (lag(input_three)  + lead(input_three))/ 2,
           input_three)
    

    Returns:

    [1] NA NA  3  4  5  6 NA NA
    

    Edit

    Explanation:

    We use ifelse, which is the vectorized version of if, i.e. the test is evaluated for each element of the vector. First we test whether an element is NA and whether the following element is greater than the previous one. To get the previous and following elements we can use dplyr's lag and lead functions:

    lag offsets a vector to the right (default is 1 step):

    lag(1:5)
    

    Returns:

    [1] NA  1  2  3  4
    

    lead offsets a vector to the left:

    lead(1:5)
    

    Returns:

    [1]  2  3  4  5 NA
    

    Now to the 'test' clause of ifelse:

    is.na(input_three) & lead(input_three) > lag(input_three)
    

    Which returns:

    [1]    NA    NA FALSE FALSE  TRUE FALSE    NA    NA
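The mix of NA and FALSE in that result follows from how base R treats NA in comparisons and in `&`. A minimal sketch of the rules involved (the variable names are only for illustration):

```r
# Comparisons involving NA are NA: the outcome is unknown.
cmp <- NA > 4                 # NA

# FALSE & NA is FALSE: the result is FALSE whatever the unknown value is.
false_and_na <- FALSE & NA    # FALSE

# TRUE & NA is NA: the result depends on the unknown value.
true_and_na <- TRUE & NA      # NA
```

So a non-missing element gives FALSE (is.na() is FALSE), while a missing element with a missing neighbour gives NA (is.na() is TRUE, but the comparison is NA).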
    

    Then, if the test evaluates to TRUE, we return the sum of the previous and following elements divided by 2; otherwise we return the original element. Where the test is NA, ifelse returns NA.
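One limitation of the test above: because it requires the following element to be greater than the previous one, an NA between non-increasing neighbours (e.g. c(5, NA, 3)) is left as NA. If that is not wanted, a base R variant could check explicitly that both neighbours are present. This is only a sketch, and `impute_neighbour_mean` is a hypothetical helper name, not part of dplyr or of the original answer:

```r
# Hypothetical helper: replace each NA with the mean of its immediate
# neighbours, but only when both neighbours are present.
impute_neighbour_mean <- function(x) {
  n <- length(x)
  prev_val <- c(NA, x[-n])   # value before each position (like dplyr::lag)
  next_val <- c(x[-1], NA)   # value after each position (like dplyr::lead)

  # Positions that are missing and have two non-missing neighbours
  fill <- is.na(x) & !is.na(prev_val) & !is.na(next_val)

  x[fill] <- (prev_val[fill] + next_val[fill]) / 2
  x
}

input_one   <- c(1, NA, 3, 4, NA, 6, NA, NA)
input_two   <- c(NA, NA, 3, 4, 5, 6, NA, NA)
input_three <- c(NA, NA, 3, 4, NA, 6, NA, NA)

impute_neighbour_mean(input_one)    # 1 2 3 4 5 6 NA NA
impute_neighbour_mean(input_two)    # unchanged: no NA has two known neighbours
impute_neighbour_mean(input_three)  # NA NA 3 4 5 6 NA NA
```

Leading and trailing NAs, and runs of consecutive NAs, stay missing because at least one neighbour is unknown.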