rmissing-dataimputationlocf

How to replace NA (missing values) in a data frame with neighbouring values


862 2006-05-19 6.241603 5.774208     
863 2006-05-20 NA       NA      
864 2006-05-21 NA       NA      
865 2006-05-22 6.383929 5.906426      
866 2006-05-23 6.782068 6.268758      
867 2006-05-24 6.534616 6.013767      
868 2006-05-25 6.370312 5.856366      
869 2006-05-26 6.225175 5.781617      
870 2006-05-27 NA       NA     

I have a data frame x like above with some NA, which i want to fill using neighboring non-NA values like for 2006-05-20 it will be avg of 19&22

How do it is the question?


Solution

  • Properly formatted your data looks like this

    862 2006-05-19 6.241603 5.774208 
    863 2006-05-20 NA       NA 
    864 2006-05-21 NA       NA 
    865 2006-05-22 6.383929 5.906426 
    866 2006-05-23 6.782068 6.268758 
    867 2006-05-24 6.534616 6.013767 
    868 2006-05-25 6.370312 5.856366 
    869 2006-05-26 6.225175 5.781617 
    870 2006-05-27 NA       NA
    

    and is of a time-series nature. So I would load into an object of class zoo (from the zoo package) as that allows you to pick a number of strategies -- see below. Which one you pick depends on the nature of your data and application. In general, the field of 'figuring missing data out' is called data imputation and there is a rather large literature.

    R> x <- zoo(X[,3:4], order.by=as.Date(X[,2]))
    R> x
                   x     y
    2006-05-19 6.242 5.774
    2006-05-20    NA    NA
    2006-05-21    NA    NA
    2006-05-22 6.384 5.906
    2006-05-23 6.782 6.269
    2006-05-24 6.535 6.014
    2006-05-25 6.370 5.856
    2006-05-26 6.225 5.782
    2006-05-27    NA    NA
    R> na.locf(x)  # last observation carried forward
                   x     y
    2006-05-19 6.242 5.774
    2006-05-20 6.242 5.774
    2006-05-21 6.242 5.774
    2006-05-22 6.384 5.906
    2006-05-23 6.782 6.269
    2006-05-24 6.535 6.014
    2006-05-25 6.370 5.856
    2006-05-26 6.225 5.782
    2006-05-27 6.225 5.782
    R> na.approx(x)  # approximation based on before/after values
                   x     y
    2006-05-19 6.242 5.774
    2006-05-20 6.289 5.818
    2006-05-21 6.336 5.862
    2006-05-22 6.384 5.906
    2006-05-23 6.782 6.269
    2006-05-24 6.535 6.014
    2006-05-25 6.370 5.856
    2006-05-26 6.225 5.782
    R> na.spline(x)   # spline fit ...
                   x     y
    2006-05-19 6.242 5.774
    2006-05-20 5.585 5.159
    2006-05-21 5.797 5.358
    2006-05-22 6.384 5.906
    2006-05-23 6.782 6.269
    2006-05-24 6.535 6.014
    2006-05-25 6.370 5.856
    2006-05-26 6.225 5.782
    2006-05-27 5.973 5.716
    R>