rdata.tabledata-manipulationmissing-dataimputets

impute missing with interpolation by groups


I am trying to impute missing value NA with interpolation by multiple groups.

I just subset a simple example:

Year    ST   CC   ID     MP     PS 
2002    15   3     3     NA    1.5
2003    15   3     3     NA    1.5
2004    15   3     3    193    1.5
2005    15   3     3    193    1.5
2006    15   3     3    348    1.5
2007    15   3     3    388    1.5
2008    15   3     3    388    1.5
1999    53   33    1     NA    3.4 
2000    53   33    1     NA    3.4        
2002    53   33    1     NA    2.9           
2003    53   33    1     NA    2.6           
2004    53   33    1     NA    2.6         
2005    53   33    1    170    3.8         
2006    53   33    1    170    3.0           
2007    53   33    1    330    4.2         
2008    53   33    1    330    5.0           

I used na.approx() but got wrong. It seems that my data has missing value in the first observation within each group.

setDT(dt)[, MP_interpolate := na.approx(MP, na.rm = T), .(Year, ST, CC, ID)]

setDT(dt)[, MP_interpolate := if(length(na.omit(MP))<2) MP else na.approx(MP, na.rm=TRUE), .(Year, ST, CC, ID)]

I also tried package imputeTS but not work.

Both are incorrect. Does it mean using interpolation to impute missing is not a good idea?

I'm not sure which imputation way is better (Do not want to use mean or median). I'm thinking about using PS trend to impute MP. (Just my thought, not question)


Solution

  • You could try imputeTS::na_kalman in an ave. This also extrapolates, what you probably want.

    library(imputeTS)
    
    dt$MP.imp <- with(dt, ave(MP, ST, CC, ID, FUN=na_kalman))
    #    Year ST CC ID  MP  PS   MP.imp
    # 1  2001 15  3  3  NA 1.5 193.0000
    # 2  2002 15  3  3  NA 1.5 193.0000
    # 3  2003 15  3  3  NA 1.5 193.0000
    # 4  2004 15  3  3 193 1.5 193.0000
    # 5  2005 15  3  3 193 1.5 193.0000
    # 6  2006 15  3  3 348 1.5 348.0000
    # 7  2007 15  3  3 388 1.5 388.0000
    # 8  2008 15  3  3 388 1.5 388.0000
    # 9  1999 53 33  1  NA 3.4 170.2034
    # 10 2000 53 33  1  NA 3.4 166.3867
    # 11 2002 53 33  1  NA 2.9 164.4496
    # 12 2003 53 33  1  NA 2.6 165.0018
    # 13 2004 53 33  1  NA 2.6 168.6527
    # 14 2005 53 33  1 170 3.8 170.0000
    # 15 2006 53 33  1 170 3.0 170.0000
    # 16 2007 53 33  1 330 4.2 330.0000
    # 17 2008 53 33  1 330 5.0 330.0000
    

    You probably want to include year in the ave, it wouldn't have worked otherwise in your sample data.