rimputation

How to impute missing values given a value in another row in R


I need to get the year directly above the null (the earliest file date that has a year value) for each person and then fill in the missing values with that year. I have a method that works using subsetting and assigning values, but it requires looping over the data and with billions of rows this is not feasible. Below there is a mock dataset which shows how the data is structured. Each ID has multiple file dates and may have several years associated with it. For file dates before 3/1/14, the year column is missing.

df <- data.frame(
   ID = c(1,1,1,1,1,1,2,2,2,2),
   file_date = c("12/1/14", "9/1/14", "6/1/14", "3/1/14", "12/1/13", "9/1/13", "9/1/14", "6/1/14", "3/1/14", "12/1/13"),
   year = c(1979, 1965, 1965, 1965, NA, NA, 1982, 1982, 1982, NA)
)

In this data, the first two null values would be filled with 1965 and the last with 1982.


Solution

  • Since you cant use tidyr::fill(), you can try the following approaches using ave.

    If you have access to the zoo package, this is quite simple:

    # zoo::na.locf
    df$Filled_Year_zoo <- ave(df$year, df$ID, FUN = zoo::na.locf0) # thanks to @G. Grothendieck
    
    

    If not, a completely base R approach could be:

    ### Full base R
    df$Filled_Year_base <- ave(df$year, df$ID,
                               FUN = \(x) {
                                 x[which(is.na(x))] <- tail(x[which(!is.na(x))], 1)
                                 x})
    

    Result:

    #    ID file_date year Filled_Year_zoo Filled_Year_base
    # 1   1   12/1/14 1979            1979             1979
    # 2   1    9/1/14 1965            1965             1965
    # 3   1    6/1/14 1965            1965             1965
    # 4   1    3/1/14 1965            1965             1965
    # 5   1   12/1/13   NA            1965             1965
    # 6   1    9/1/13   NA            1965             1965
    # 7   2    9/1/14 1982            1982             1982
    # 8   2    6/1/14 1982            1982             1982
    # 9   2    3/1/14 1982            1982             1982
    # 10  2   12/1/13   NA            1982             1982
    

    (you of course can overwrite the year column, I was just defining new columns for demonstration)