rdata.tablenalocf

Bridge the last and next non-NA value with intermediate values that grow evenly


What would be a good way to fill the missing NAs in a dataframe column with intermediate values that grow gradually from the last non-NA value to the next non-NA value?

Here is an example: for the column cost, I would like to obtain the column cost_esti where the cost increase by $31 each year between 2014 and 2016, bridging the last known cost of $595 to the next known cost of $720

enter image description here

The code I came up with is lengthy. Is there an elegant way to do the same?

library(data.table)
data = data.table(year=2000:2018,
                  cost = c(100,120,NA,200,220,NA,NA,300,350,470,500,NA,NA,595,NA,NA,NA,720,800))

data[,cost_nas:=as.numeric(is.na(cost))]

## consecutive nas so far for each row:
data[, consecutive_nas_so_far := seq_len(.N), by=rleid(cost_nas)]
data[cost_nas==0,consecutive_nas_so_far:=0]

# total number of consecutive nas in the sequence
data[,total_number_of_consec_nas:=ifelse(consecutive_nas_so_far>0&shift(consecutive_nas_so_far,1,type = "lead")==0,consecutive_nas_so_far,NA)]
data[cost_nas==0,total_number_of_consec_nas:=0]
data[,total_number_of_consec_nas:=zoo::na.locf(total_number_of_consec_nas,fromLast=T)]

#get last and next known values for cost:
data[,cost_previous:=zoo::na.locf(cost)]
data[,cost_following:=zoo::na.locf(cost,fromLast=T)]

# apply the formula to calculate the gradual increase from cost_previous to cost_following
data[,cost_esti:=round(consecutive_nas_so_far*(cost_following-cost_previous)/(total_number_of_consec_nas+1)+cost_previous,0)]
data[is.na(cost_esti),cost_esti:=cost]

Solution

  • You can re-write data.table operations using zoo::na.locf and data.table::rleid. Add 2 columns, one each for lastNonNA and nextNonNA using na.locf. rleid will provide you distinct group for continuous NA. Now you can write logic to fill NA using linear between lastNonNA and nextNonNA.

    library(data.table)
    library(zoo)
    #Data
    data = data.table(year=2000:2018,
           cost = c(100,120,NA,200,220,NA,NA,300,350,470,500,NA,NA,595,NA,NA,NA,720,800))
    
    data[,':='(lastNonNA = na.locf(cost, fromLast = FALSE), 
    nextNonNA = na.locf(cost, fromLast = TRUE), Group_NA = rleid(is.na(cost)))][
      ,':='(IDX = 1:.N), by=Group_NA][
      ,':='(cost = ifelse(is.na(cost), lastNonNA + IDX*((nextNonNA - lastNonNA)/(.N+1)),cost)), 
        by=Group_NA][,.(year, cost)]
    
    #    year     cost
    # 1: 2000 100.0000
    # 2: 2001 120.0000
    # 3: 2002 160.0000   #Filled
    # 4: 2003 200.0000
    # 5: 2004 220.0000
    # 6: 2005 246.6667  #Filled 
    # 7: 2006 273.3333  #Filled
    # 8: 2007 300.0000
    # 9: 2008 350.0000
    # 10: 2009 470.0000
    # 11: 2010 500.0000
    # 12: 2011 531.6667 #Filled
    # 13: 2012 563.3333 #Filled
    # 14: 2013 595.0000
    # 15: 2014 626.2500 #Filled 
    # 16: 2015 657.5000 #Filled
    # 17: 2016 688.7500 #Filled
    # 18: 2017 720.0000
    # 19: 2018 800.0000